Scott M. Mcdermott

UNIX Systems & Network Administrator
available for contract or salaried positions

PostgreSQL

Filesystem

The PostgreSQL files that back the DBMS data storage are just ordinary files on a typical ext3 filesystem which happens to be housed on a DRBD device. We use a typical filesystem creation procedure:

$ sudo mkfs -t ext3 /dev/drbd0
$ sudo tune2fs -c 0 -i 0 /dev/drbd0

We use a filesystem label as the source device for the mount, so as not to tie to any particular device name; the Cluster Resource Manager can use labels, device names, or UUIDs (same as mount):

$ sudo e2label /dev/drbd0 foofs

If we were to mount this on ~postgres/, it would disallow other parallel instances of PostgreSQL from running on the system. Instead, we make a special directory for the datafiles only (i.e., corresponding to a new $PGDATA) to back our clustered database instance. This allows the usual $PGDATA found at ~postgres/data/ to be used for the parallel instance, while ours has a different name alongside:

$ sudo mkdir ~postgres/clusterdata

Finally, we associate this location with the DRBD device permanently in the system's persistent mount table, taking care that it does not mount automatically at system boot time, and without the atime updates which are known to kill performance with some database workloads; the mount table entry looks like so:

LABEL=foofs /var/lib/pgsql/clusterdata ext3 noauto,noatime 0 0

The database filesystem can now be mounted and will be ready to make a new PostgreSQL instance:

$ sudo mount -L $label

Alternatively, files can be copied from an existing instance as appropriate (which is how we implemented ours, since it was a cutover of the Staging database)

Service Address

Our clustered database will use a "virtual" IP address which "floats" between hosts, depending on which one has the service assigned to it by the cluster at any one time. This IP address is simply an IP alias on the interface assigned to the backside (private) internode segment. Obviously, only one node will have the service IP running on it at any given time.

Although this IP address is a Clustered Resource (and is thus added/removed from the node by the Resource Manager based on its assigned role in the cluster), we still configure the system to know about it with the standard ifup and ifdown scripts. This will allow the admin to take the interface up and down with standard tools, and suits the least-surprise principle when a foreign administrator has to work on the system.

NOTE: ensure that the system init scripts will not start the interface on boot. Only the cluster resource manager should be taking the interface up and down on any kind of automatic basis.

An example interface config file, ifcfg-bond0:cludbif:

DEVICE=bond0:cludbif
ONBOOT=no
ONPARENT=no
IPADDR=10.80.4.100
NETWORK=10.80.4.0
NETMASK=255.255.252.0

Note the following caveats:

  • File name and the device name must match. Although the scripts are supposed to determine one from the other, they don't seem to do so.
  • Interface does not come up on boot, or with its parent interface (see above). Only the Cluster layer should be plumbing this interface automatically.
  • The network must be the same as that of the parent interface. While this is not really a requirement of IP aliasing (doing it manually via ip addr add works just fine on totally separate subnets), it does not seem to be allowed int he cluster manager, and there was some difficulty even using the init script without this (superfluous information had to be specified, like NETWORK where IPADDR and NETMASK or PREFIX alone would suffice). To make our lives easier, we just used the same subnet as the private interface, which works in both the rgmanager resource script (ip.sh), the /sbin/ifup scripts, and is well tested by others.
  • We used an interface label instead of a number. This is not at all necessary, but makes it easier for the administrator to identify the purpose of the interface from its name and in the /sbin/ip command output. Interface labels are supported in the kernel since early 2.6. NOTE: the label name cannot exceed 9 characters, although it's not clear where this limitation comes from at the time of this writing.

The address can now be used for the clustered PostgreSQL instance.

Initialization Scripts

Our PostgreSQL instance will need a separate init script to allow it to be configured separately from any other postgres instances on the machine.

The standard LSB script for postgres can be used almost unchanged; it has a feature whereby it uses its invocation name (i.e. $0) as a filename to source in /etc/sysconfig/pgsql/, to instantiate variables such as $PGDATA and $PGHOME. However, a minor change is necessary for the init script to support the LSB status command properly with multiple databases:

$ sudo sed -i 's/\${PGPORT}.pid/${NAME}.${PGPORT}.pid/g' \
  /etc/init.d/postgresql

Once this is done, a simple symlink to the usual /etc/init.d/postgresql can be used to support our new instance. We name the instance appropriately as postgresql-cluster:

$ cd /etc/init.d
$ sudo ln -s postgresql postgresql-cluster

Otherwise, the PID file is overwritten by every new instances that's created.

XXX TODO the script must be modified so the "stop" command returns 0 if the database is stopped already and a stop was requested. Otherwise, catastrophic failure of the cluster will occur.

XXX TODO just use the stock postgres8 resource agent instead! XXX here's why we don't: that does not support multi-instance on same host

With our clustered data directory as created in earlier sections, we end up with a /etc/sysconfig/pgsql/postgresql-cluster file having these contents:

PGDATA=/var/lib/pgsql/clusterdata
PGLOG=/var/lib/pgsql/pgstartup-cluster.log

Note the separate logfile. We don't want to collide with any others stored in ~postgres/.

As with the Service Address, this is a Clustered Resource, and we aren't going to be starting or stopping the instance automatically (or even manually, outside the cluster's own administrative interface) except during the testing prior to cluster automation. For this reason, we add the init script, but disable its automatic startup:

$ chkconfig --add postgresql-cluster
$ chkconfig postgresql-cluster off

CAUTION: using a symlink for the init script means that if chkconfig is ever called to reconfigure services based on the chkconfig: line in the init script, it will default to the service being on.

Kernel IPC Limits

Before PostgreSQL can start, and in order to support multiple PostgreSQL instances on the same machine, the system's limits on System V Shared Memory resources may have to be increased via the /proc/ interface. These values can be made persistent in /etc/sysctl.conf.

If there is an existing single instance, simply double those values. The key names are:

kernel.shmmax
kernel.shmall
kernel.shmmni
kernel.sem

Before doubling these values during testing, multiple instances of PostgreSQL would not start. Certain values in postgresql.conf (such as allowed number of concurrent connections) influence how these resources are allocated at the time the DBMS starts.

PostgreSQL Configuration

The configuration for a given PostgreSQL instance typically lives in the $PGDATA directory, which in our case is going to be set to the special DRBD-backed filesystem. This allows both the data files and the configuration to be shared automatically between nodes by nature of it being a replicated data store.

As the final step in the procedure to instantiate our new PostgreSQL instance, we must change a few variables in the replicated database instance configuration file ($PGDATA/postgresql.conf):

  • listen_address must be set to the clustered service IP we set up in the Service Address section. This allows either instance to start and bind to this address only. Remember that another instance of PostgreSQL may be running on the machine, so we must use a different listen_address if we don't want to specify a different $PGPORT everywhere.
  • unix_socket_directory must be modified from the usual /tmp. We used a directory in /var/run/ corresponding to the instance name, i.e. /var/run/postgresql-cluster. NOTE: the directory must exist when PostgreSQL starts. The reason we need this: PostgreSQL constructs the socket filename by using only the port number; it cannot support multiple instances of PostgreSQL bound to the same port but different IPs of the machine. Each instance will need to use a separate directory for the UNIX domain sockets, allowing them to have identical filenames without a pathname collision.

PostgreSQL is now configured to run on either node, depending on which the Cluster Resource Manager has decided should own the service. The DRBD layer will expose the block device on the service node, and the Cluster Resource Manager will mount the filesystem and call the PostgreSQL LSB script. For testing, the service should be started as such:

$ sudo env - /sbin/service postgresql-cluster start

This should be attempted on both nodes before moving on to the Cluster Manager configuration.