Distributed Replicated Block Device (DRBD)
Backing Device
On each node, we give DRBD a block device on which to store its replicated blocks, e.g.:
$ sudo mdadm \
    --create /dev/md3 \
    --metadata=1.1 \
    --raid-devices=2 \
    --spare-devices=0 \
    --bitmap=internal \
    --level=1 \
    /dev/sd[ab]1
Note that we have used a version 1.1 md superblock (--metadata=1.1), which has the following notable properties:
- It will not be automatically assembled by the kernel's auto-detection code via partition type fd. That feature of 0.90 superblocks is considered deprecated in favor of mdadm-based auto-assembly. This means that /etc/mdadm.conf should contain either the specifics of the volume to assemble, or e.g. DEVICE partitions so the information is determined at run-time. The Red Hat boot scripts use mdadm -As to assemble arrays in rc.sysinit.
- It supports (as we use here) a write-intent bitmap. This significantly speeds up resyncs, because md knows which regions of the device have been dirtied and can skip the rest.
- It resides at the beginning of the device.
Before proceeding, ensure that the system reboots with a stable /dev name for the device backing DRBD, and that no administrative intervention or udev trickery is required to create the device node. The standard device name in the above example should not require anything special, but variants may.
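A post-reboot check along the following lines can confirm that the array came up on its own. This is only a sketch: the function name md_is_healthy is hypothetical, and it assumes the usual /proc/mdstat layout, in which an array's header line is followed by a status line containing a member map such as [UU].

```shell
# md_is_healthy NAME [MDSTAT_FILE]
# Succeeds only if array NAME is active and its member map (e.g. [UU])
# shows no missing member ("_"). MDSTAT_FILE defaults to /proc/mdstat;
# the optional argument lets the parser be tried on a saved copy.
md_is_healthy() {
    local name=$1 mdstat=${2:-/proc/mdstat}
    awk -v name="$name" '
        $1 == name && $3 == "active" { found = 1; next }
        # First member map after the header decides the result.
        found && /\[[U_]+\]/ { ok = ($0 !~ /\[[U_]*_[U_]*\]/); exit }
        # Reached the next unindented line without seeing a map.
        found && /^[^ \t]/   { exit }
        END { exit !(found && ok) }
    ' "$mdstat"
}

# Example use after a reboot (by hand or from a monitoring hook):
#   [ -b /dev/md3 ] && md_is_healthy md3 && echo "backing device ready"
```

The -b test covers the "stable /dev name" half of the requirement; the mdstat check covers whether both mirror halves were assembled.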
Configuration File
Once the underlying RAID volumes are created on each node, DRBD needs to be set up. The first step is to create the configuration file (/etc/drbd.conf), which should be identical on both replica nodes. The DRBD administrative tools base their behavior on the settings in this file.
DRBD has very sane (and safe) defaults; the configuration file can be quite barebones in our (simple) setup.
While the configuration may change with time, we'll describe some critical stanzas as of this writing:
Specify Synchronous Replication
Specify to use the synchronous replication protocol for all our devices:
protocol C;
This causes all IOs on the DRBD device to block until both nodes have submitted the data to their respective underlying device layers, with whatever semantics the user requested. This means that if the filesystem is asked to flush or issue a write barrier, for example via fdatasync(2), that operation completes on both nodes before the system call returns.
This guarantees an exact copy on the standby node and makes it impossible for acknowledged data to be left uncommitted in the event of a crash.
Set Sync Rate
Increase the default (very slow) sync rate:
rate 110M;
This rate is suitable for filling a gigabit interlink. The default tries to be nice to the network, and is more useful for asynchronous replication, where we lazily copy data in the background and want to be gentle on other traffic. In our case, we couldn't care less; database performance is more important. The writes which comprise the replication traffic (reads are simply served from the local copy) are likely to be blocking user interaction in some way. Since we run fully synchronous, we need to hurry up and complete IO!
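As a rough sanity check on the figure, the arithmetic can be sketched in shell. The function names are hypothetical, and the sketch assumes that the M suffix in drbd.conf means 2^20 bytes per second; gigabit Ethernet carries at most 1000 Mbit/s before protocol overhead.

```shell
# rate_to_mbit RATE_MIB
# Line rate in Mbit/s (rounded down) for a sync rate given in MiB/s.
# Assumes "M" in drbd.conf means 2^20 bytes per second.
rate_to_mbit() {
    echo $(( $1 * 1048576 * 8 / 1000000 ))
}

# full_sync_seconds SIZE_MIB RATE_MIB
# Seconds (rounded up) for a full sync of SIZE_MIB at a sustained RATE_MIB.
full_sync_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

rate_to_mbit 110               # prints 922
full_sync_seconds 512000 110   # prints 4655 (about 78 minutes for 500 GiB)
```

Under that assumption, 110M sits just below what a gigabit link can carry, which is the intent of the setting.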
Specify Metadisk
Use internal metadisk:
meta-disk internal;
This keeps the DRBD metadata at the end of the underlying volume. NOTE: if LVM is used on top of the backing MD device (to facilitate snapshots, for example), we'd want to use the flexible-meta-disk directive instead, which stores the metadata on a special companion LV. This way the data LV could be snapshotted and accessed underneath DRBD without having to be wary of the "extra" information at the end of the disk, which wouldn't normally be visible when accessing through the DRBD layer.
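Because internal metadata occupies the tail of the backing device, the DRBD device ends up slightly smaller than /dev/md3. The DRBD User's Guide for the 8.x series gives an estimate of ceil(Cs / 2^18) * 8 + 72 sectors, where Cs is the backing device size in 512-byte sectors. The sketch below applies that estimate; the function name is hypothetical, and the formula should be checked against the guide for the version in use.

```shell
# drbd_md_sectors DEVICE_SECTORS
# Estimated internal metadata size, in 512-byte sectors, for a backing
# device of DEVICE_SECTORS sectors: ceil(Cs / 2^18) * 8 + 72.
# (Estimate as quoted from the DRBD 8.x User's Guide; not authoritative.)
drbd_md_sectors() {
    echo $(( ($1 + 262143) / 262144 * 8 + 72 ))
}

# The backing device size in sectors can be read with:
#   blockdev --getsz /dev/md3
drbd_md_sectors 2147483648   # 1 TiB device: prints 65608 (about 32 MiB)
```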
Configure Split Brain
Configure the automatic split-brain recovery policies and handlers:
after-sb-0pri discard-least-changes;
after-sb-1pri call-pri-lost-after-sb;
after-sb-2pri call-pri-lost-after-sb;
rr-conflict call-pri-lost;
pri-on-incon-degr "/usr/local/sbin/sync-unmount-reboot";
pri-lost-after-sb "/usr/local/sbin/sync-unmount-reboot";
pri-lost "/usr/local/sbin/sync-unmount-reboot";
These follow a policy where the node with least changes is always discarded over the node with more changes, and the discard follows a "make secondary by reboot" policy. After reboot, the now-secondary node will reconnect and get updated blocks from the node that won the split brain algorithm discard-least-changes.
The script in question just does a shell-programmed sync, umount, reboot sysrq combo via the kernel procfs interface on the losing node, regardless of whether it had promoted during the split, discarding any changes that had occurred in the interim:
#!/bin/bash
#
# Perform a forced reboot as safely as possible
# (contrast with reboot -f)
#
# Notify root; take the body from /dev/null so mail never waits on stdin.
mail -s drbd:forced-reboot root < /dev/null
ops=(
    s   # sync    flush any dirty buffers immediately
    u   # umount  remount all filesystems read-only (keeps md devs clean)
    b   # reboot  initiate hardware reboot
)
# Enable all sysrq functions, then issue each one with a pause in between.
echo 1 > /proc/sys/kernel/sysrq
for op in "${ops[@]}"
do
    echo "$op" > /proc/sysrq-trigger
    sleep 5
done
Doing this rather than a reboot -f has the additional advantage of leaving the md devices clean, which avoids a lengthy RAID resync after the reboot (during which the array would be vulnerable to a single-disk failure).
More on the rationale behind our split-brain handling and why we have chosen to continue allowing writes during split-brain can be found in companion documents.
Fencing Policy
Set the fencing policy:
fencing dont-care;
We don't need to do IO fencing in our setup. Even if we were to do it, we would use the cluster layer, but we don't do it there either.
Why we can do this safely is covered in companion documents.
IO Error Handling
Configure the IO error handling policy:
on-io-error call-local-io-error;
DRBD can be set to go "diskless" in the event of a failure: it marks the local disk bad and stops issuing IO to it, transparently serving blocks from the remote node instead. Rather than do this, we just call a handler which notifies an operator, and then disconnect, which will cause the service to fail. The cluster layer should then fail over to the other node, which will have a good copy of our data.
Notification Handlers
Set up notification handlers:
local-io-error "/bin/mail -s drbd:local-io-error root";
before-resync-target "/bin/mail -s drbd:before-resync-target root";
after-resync-target "/bin/mail -s drbd:after-resync-target root";
split-brain "/bin/mail -s drbd:split-brain root";
These events can do more things, but we just want to notify an administrator.
Note that automatic split-brain handling is specified separately (above); the handler here is called only for manual split-brain, i.e. if no automatic policy is specified, or something went very wrong (which, of course, never happens with computers).
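Putting the stanzas above together, the file might be laid out roughly as follows. This is a sketch, not our production file: the section placement follows DRBD 8.3-style syntax as we understand it, and the host names nodea and nodeb, the addresses (from the 192.0.2.0/24 documentation range), the port and the /dev/drbd0 minor are all placeholders. The shell wrapper (a hypothetical helper) only writes the text to a scratch file so it can be inspected before anything is installed as /etc/drbd.conf.

```shell
# write_conf_sketch FILE
# Write an example drbd.conf to FILE. Host names, addresses, port and
# minor number are placeholders; directive values are those described above.
write_conf_sketch() {
    cat > "$1" <<'EOF'
common {
    protocol C;

    syncer {
        rate 110M;
    }
    net {
        after-sb-0pri discard-least-changes;
        after-sb-1pri call-pri-lost-after-sb;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict   call-pri-lost;
    }
    disk {
        fencing     dont-care;
        on-io-error call-local-io-error;
    }
    handlers {
        pri-on-incon-degr    "/usr/local/sbin/sync-unmount-reboot";
        pri-lost-after-sb    "/usr/local/sbin/sync-unmount-reboot";
        pri-lost             "/usr/local/sbin/sync-unmount-reboot";
        local-io-error       "/bin/mail -s drbd:local-io-error root";
        before-resync-target "/bin/mail -s drbd:before-resync-target root";
        after-resync-target  "/bin/mail -s drbd:after-resync-target root";
        split-brain          "/bin/mail -s drbd:split-brain root";
    }
}

resource foodev {
    on nodea {                      # placeholder; must match `uname -n`
        device    /dev/drbd0;
        disk      /dev/md3;
        address   192.0.2.1:7788;
        meta-disk internal;
    }
    on nodeb {                      # placeholder; must match `uname -n`
        device    /dev/drbd0;
        disk      /dev/md3;
        address   192.0.2.2:7788;
        meta-disk internal;
    }
}
EOF
}

# Example: write it out, then have drbdadm parse it without touching devices.
#   write_conf_sketch ./drbd.conf.sketch
#   drbdadm -c ./drbd.conf.sketch dump
```

Keeping the shared settings in the common section means a second resource would inherit them without repetition; whether that placement is accepted should be confirmed with drbdadm dump on the installed version.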
Replicated Device
Now that DRBD is configured, we can create the replicated volume on which to put the filesystem for the DBMS to store its datafiles.
DRBD provides two levels of shell access to its ioctl(2) interface. Depending on how hosed things are, it may sometimes be necessary to use the lower level tools drbdsetup(8) and drbdmeta(8), but the simple case requires only the use of the wrapper drbdadm(8) which invokes these tools on behalf of the user and hides their more complicated argument structure for simple tasks like volume creation.
Before we begin, we need to make sure the kernel side is available:
$ sudo modprobe drbd
Normally the init scripts take care of this (i.e. service drbd start), but they will have some difficulty here: the configuration file is in place, but our volume has not yet been created.
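To confirm that the module actually loaded, the kernel's module list can be checked. A minimal sketch with a hypothetical function name follows; it assumes DRBD is built as a loadable module, since a built-in driver would not appear in /proc/modules.

```shell
# drbd_module_loaded [MODULES_FILE]
# Succeeds if the drbd module appears in the kernel's module list.
# MODULES_FILE defaults to /proc/modules.
drbd_module_loaded() {
    grep -q '^drbd ' "${1:-/proc/modules}"
}

# Example: load the module only when it is missing.
#   drbd_module_loaded || sudo modprobe drbd
```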
To create the volume, we must create the metadata, using the resource name as specified in drbd.conf, which we'll call foodev in this document:
$ sudo drbdadm create-md foodev
This will write metadata corresponding to the configuration in drbd.conf to the disk specified in the on section for this hostname, in the format or to the device specified by the meta-disk directive. It must be done on both nodes.
Next we bring the device up (do on both nodes):
$ sudo drbdadm up foodev
The nodes will be brought up in secondary state, connected to each other. Choose a node (either) to be the primary, and promote it, while overwriting the data on the other node:
$ sudo drbdadm -- --overwrite-data-of-peer primary foodev
The calling convention of drbdadm is strange: options before the -- are given to drbdadm itself, whereas those after are given to the wrapped program used to implement the command, i.e. drbdsetup or drbdmeta. (The value of drbdadm itself is debatable, but it's used in all the documentation, so we retain that convention here.)
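The convention is easier to see with drbdadm's dry-run option (-d), which prints the lower-level command instead of running it. The wrapper below is a sketch; preview_promote and the DRBDADM override variable are hypothetical names introduced here, not part of DRBD.

```shell
# preview_promote RESOURCE
# Ask drbdadm (-d, dry run) to print the low-level command it would run
# for the forced promotion, without executing it. DRBDADM is a
# hypothetical override for pointing at a different binary.
preview_promote() {
    "${DRBDADM:-drbdadm}" -d -- --overwrite-data-of-peer primary "$1"
}

# Example:
#   preview_promote foodev
```

Here -d sits before the -- and is consumed by drbdadm, while --overwrite-data-of-peer sits after it and is handed to the wrapped tool.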
That's it! The device has been created. The nodes will now do a full sync at whatever rate was specified in the syncer section of drbd.conf.
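If a script needs to wait for that initial sync before continuing, it can poll the disk state that drbdadm dstate reports (local/peer). The following is a minimal sketch: wait_for_sync and the DRBDADM override variable are hypothetical names, and the loop has no timeout, so a stalled sync would keep it waiting.

```shell
# wait_for_sync RESOURCE [INTERVAL_SECONDS]
# Poll `drbdadm dstate` until both sides report UpToDate.
# Returns non-zero as soon as drbdadm itself fails.
# DRBDADM is a hypothetical override for pointing at a different binary.
wait_for_sync() {
    local res=$1 interval=${2:-10} state
    while :; do
        state=$("${DRBDADM:-drbdadm}" dstate "$res") || return 1
        [ "$state" = "UpToDate/UpToDate" ] && return 0
        sleep "$interval"
    done
}

# Example: create the filesystem only once the peer has a full copy.
#   wait_for_sync foodev 30 && echo "foodev is in sync"
```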
Maintenance procedures, such as how to display DRBD status information or promote and demote nodes, are covered in companion documents.
NOTE: if any subsequent changes are made to the drbd.conf file, it will be necessary to inform DRBD. It does not poll its configuration file for changes, and there is no "daemon" to do so in any case; DRBD runs entirely in the kernel. See companion documents for more details on how to change the DRBD configuration while in operation.
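For simple changes, the usual mechanism is drbdadm adjust, which brings the running device in line with the configuration file. A hedged sketch follows; reload_resource and the DRBDADM override variable are hypothetical names, and the edited drbd.conf is assumed to have been copied to both nodes already.

```shell
# reload_resource RESOURCE
# Show what `drbdadm adjust` would do (dry run), then apply it.
# DRBDADM is a hypothetical override for pointing at a different binary.
reload_resource() {
    local res=$1 drbdadm=${DRBDADM:-drbdadm}
    "$drbdadm" -d adjust "$res" || return 1   # preview the low-level commands
    "$drbdadm" adjust "$res"                  # apply the changes
}

# Example (run on each node after editing drbd.conf):
#   reload_resource foodev
```

In an interactive session it is safer to run the dry-run step on its own and read its output before applying anything.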