Scott M. Mcdermott

UNIX Systems & Network Administrator

Cluster Manager

Components

The Cluster is a single logical entity composed of Cluster Nodes, which host Clustered Resources, on which Clustered Services are built and provided to users. Each Clustered Service is built on a collection of Resources which any Cluster Node can provide. In our two-node PostgreSQL Cluster, only one node provides database service at a time, but either node can provide it.

The RedHat Cluster Suite Overview, documentation provided by the vendor, describes the Cluster Component relationships in detail, with diagrams.

The software packages used to provide each layer are:

openais Open Application Interface Specification is an abstraction of a platform on which Clustered Applications can run. This package contains daemons for internode communication between Cluster Members. OpenAIS (i.e. aisexec) is the same communications implementation used by Pacemaker (i.e. Heartbeat v3), which is the SuSE Linux Cluster Resource Manager. It provides Node Membership, Availability, and Events such as Cluster Joins and Leaves, using IP Multicast.

cman Cluster Manager is, not surprisingly, the Cluster Management framework. It provides the "glue" between the notion of Cluster Membership (i.e. that part implemented by openais) and the Clustered Resources that might run on a Cluster Member to provide a Clustered Service, with Resources managed by the Resource Manager, rgmanager.

rgmanager Resource Manager deals in Resources to provide Services on a Cluster Node. This is the component that does service health checks and will implement the commands necessary to activate Resources on a particular Cluster Node which are required to provide a Clustered Service. This software is RedHat's implementation of Resource Management; compare to the SuSE package called Pacemaker for a different implementation.

ccs Cluster Configuration System is the component that provides a uniform, cluster-wide Configuration Tree. It maintains a hierarchical data structure in an XML based configuration file, changes to which are distributed between nodes. The XML tree can be accessed with a library interface from different languages, and via shell tools. Bindings are provided for at least C and Python, probably because these are used to implement Resource Management scripts and Fencing Agents, which store their data in the CCS (Fencing Agents being more relevant to physical shared storage used to implement clustered filesystems like GFS).

These components and their interworkings form a RedHat Cluster, which we use to provide Clustered DBMS service.

Initial Configuration

CCS (implemented with the daemon ccsd and its access libraries, e.g. libccs) stores its configuration in the XML file /etc/cluster/cluster.conf, an identical copy of which exists on every node. CCS parses the configuration file and makes its data available to the other cluster components and to the administrator.

ccs_tool will create the initial cluster configuration file to bootstrap the cluster membership; in particular, it can configure all the cman components of the cluster, and get it running. It will not configure resources, but is useful to get the initial version of the XML file prior to editing it by other means.

ccs_tool normally interfaces with ccsd, which will not be running if the cluster is not yet operational. The -C flag (shown in the procedure below) stops it from doing this by editing only the local copy of configuration in /etc/cluster/cluster.conf, and making no attempt to contact ccsd.

When creating a two-node cluster, we have to set the special two_node=1 and expected_votes=1 attributes, because quorum would otherwise require both nodes to be present (see the discussion in the cman manual page). The -2 flag makes ccs_tool create the corresponding XML stanza:

$ sudo ccs_tool create -2 foocluster
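
To see why the special attributes are needed, the majority rule can be sketched in a few lines of Python. This is only an illustration: it assumes the usual majority formula (quorum = expected_votes / 2 + 1, rounded down) and one vote per node, and the function names are ours, not part of cman.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a simple majority (assumed formula, see text above)."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True when the votes currently present reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Ordinary two-node cluster: one surviving node is not a majority of two.
    print(is_quorate(votes_present=1, expected_votes=2))  # False
    # With expected_votes=1 (set alongside two_node=1), one node suffices.
    print(is_quorate(votes_present=1, expected_votes=1))  # True
```

Without the override, losing either node of a two-node cluster would leave the survivor inquorate, which defeats the purpose of having a standby.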

Disabling IPv6

If running IPv4 only, long delays will be added to cluster startup if IPv6 is enabled but not working properly. Before proceeding:

  1. IPv6 should be disabled (i.e. by specifying /bin/true as the install method for ipv6 in /etc/modprobe.conf); and
  2. ccsd should be told not even to attempt the use of IPv6, with the line CCSD_OPTS='-4' in /etc/sysconfig/cman. Note that a manual start of ccsd would require the -4 option directly on the command line; normally, /etc/init.d/cman is used to start ccsd.

Someday, IPv6 may be commonplace on small private server segments, at which point trying v6 first will make sense. Please let your grandchildren know that they'll need to remove -4 when that happens.

Adding Nodes

To proceed and add nodes, note one caveat with this tool: defining cluster members requires that they have fencing agents configured; it is a mandatory part of their instantiation via ccs_tool. This would be useful for a GFS cluster, but we are not using fencing in our cluster; see companion documents for a discussion.

Since we must define a fence agent, we just use the simple manual one (see fence_manual man page), which is useful during testing in any case:

$ sudo ccs_tool addfence -C human manual

Finally, we are able to add our nodes. Take note that the cluster node name should be identical to the node name as determined by gethostbyname(3). The value returned should map forward and backwards with gethostbyaddr(3). IMPORTANT: this address will be used for all cluster communications, so it must map to the private interface on the host node.
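
The forward and reverse mapping can be checked before adding the nodes. The following is a minimal sketch using Python's standard socket module, which wraps the same resolver calls; the function name and the injectable resolver parameters are our own, added so the check can be tried without touching real DNS or /etc/hosts.

```python
import socket


def resolves_consistently(node_name, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, address); ok is True when name -> address -> name round-trips.

    `forward` and `reverse` default to the real resolver calls and are
    parameters only so that the logic can be exercised with stand-ins.
    """
    address = forward(node_name)
    canonical, aliases, _addresses = reverse(address)
    return node_name in [canonical] + list(aliases), address
```

Calling resolves_consistently("foonode1.fqdn") on each node returns the address the name maps to, which still has to be compared by eye against the private interface; the sketch only checks that the two lookups agree. Lookup failures surface as OSError subclasses (socket.gaierror, socket.herror).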

Add the nodes to the configuration file like so:

$ sudo ccs_tool addnode -C -n 1 -f human foonode1.fqdn
$ sudo ccs_tool addnode -C -n 2 -f human foonode2.fqdn

The result of these ccs_tool operations is a valid configuration file which can be used to bootstrap the cluster (shown in a later section).

Tidying

While ccs_tool(8) is good for creating the initial configuration file, there is a regrettable caveat: it cannot configure the rgmanager components, so it cannot produce a complete cluster configuration. With it alone we can define and run a cluster which, unfortunately, cannot be configured to provide services.

For this reason, we must now switch to hand-editing the configuration file. Before proceeding, it is useful to convert the ccs_tool generated /etc/cluster/cluster.conf to make it more human readable:

$ sudo xmltidy /etc/cluster/cluster.conf

This will make it easier to hand-edit the file from now on. For a discussion of caveats with hand-edits, see the notes in companion documents for performing maintenance on the cluster.
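
If no XML tidying tool is at hand, a rough substitute can be sketched with Python's standard library. This is not the tool used above, just an assumption-laden stand-in: it re-indents with two spaces and drops blank lines, and the function name is ours.

```python
import xml.dom.minidom


def tidy_xml(text: str) -> str:
    """Re-indent an XML document with two-space indentation."""
    dom = xml.dom.minidom.parseString(text)
    pretty = dom.toprettyxml(indent="  ")
    # minidom emits whitespace-only lines for existing indentation; drop them.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


if __name__ == "__main__":
    sample = '<cluster name="foocluster" config_version="1"><clusternodes/></cluster>'
    print(tidy_xml(sample))
```

Write the result to a scratch file and diff it against the original before replacing /etc/cluster/cluster.conf.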

Disabling Fencing

Although we were forced to define fencing agents to add nodes with ccs_tool(8), we are not using fencing; a discussion of the rationale for that is found in companion documents.

As configured, fencing will simply freeze the cluster until fence_ack_manual(8) is called. To prevent it from interfering with normal cluster operation, we configure fenced(8) and cman(5) not to do any kind of fencing.

Telling the cluster that any members not found at start don't have to be fenced: Normally, when a cluster node becomes operational, it verifies that all configured cluster members are present and attempts to fence any that aren't. The reason for this is presented in detail in the fenced(8) manual page. The idea is that the cluster is either (1) all started at once, or (2) already running. There are some subtle race conditions (described in the fenced(8) manual page's "domain startup" section) which require this behavior, but all it does for us is cause problems.

The nodes can be configured to delay their fencing operations (and wait for members to join, i.e. during the initial cluster bootstrap) by setting the post_join_delay attribute in the fence_daemon element of the cluster.conf file. However, a simpler way is to set clean_start="1" in the fence_daemon element.

Lastly, do not join the fence domain on startup.

This obviates the need to really care about fencing. The LSB script supports this with FENCE_JOIN=no in /etc/sysconfig/cman (alongside the CCSD_OPTS='-4' setting already mentioned for disabling IPv6).
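
Both settings live in the same file, so a quick sanity check is easy to script. The sketch below parses simple KEY=value lines and reports what is missing; it assumes the file contains only plain assignments (no shell logic), and the function names are hypothetical.

```python
import shlex


def parse_sysconfig(text: str) -> dict:
    """Parse simple KEY=value lines (shell quoting allowed) into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex removes '...' / "..." quoting and any trailing "# comment".
        tokens = shlex.split(value, comments=True)
        settings[key.strip()] = " ".join(tokens)
    return settings


def check_cman_sysconfig(text: str) -> list:
    """Return a list of problems; an empty list means both settings are present."""
    settings = parse_sysconfig(text)
    problems = []
    if settings.get("FENCE_JOIN") != "no":
        problems.append("FENCE_JOIN is not 'no'")
    if "-4" not in settings.get("CCSD_OPTS", "").split():
        problems.append("CCSD_OPTS does not include -4")
    return problems
```

Feeding it the contents of /etc/sysconfig/cman from each node catches the case where one node was configured and the other forgotten.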

Cluster Bootstrap

We now have everything set to start the cluster, and a master copy of the cluster.conf that will be used on the cluster nodes. It must be copied to all nodes prior to the first startup of ccsd.

Once distributed, we can start the cluster with the cman init script on all nodes:

$ sudo env - /sbin/service cman start

Use of the status commands, verification of operational status, and fixing errors using debugging with syslog, are described in companion documents.

STOP HERE. NOTE: the cluster should be running before proceeding with resource configuration.

About Resource Agents

Having the Cluster Nodes in operation is useless in and of itself; before users can access our Clustered Database service, its Resources must be defined and started. This is done by the Resource Manager rgmanager(8).

The Resource Manager deals with the Resources required to provide Services by way of Resource Agents. These are shell scripts (or really, any executables which follow a standard API defined in the Open Cluster Framework (OCF)). A discussion of OCF is out of scope for this document, and these scripts will not be dealt with directly, but a few important things to note:

  • resource agent scripts are in /usr/share/cluster/
  • many standard agents are provided with rgmanager
  • the drbd package installs its own resource agent
  • all other agents we use are provided by rgmanager
  • the service agent provides encapsulation, nesting

Each OCF Resource Agent receives its parameters (when called by rgmanager) through variables in the environment vector, as defined in the OCF standard. A single argument is also given to the Agent script when invoked, passed via the argument vector, which serves as the "action" the script is to perform. Actions are limited, but essentially the same as those defined in the LSB for init scripts, i.e. the standard System V "start", "stop" and "status" commands.
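
To make the calling convention concrete, here is a toy agent skeleton. It is a sketch under assumptions, not one of the shipped agents: we assume parameters arrive in environment variables prefixed OCF_RESKEY_ and that the exit codes follow the common OCF convention (0 success, 1 generic error, 3 unimplemented); the "resource" is just a dictionary entry.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3


def run_agent(action, env, state):
    """Dispatch one action; `state` is a dict standing in for a real resource."""
    # Assumed convention: parameter "name" arrives as OCF_RESKEY_name.
    name = env.get("OCF_RESKEY_name", "unnamed")
    if action == "start":
        state[name] = "running"
        return OCF_SUCCESS
    if action == "stop":
        state.pop(name, None)  # stopping an already-stopped resource is fine
        return OCF_SUCCESS
    if action == "status":
        return OCF_SUCCESS if state.get(name) == "running" else OCF_ERR_GENERIC
    return OCF_ERR_UNIMPLEMENTED


def main(argv, env):
    """Entry point: argv[1] is the action, env carries the parameters."""
    action = argv[1] if len(argv) > 1 else ""
    return run_agent(action, env, state={})
```

A real script would end with sys.exit(main(sys.argv, os.environ)) and would inspect an actual device, mount or process rather than a dictionary.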

As far as the Cluster and rgmanager are concerned, each Resource is defined in the cluster.conf file in the <rm> element (i.e. "'r'esource 'm'anager") and is available via CCS; the agents themselves are represented by XML elements bearing their own name, with XML attributes that correspond to the parameters which will be passed into the script by rgmanager, via the environment variables defined in the OCF API.

Finally, metadata used by CCS -- to define what parameters can be configured for a resource -- are found in separate metadata files (NOTE: this is not part of OCF; it is particular to the RH implementation). Those files are also found either in /usr/share/cluster/ (in files with .metadata extension), or sometimes embedded in the scripts themselves, and emitted when a meta-data action argument is given to the script (instead of e.g. stop or start).

Example, for an IP address resource:

  • resource agent in /usr/share/cluster/ip.sh
  • configuration in /etc/cluster/cluster.conf via CCS
  • XPath to config would be //cluster/rm/ip
  • parameters would be in //cluster/rm/ip/@attribute
  • CCS metadata in agent script, i.e. ip.sh meta-data
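
The path-to-attribute mapping in this example can be tried out with Python's standard ElementTree module. This sketch reads the XML text directly rather than going through CCS, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<cluster name="dbcluster" config_version="1">
  <rm>
    <ip address="10.80.4.100/22" monitor_link="1"/>
  </rm>
</cluster>
"""


def ip_resources(conf_text: str) -> list:
    """Return the attribute dict of every <ip> element directly under <rm>."""
    root = ET.fromstring(conf_text)  # root is the <cluster> element
    return [dict(ip.attrib) for ip in root.findall("./rm/ip")]


if __name__ == "__main__":
    for attrs in ip_resources(SAMPLE):
        print(attrs)
```

When the ip element is nested inside service elements, the search path ".//ip" finds it at any depth.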

The allowed parameters for configuring the standard Resource Agents distributed with RHCS can be found by examining the resource agent script itself (in /usr/share/cluster/), its corresponding metadata file, or (preferably) by looking at the table of Resource Agent Parameters found in the RHCS documentation.

That was definitely the most complicated piece to describe.

Identifying Resource Agents

Our PostgreSQL database service requires several resources to run; the service is merely a collection of said resources, grouped together in a structured hierarchy that defines their dependencies as parent-child relationships in XML, using the generic service resource for grouping.

The resources we must use to construct our PostgreSQL service are:

  1. DRBD device, to house the replicated data;
  2. filesystem for database files, which runs on DRBD;
  3. IP address to float, for DBMS access

These Resources combine to create a Service, which we use to provide our clustered PostgreSQL instance.

NOTE: there is a generic Resource Agent called "script" which can be used to run any LSB compliant init script (since the OCF actions "stop", "start", and "status" already map 1:1 to the LSB init script actions of the same name) from the Resource Agent. While there is an actual PostgreSQL Resource Agent with richer semantics (in particular, with the capability of passing some of the parameters to it), our implementation uses the LSB init script because we were having some trouble using the built-in Resource Agent with a concurrent PostgreSQL instance, and also, to allow the administrator to start and stop the service independently of the Cluster.

Create Resource Tree

The precise ordering and grouping semantics for the Resources are discussed in the RHCS documentation; see the reference manual on the RedHat site. Use that information to construct a resource tree in /etc/cluster/cluster.conf like so:

<?xml version="1.0"?>

<!--
-
- 2-node cluster for postgresql service
-
- TO EDIT:
-
-   1. MUST: increment config version cluster attribute
-   2. make all nodes aware: "ccs_tool update"   /
-      OR scp to all nodes                 ,____/
-                                          |
-                                          V
-->
<cluster name="dbcluster" config_version="123">

  <!--
  - fence daemon tries to fence any nodes that aren't part
  - of the cluster on startup, and we aren't using fencing
  - anyways, so just tell it to assume everyone is clean
  - despite the lack of membership
  -
  - XXX TODO do we still need this, now that we don't join
  - the fence domain via /etc/sysconfig/cman?
  -->
  <fence_daemon clean_start="1"/>

  <!--
  - we need these special attributes to allow quorum with
  - only one node present, which normally wouldn't be
  - possible
  -->
  <cman two_node="1" expected_votes="1" />

  <clusternodes>

    <clusternode nodeid="1" name="node0" votes="1">
      <fence> </fence>
    </clusternode>

    <clusternode nodeid="2" name="node1" votes="1">
      <fence> </fence>
    </clusternode>

  </clusternodes>

<!--
- RESOURCE TREE
-
- These are resource definitions for use by rgmanager.
- Not used by the cluster management layer, only by resource
- management that runs on top of it.
-
- See the RH cluster documentation and wiki for inheritance
- and ordering rules.
-->

  <rm>

    <!--
    - There are ways to trim duplication using 'ref='
    - attributes to refer to these 'resources' in some of
    - the 'service' member elements below.  We don't require
    - these, but the stanza must be present.
    - XXX TODO really?
    -->
    <resources />

    <!--
    - The master service is the database itself.
    - Make sure that we auto-relocate to the other node if a
    - node fails, instead of trying to restart it on the
    - node that had service during the failure.
    -->
    <service name="database" recovery="relocate">

      <!--
      - Comprising that service are two subservices:
      -
      -   - the database host (i.e. host/ip/drbd/fs)
      -   - the actual DBMS instance (postgres)
      -
      - We define these separately so we can get ordering.
      - Since drbd is non-typed, it's run only after all
      - typed resources.  Normally, if we used the built-in
      - postgres resource, it would also be untyped, so just
      - the order of appearance would dictate their start
      - order.  However, we're using the "script" agent to
      - invoke postgres, so it ends up running before drbd.
      - Even though "script" is always run last among typed
      - resources, all typed resources are run before
      - untyped ones.
      -->

      <!--
      - first service to setup is the "host" platform for
      - the dbms to run on
      -->
      <service name="dbnode">
        <ip address="10.80.4.100/22" monitor_link="1" />
        <drbd name="dbdev" resource="dbdev">
          <!--
          - filesystem depends on the backing store
          -->
          <fs
            name="dbfs"
            device="/dev/drbd0"
            mountpoint="/var/lib/pgsql/clusterdata"
            fstype="ext3"
            options="noatime"
          />
        </drbd>
      </service>

      <!--
      - Once the host platform has been established, we'll
      - now be able to run the actual DBMS service itself.
      -->
      <service name="dbms">
        <script
          name="postgresql"
          file="/etc/init.d/postgresql-cluster"
        />
      </service>
    </service>
  </rm>
</cluster>

DO NOT start rgmanager yet! Before proceeding to automate the resources, each component of the tree must be debugged and tested manually, as in the section below.
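
Since every edit must increment config_version, that step is worth scripting. The sketch below bumps the number with a regular expression so that comments and layout in the file are left untouched; it assumes the attribute is written with double quotes on the cluster element, and the function name is ours.

```python
import re

# Matches the opening <cluster ...> tag up to and including the version number.
_VERSION_RE = re.compile(r'(<cluster\b[^>]*\bconfig_version=")(\d+)(")')


def bump_config_version(conf_text: str) -> str:
    """Return conf_text with the cluster config_version increased by one."""
    def _increment(match):
        return f"{match.group(1)}{int(match.group(2)) + 1}{match.group(3)}"

    new_text, count = _VERSION_RE.subn(_increment, conf_text, count=1)
    if count != 1:
        raise ValueError('no <cluster ... config_version="N"> attribute found')
    return new_text
```

The updated text still has to be distributed to all nodes, as the comment at the top of the configuration file says.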

Verify Resource Tree

Each resource should be thoroughly tested on BOTH NODES before starting rgmanager(8). The rg_test(8) tool can be used to learn how the resources have parsed, their interrelationships and dependencies.

To show the resource tree and check for parse errors, typos, or a disconnect between your idea of the resource tree and the one the cluster thinks it has, use e.g.:

$ sudo rg_test test cluster.conf | grep -iA 999 tree
Running in test mode.
Loaded 23 resource rules
=== Resource Tree ===
service {
  name = "database";
  autostart = "1";
  hardrecovery = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  service {
    name = "dbnode";
    autostart = "1";
    hardrecovery = "0";
    exclusive = "0";
    nfslock = "0";
    nfs_client_cache = "0";
    recovery = "restart";
    depend_mode = "hard";
    max_restarts = "0";
    restart_expire_time = "0";
    priority = "0";
    ip {
      address = "10.80.4.100/22";
      monitor_link = "1";
      nfslock = "0";
    }
    drbd {
      name = "dbdev";
      resource = "dbdev";
      fs {
        name = "dbfs";
        mountpoint = "/var/lib/pgsql/clusterdata";
        device = "/dev/drbd0";
        fstype = "ext3";
        nfslock = "(null)";
        options = "noatime";
      }
    }
  }
  service {
    name = "dbms";
    autostart = "1";
    hardrecovery = "0";
    exclusive = "0";
    nfslock = "0";
    nfs_client_cache = "0";
    recovery = "restart";
    depend_mode = "hard";
    max_restarts = "0";
    restart_expire_time = "0";
    priority = "0";
    script {
      name = "postgresql";
      file = "/etc/init.d/postgresql-cluster";
      service_name = "dbms";
    }
  }
}

All the parent/child relationships are shown here. Iteratively update the cluster.conf file and re-test until it looks exactly like it should. If necessary:

$ sudo rg_test rules

can be used for super-verbose debugging of the scripts themselves. This would be most useful for writing one's own resource agent, for example.
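
When comparing the tree that rg_test reports against the one you meant to write, it can help to print just the nesting from the XML. This sketch walks the service elements under rm and emits an indented outline; it reads the file text directly, says nothing about start order, and its function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Yield one indented 'tag label' line per element, parents before children."""
    label = element.get("name") or element.get("address") or ""
    yield "  " * depth + f"{element.tag} {label}".rstrip()
    for child in element:
        yield from outline(child, depth + 1)


def resource_outline(conf_text: str) -> list:
    """Outline every <service> found directly under <rm>."""
    root = ET.fromstring(conf_text)
    rm = root.find("rm")
    if rm is None:
        return []
    lines = []
    for service in rm.findall("service"):
        lines.extend(outline(service))
    return lines
```

XML comments are skipped by the parser, so only real elements appear in the outline.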

Verify Resource Ordering

Now that we think the tree is correct as known to the cluster, that should be verified; the rules for resource start order are quite complex (see the RHCS documentation). Verifying the order now, before any failure occurs, ensures that the cluster behaves correctly when a real failure event does occur.

Testing at this level will not actually perform any actions, but it will show exactly the order in which everything would have been run, e.g.:

$ sudo rg_test noop cluster.conf start service database
Running in test mode.
Starting database...
[start] service:database
[start] service:dbnode
[start] ip:10.80.4.100/22
[start] drbd:dbdev
[start] fs:dbfs
[start] service:dbms
[start] script:postgresql
Start of database complete

$ sudo rg_test noop cluster.conf stop service database
Running in test mode.
Stopping database...
[stop] script:postgresql
[stop] service:dbms
[stop] fs:dbfs
[stop] drbd:dbdev
[stop] ip:10.80.4.100/22
[stop] service:dbnode
[stop] service:database
Stop of database complete
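
In the runs above the stop sequence is the start sequence in reverse. If that is the behavior you expect from your own tree, the comparison can be automated on captured output. This is a sketch that assumes the "[action] type:name" line format shown above; the function names are ours.

```python
import re

_STEP_RE = re.compile(r"^\[(start|stop)\]\s+(\S+)$")


def steps(output: str, action: str) -> list:
    """Extract the 'type:name' items from '[action] type:name' lines, in order."""
    found = []
    for line in output.splitlines():
        match = _STEP_RE.match(line.strip())
        if match and match.group(1) == action:
            found.append(match.group(2))
    return found


def stop_mirrors_start(start_output: str, stop_output: str) -> bool:
    """True when the stop order is exactly the start order reversed."""
    return steps(stop_output, "stop") == list(reversed(steps(start_output, "start")))
```

Capture the two noop runs to files, pass their contents in, and a False result points at either an ordering surprise or a resource that appears in only one of the two lists.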

Test Resource Ordering

Testing the ordering amounts to actually running the Resource Agent actions defined for each "service" abstraction defined in the Resource Tree, and verifying that dependencies are started before the resources that depend on them, in the correct order, e.g.:

$ sudo rg_test test cluster.conf start service database
Running in test mode.
Starting database...
<debug>  Link for bond0: Detected
<info>   Adding IPv4 address 10.80.4.100/22 to bond0
<debug>  Sending gratuitous ARP:
         10.80.4.100 00:10:18:24:3f:7b brd ff:ff:ff:ff:ff:ff
<info>   mounting /dev/drbd0 on /var/lib/pgsql/clusterdata
<debug>  mount -t ext3 -o noatime /dev/drbd0 /var/lib/pgsql/clusterdata
<info>   Executing /etc/init.d/postgresql-cluster start
Starting postgresql-cluster service: [  OK  ]
Start of database complete

$ sudo rg_test test cluster.conf stop service database
Running in test mode.
Stopping database...
<info>   Executing /etc/init.d/postgresql-cluster stop
Stopping postgresql-cluster service: [  OK  ]
<info>   unmounting /var/lib/pgsql/clusterdata
<info>   Removing IPv4 address 10.80.4.100/22 from bond0
Stop of database complete

Subcomponents of the master service can be tested and debugged as needed; only the "outermost" service (i.e. the entire clustered DBMS service) was shown here.

Make Cluster Permanent

Once the full Resource Tree and its interrelationships have been tested and verified on both nodes, we know that both the Cluster layer and the Resource layer are now complete.

We can now automate the process, making our cluster "officially" operational:

$ sudo chkconfig cman on
$ sudo chkconfig rgmanager on
$ sudo env - /sbin/service cman start
$ sudo env - /sbin/service rgmanager start

Reboots, etc. should be tested. The results as discovered in tests are discussed in companion documents.