Creating Custom OCF Resource Agents

A Simple Guide for Creating OCF Resource Agents with Pacemaker and Corosync

HA Clustering Explained

For those not familiar with High Availability (HA) resources and services or cluster computing, this section gives a brief summary of OCF Resource Agents (RAs) before we dive into creating our own. In cluster computing, multiple servers are connected over a network to improve the delivery of services, as with HA.

The servers (referred to as nodes) are all managed by a Cluster Resource Manager (CRM), which, as you may have guessed, manages cluster resources. High Availability (HA) refers to the ability to provide a resource or service with minimal to zero downtime. In layman’s terms, whatever the client wants, the client gets, and the store is open 24/7/365. In our case we will be using Pacemaker as our CRM.

Each node needs a way to communicate with the other nodes in the cluster for this to work: when a resource fails or a node goes down, the CRM needs to be notified so it can make other resources available. This is where Corosync steps in. Corosync is the communication layer that provides Pacemaker with the status of resources and nodes on the cluster. The information that Pacemaker receives can describe many complex situations and scenarios, which Pacemaker can deal with in a variety of ways.

OCF Resource Agents

Where do the Resource Agents fit into this picture, you ask? Glad you asked. The RAs are deployed by Pacemaker to interface with each resource that needs to be managed, with Corosync providing the communication layer between nodes. The protocol used by the RAs is known as the Open Cluster Framework (OCF); it defines how RAs should be implemented and the requirements for creating one. Most of the time they are implemented as shell scripts, but they can be written in any programming language. More info on HA cluster computing and the OCF protocol can be found at Cluster Labs, and if you have any further questions feel free to email us at Flyball Labs. That is it for the lecture today, on with the chlorophyll!
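To make the idea concrete, here is a minimal sketch of the shape an OCF-style agent takes: it receives one action name as its argument and answers with an exit code. The function names (demo_start, demo_agent and so on) and the STATE_FILE marker are hypothetical and only for illustration. A real agent also has to implement actions such as meta-data and validate-all, and would normally pull its exit code definitions from the ocf-shellfuncs library rather than defining them by hand.

```shell
#!/bin/bash
# Sketch only: one action argument in, one exit code out.
# Exit code values as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical marker file that stands in for "the resource is running".
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/demo-agent.state}"

demo_start() { touch "$STATE_FILE"; return "$OCF_SUCCESS"; }
demo_stop()  { rm -f "$STATE_FILE"; return "$OCF_SUCCESS"; }

# monitor reports success when running and "not running" otherwise.
demo_monitor() {
    if [ -f "$STATE_FILE" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}

# Dispatch on the action name, the way the cluster manager calls an agent.
demo_agent() {
    case "$1" in
        start)   demo_start ;;
        stop)    demo_stop ;;
        monitor) demo_monitor ;;
        *)       return "$OCF_ERR_UNIMPLEMENTED" ;;
    esac
}
```

Calling demo_agent start followed by demo_agent monitor returns 0, and after demo_agent stop the monitor action returns 7, which is how the cluster manager tells a cleanly stopped resource apart from a failed one.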

How to Create Your Own OCF RA

Prerequisites

This guide assumes that you have Corosync and Pacemaker set up on a multi-node cluster (at least 2 nodes).

You must also have access to the remote nodes via ssh / scp to transfer the completed files. Note that every node will need a copy of the RA script.

Design

We will be creating a simple Resource Agent that runs a supplied script to perform a health check. Our health check will simply check if the node has any hard drives that are 95% used or more. If the health check does not pass, then we must shut the node down and perform a failover to another node (including our RA).

Instructions

First we are going to grab a template OCF script. You will need to install the git command-line package if you haven’t already. Clone the repository and change into its directory.
```
    git clone https://github.com/russki/cluster-agents.git &&
    cd cluster-agents
```
Make a copy and rename the RA to something pertinent, like “health-agent”.
```
    cp generic-script health-agent
```
Make sure to make the copy executable.
```
    chmod +x health-agent
```
Change the names inside the agent script to match the new file name.
```
    sed -i 's/generic_script/health_agent/g' health-agent
    sed -i 's/generic-script/health-agent/g' health-agent
    sed -i 's/Generic Script/Health Agent/'  health-agent
```
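A quick way to confirm the rename caught everything is to search the file for any leftover template names. The helper below is a small sketch; the function name check_renamed is hypothetical, and the pattern simply covers the three spellings replaced above.

```shell
# Print any lines that still contain the template's old names.
# Returns 0 if the file is clean, 1 if leftovers were found.
check_renamed() {
    local file="$1"
    # -i: ignore case, -n: show line numbers, -E: extended regex
    if grep -n -i -E 'generic[-_ ]script' "$file"; then
        return 1
    fi
    return 0
}
```

Running check_renamed health-agent should print nothing and return 0 once all three sed commands have been applied.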
We now have a Resource Agent that can run and manage a shell script. Arguments will be passed through the agent to the script that we supply. Now we need to copy it over to each of the cluster nodes, install it, and test it.

We can accomplish this fairly simply with a bash script such as the one below. Note that I just whipped this up, so it may need tweaking.

    #!/bin/bash
    #
    # Assuming user is the same on each node
    #

    USAGE() {
        echo "./rlogin.sh <server1> <server2> ..."
    }

    if [ "$#" -eq 0 ]; then
        USAGE
        exit 1
    fi

    read -p "Username: " user

    for server in "$@"
    do
      scp health-agent "$user"@"$server":/usr/lib/ocf/resource.d/heartbeat
    done

    exit 0
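Before pointing a script like this at real nodes, it can help to see exactly which commands it would run. The variant below is a sketch under a few assumptions: the run and deploy_agent helpers and the DRY_RUN variable are hypothetical names, and the extra ssh step only checks that the copied agent ended up executable on the remote side.

```shell
# Same target directory as the copy script above.
AGENT_DIR=/usr/lib/ocf/resource.d/heartbeat

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the agent to one server, then confirm it is executable there.
deploy_agent() {
    local user="$1" server="$2"
    run scp health-agent "$user@$server:$AGENT_DIR/" &&
    run ssh "$user@$server" "test -x $AGENT_DIR/health-agent"
}
```

With DRY_RUN=1 set, calling deploy_agent with a user and a server name only prints the scp and ssh commands, which makes it easy to check the paths before anything is copied.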

Next, create a shell script for the resource agent to run, which will check the HDD status on the node.
Here is a sample script that I use across my cluster nodes; it checks the HDD capacity and returns a status.


    #!/bin/bash
    # Check if physical hard drives are full
    # Return 1 if any drive is over tolerance (percent)
    # Return 0 if all drives are under tolerance (percent)

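    TOLERANCE=5 # percent of free space that must remain
    TEST="$((100 - TOLERANCE))"

    # Used percentage of each /dev/ filesystem, one number per line.
    # Assumes POSIX df output, where column 5 is the capacity (e.g. "42%").
    USED=$(df -P | awk 'NR>1 && $1 ~ /^\/dev\// {gsub(/%/, "", $5); print $5}')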

    while read -r line; do
        if [[ "$line" -gt "$TEST" ]]; then
            echo "drive is over tolerance"
            exit 1
        fi
    done <<< "$USED"

    echo "all drives passed"
    exit 0
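The threshold logic is easier to test when it is separated from the df call, since you can then feed it made-up numbers instead of filling a disk. The function below is a sketch of that idea; the name check_usage is hypothetical, and it uses the same strictly-greater-than comparison as the script above.

```shell
# check_usage LIMIT: read one usage percentage per line on stdin.
# Returns 1 if any value is above LIMIT, 0 otherwise.
check_usage() {
    local limit="$1" line
    while read -r line; do
        # Skip blank lines, e.g. when no drives were listed.
        [ -z "$line" ] && continue
        if [ "$line" -gt "$limit" ]; then
            return 1
        fi
    done
    return 0
}
```

For example, printf '10\n96\n' | check_usage 95 returns 1, while printf '10\n50\n' | check_usage 95 returns 0.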

Save the bash script, make it executable, then copy it with scp to all the servers. Then connect to each node, perform the tests, and add the resource if everything checks out.

Test your agent using ocf-tester to catch any mistakes (on the remote nodes). We want to do this on all nodes to ensure they are configured correctly and have the dependencies needed to run our script. Note: provide it with the full path to the agent script.
```
    ocf-tester -n health-agent /usr/lib/ocf/resource.d/heartbeat/health-agent
```
Try executing the agent in a shell and make sure it runs without any major errors. As in the previous example, we should check each of the nodes.
```
    export OCF_ROOT=/usr/lib/ocf; bash -x /usr/lib/ocf/resource.d/heartbeat/health-agent start
```
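When running the agent by hand, the exit code of each action matters as much as the trace output. The helper below is a small sketch (run_actions is a hypothetical name) that runs a list of actions in order and prints the exit code of each. Pacemaker passes resource parameters to an agent as OCF_RESKEY_<name> environment variables, so the sketch assumes you export those yourself before calling it.

```shell
# run_actions AGENT ACTION...: run each action and print "ACTION rc=N".
run_actions() {
    local agent="$1" action rc
    shift
    for action in "$@"; do
        rc=0
        "$agent" "$action" >/dev/null 2>&1 || rc=$?
        echo "$action rc=$rc"
    done
}

# Example use on a node (not executed here):
#   export OCF_ROOT=/usr/lib/ocf
#   export OCF_RESKEY_script="full/path/to/script/to/run"
#   run_actions /usr/lib/ocf/resource.d/heartbeat/health-agent start monitor stop monitor
```

An exit code of 0 means success and 7 means the resource is not running, so a monitor call after a stop would normally be expected to report 7.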

Add the resource to Pacemaker using pcs.
The cluster configuration is shared between the nodes, so this only needs to be run once, from any node in the cluster.

    pcs resource create health-agent ocf:heartbeat:health-agent script="full/path/to/script/to/run" state="/dev/shm" alwaysrun="yes" op monitor interval=60s
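Since the script parameter has to be a full path that exists on every node, a small pre-check can save a failed start. This is a sketch only; validate_script_path is a hypothetical helper and is not part of pcs or the agent.

```shell
# Return 0 if PATH is absolute and executable on this node, 1 otherwise.
validate_script_path() {
    local path="$1"
    case "$path" in
        /*) ;;  # absolute path, keep checking
        *)  echo "not an absolute path: $path" >&2; return 1 ;;
    esac
    if [ ! -x "$path" ]; then
        echo "not executable: $path" >&2
        return 1
    fi
    return 0
}
```

Run it on each node with the same value you plan to pass as script= before creating the resource.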

To see the status of the cluster:
```
    pcs status
```
This should show your newly added resource agent and some additional status information that can be helpful in debugging.

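To show detailed information about only the health-agent resource, run the following (on newer pcs releases, where pcs resource show is deprecated, pcs resource config health-agent is the equivalent):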
```
    pcs resource show health-agent
```
You can also run individual resource agent actions through pcs. This can be useful for finding out whether a certain action isn’t working properly.
```
    pcs resource debug-start health-agent
    pcs resource debug-stop health-agent
    pcs resource debug-monitor health-agent
```

Summary

I hope that helped clear up a few gotchas about cluster resource management. The OCF protocol is not very intuitive, and small changes can completely ruin your previously working agents. Make sure to test, test, and test again. There is also a more sophisticated testing framework for OCF resource agents called ocft, which allows for testing environments, creating test cases, and much more. You can find that tool at Linux HA. Hope you enjoyed this post, more awesomeness to follow!

For More Information

For more information, questions and comments visit us at:

dOpenSource
Flyball Labs

Written By:

DevOpSec
Software Engineer
Flyball Labs