Active/Active High Availability Pacemaker Cluster with GFS2 and iSCSI Shared Storage on CentOS 7

We are going to build a three-node active/active HA cluster using Pacemaker and Corosync.

The Plan

Our aim is to build a three-node (never trust a cluster without an odd number of voters) active/active GFS2 cluster using Pacemaker and Corosync.

We have three CentOS 7 virtual machines on VMware (ESXi), named pcmk01, pcmk02 and pcmk03.

The convention followed in the article is that [ALL]# denotes a command that needs to be run on all cluster nodes, while a prompt such as [pcmk01]# denotes a command to be run on that particular node only.

TL;DR

The article covers setup and configuration of:

  1. Passwordless SSH authentication
  2. Pacemaker with Corosync
  3. iSCSI Multipath Device (NetApp)
  4. STONITH (VMware fencing device)
  5. Pacemaker for GFS2
  6. LVM

Notes

Red Hat does not support using GFS2 for cluster file system deployments greater than 16 nodes.

The gfs2_tool command is not supported in RHEL 7.

Make sure that the clocks on the GFS2 nodes are synchronised. Unnecessary inode time-stamp updating severely impacts cluster performance. NTP configuration is not covered in this article.
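
If a time service still needs to be set up, a minimal sketch using chrony (the default time service on CentOS 7) could look like the following; it assumes the NTP servers listed in /etc/chrony.conf are suitable for your environment:

[ALL]# yum install -y chrony
[ALL]# systemctl enable chronyd.service
[ALL]# systemctl start chronyd.service
[ALL]# chronyc tracking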

iSCSI server installation and setup is beyond the scope of this article. It is assumed that you already have a functional shared storage server in place. You may find this post on configuring an iSCSI target on RHEL 7 helpful.

Software

Software used in this article:

  1. CentOS Linux release 7.2.1511 (Core)
  2. pacemaker-1.1.13
  3. corosync-2.3.4
  4. pcs-0.9.143
  5. resource-agents-3.9.5
  6. device-mapper-multipath 0.4.9
  7. gfs2-utils 3.1.8

Networking and Firewall Configuration

IP Addresses and Hostnames

The following networks will be used:

  1. 10.247.50.0/24 – LAN with access to the Internet,
  2. 172.16.21.0/24 – non-routable cluster heartbeat vlan for Corosync,
  3. 10.11.0.0/16 – non-routable iSCSI vlan,
  4. 10.12.0.0/16 – non-routable iSCSI vlan.

Hostnames and IPs as defined in /etc/hosts file:

10.247.50.10 vcentre
10.247.50.211 pcmk01 vm-pcmk01
10.247.50.212 pcmk02 vm-pcmk02
10.247.50.213 pcmk03 vm-pcmk03
172.16.21.11 pcmk01-cr
172.16.21.12 pcmk02-cr
172.16.21.13 pcmk03-cr
10.11.0.147 pcmk01-iscsi1
10.11.0.148 pcmk02-iscsi1
10.11.0.149 pcmk03-iscsi1
10.12.0.147 pcmk01-iscsi2
10.12.0.148 pcmk02-iscsi2
10.12.0.149 pcmk03-iscsi2

The vcentre record above points to our VMware vCenter server, which is on 10.247.50.10.

We have set the following hostnames:

[pcmk01]# hostnamectl set-hostname pcmk01
[pcmk02]# hostnamectl set-hostname pcmk02
[pcmk03]# hostnamectl set-hostname pcmk03

Iptables

This article uses the iptables firewall. Note that CentOS 7 uses FirewallD as its default firewall management tool.

Replace FirewallD service with Iptables:

[ALL]# systemctl stop firewalld.service
[ALL]# systemctl mask firewalld.service
[ALL]# systemctl daemon-reload
[ALL]# yum install -y iptables-services
[ALL]# systemctl enable iptables.service
[ALL]# service iptables save

We use the following firewall rules:

# iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -s 10.0.0.0/8 -p tcp -m tcp --dport 22 -m state --state NEW -j ACCEPT
-A INPUT -s 10.0.0.0/8 -p tcp -m tcp --dport 2224 -m state --state NEW -j ACCEPT
-A INPUT -s 172.16.21.0/24 -d 172.16.21.0/24 -m comment --comment Corosync -j ACCEPT
-A INPUT -s 10.11.0.0/16 -d 10.11.0.0/16 -m comment --comment iSCSI_1 -j ACCEPT
-A INPUT -s 10.12.0.0/16 -d 10.12.0.0/16 -m comment --comment iSCSI_2 -j ACCEPT
-A INPUT -p udp -m multiport --dports 67,68 -m state --state NEW -j ACCEPT
-A INPUT -p udp -m multiport --dports 137,138,139,445 -j DROP
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -j LOG --log-prefix "iptables_input "
-A INPUT -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -j LOG --log-prefix "iptables_forward "
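
If you would rather keep FirewallD, a roughly equivalent approach is to allow the built-in high-availability service (which opens the pcsd, Corosync and DLM ports) and mark the non-routable cluster networks as trusted sources, for example:

[ALL]# firewall-cmd --permanent --add-service=high-availability
[ALL]# firewall-cmd --permanent --zone=trusted --add-source=172.16.21.0/24
[ALL]# firewall-cmd --reload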

Sysctl and SELinux

Open /etc/sysctl.conf for editing and place the following to disable IPv6:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

Also cause a node to panic if an oops (an in-kernel segfault or GFS assertion failure) happens:

kernel.panic_on_oops = 1

This should be turned on automatically by GFS. Load values from file:

[ALL]# sysctl -p

SELinux is set to enforcing mode.
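
To confirm the current SELinux mode:

[ALL]# getenforce
Enforcing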

1. Configure Passwordless SSH Authentication Between Cluster Nodes

Install the rsync package, generate an SSH keypair and distribute it across the cluster nodes.

[ALL]# yum install rsync
[pcmk01]# ssh-keygen -b 2048 -t rsa -C "root@pcmk-nodes" -f ~/.ssh/id_rsa
[pcmk01]# mv ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

Sync with ourselves to get an ECDSA key fingerprint stored into known_hosts, then sync with other cluster nodes:

[pcmk01]# rsync -av /root/.ssh/* pcmk01:/root/.ssh/
[pcmk01]# rsync -av /root/.ssh/* pcmk02:/root/.ssh/
[pcmk01]# rsync -av /root/.ssh/* pcmk03:/root/.ssh/
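
A quick way to verify that passwordless SSH now works from the first node:

[pcmk01]# for node in pcmk01 pcmk02 pcmk03; do ssh "$node" hostname; done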

Optional: Sync Various Configuration Files Between Cluster Nodes

Here is a neat rsync_nodes.sh script to keep nodes in sync:

#!/bin/bash
# written by Tomas (http://www.lisenet.com)
# 07/02/2016 (dd/mm/yy)
# copyleft free software
# Simple script to keep cluster nodes in sync.
# Run it from the first node; it pushes the items below to the other nodes.
#
LOGFILE="$HOME/rsync_nodes.log"
# Nodes to push the configuration to.
NODES=(pcmk02 pcmk03)
# Files and directories to sync.
# More entries can be added as required.
ITEMS=(
  /etc/hosts
  /etc/sysconfig/iptables
  /etc/sysctl.conf
  /etc/security/limits.conf
  /etc/multipath.conf
  /etc/yum.repos.d/
)
echo "Logfile is: $LOGFILE"
for ITEM in "${ITEMS[@]}"; do
  echo "Syncing $ITEM"
  for NODE in "${NODES[@]}"; do
    rsync -av "$ITEM" "$NODE":"$ITEM" >>"$LOGFILE" 2>&1
  done
done
exit 0

2. Install Pacemaker and Corosync

We want to install VMware Tools first (skip this step if running on a non-VMware platform):

[ALL]# yum install open-vm-tools

Installing pcs will pull in pacemaker, corosync and resource-agents as dependencies.

[ALL]# yum install -y pcs

Optionally, install policycoreutils-python for SELinux management:

[ALL]# yum install -y policycoreutils-python

Set up a password for the pcs administration account named hacluster:

[ALL]# echo "passwd" | passwd hacluster --stdin

Start and enable the service:

[ALL]# systemctl start pcsd.service
[ALL]# systemctl enable pcsd.service

Configure Corosync

Authenticate as the hacluster user. Note that authorisation tokens are stored in the file /var/lib/pcsd/tokens.

[pcmk01]# pcs cluster auth pcmk01-cr pcmk02-cr pcmk03-cr -u hacluster -p passwd
pcmk01-cr: Authorized
pcmk02-cr: Authorized
pcmk03-cr: Authorized

Generate and synchronise the Corosync configuration:

[pcmk01]# pcs cluster setup --name gfs_cluster pcmk01-cr pcmk02-cr pcmk03-cr

Start the cluster on all nodes:

[pcmk01]# pcs cluster start --all

Enable cluster services to start on boot:

[ALL]# pcs cluster enable --all

Our Pacemaker cluster is now up and running, but with no resources configured yet.
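
We can verify the overall cluster and node status at any point with:

[pcmk01]# pcs status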

The cluster should have quorum.

[pcmk01]# corosync-quorumtool -s
Quorum information
------------------
Date:             Sat Feb 6 15:41:52 2016
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          996
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pcmk01-cr (local)
         2          1 pcmk02-cr
         3          1 pcmk03-cr

The cluster manager web UI can be accessed at https://pcmk01:2224/.

3. iSCSI Client Installation and Configuration

Configure Device-Mapper Multipath and iSCSI Initiator

[ALL]# yum install device-mapper-multipath iscsi-initiator-utils

The default settings for DM Multipath are compiled into the system and do not need to be explicitly set in the /etc/multipath.conf file.

However, the default value of path_grouping_policy is failover, so depending on the setup, we may need to edit the /etc/multipath.conf file and change it accordingly.

As we usually have an sda disk already in use on the servers, we’re blacklisting it.

[ALL]# cat << EOL > /etc/multipath.conf
defaults {
        user_friendly_names yes
        find_multipaths yes
}
blacklist {
        devnode "sda"
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9].*"
}
devices {
        device {
                vendor               "NETAPP"
                product              "NewFiler"
                path_grouping_policy multibus
                path_selector        "round-robin 0"
                failback             immediate
        }
}
EOL

Enable and start the multipathd service:

[ALL]# systemctl enable multipathd.service
[ALL]# systemctl start multipathd

Configure an iSCSI initiator name:

[ALL]# echo "InitiatorName=iqn.1994-05.com.redhat:$(hostname)" >/etc/iscsi/initiatorname.iscsi

Enable and start the iscsi service:

[ALL]# systemctl enable iscsi.service
[ALL]# systemctl start iscsi

Discover the newly created targets, where 10.11.0.5 is the IP of our NetApp SAN:

[ALL]# iscsiadm -m discovery -t sendtargets -p 10.11.0.5:3260

Set the logon as automatic:

[ALL]# iscsiadm -m node -L automatic
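
To confirm that sessions to both iSCSI portals have been established:

[ALL]# iscsiadm -m session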

NetApp Data ONTAP SAN: Create a LUN on an iSCSI Server

The commands below are specific to the NetApp Data ONTAP SAN that we have in use, and are therefore not explained in detail, as they are of questionable benefit to a reader coming across this article (the emphasis is on Pacemaker + GFS2). These are mostly for our own future reference.

SAN> vol create iSCSI_PCMK_MySQL_Test_Cluster -s none aggr1_fcal 5g
SAN> vol autosize iSCSI_PCMK_MySQL_Test_Cluster -m 10g -i 1g on
SAN> snap sched iSCSI_PCMK_MySQL_Test_Cluster 0 0 0
SAN> snap reserve iSCSI_PCMK_MySQL_Test_Cluster 0
SAN> vfiler add vfiler2 /vol/iSCSI_PCMK_MySQL_Test_Cluster

SAN> igroup create -i -t linux PCMK_MySQL_Test_Cluster iqn.1994-05.com.redhat:pcmk01
SAN> igroup add PCMK_MySQL_Test_Cluster iqn.1994-05.com.redhat:pcmk02
SAN> igroup add PCMK_MySQL_Test_Cluster iqn.1994-05.com.redhat:pcmk03

SAN> igroup show PCMK_MySQL_Test_Cluster
    PCMK_MySQL_Test_Cluster (iSCSI) (ostype: linux):
        iqn.1994-05.com.redhat:pcmk01 (logged in on: IFGRP1-12, IFGRP1-11)
        iqn.1994-05.com.redhat:pcmk02 (logged in on: IFGRP1-12, IFGRP1-11)
        iqn.1994-05.com.redhat:pcmk03 (logged in on: IFGRP1-12, IFGRP1-11)

SAN> lun create -s 5g -t linux /vol/iSCSI_PCMK_MySQL_Test_Cluster/iSCSI_PCMK_MySQL_Test_Cluster.lun
SAN> lun set reservation /vol/iSCSI_PCMK_MySQL_Test_Cluster/iSCSI_PCMK_MySQL_Test_Cluster.lun disable
SAN> lun map -f /vol/iSCSI_PCMK_MySQL_Test_Cluster/iSCSI_PCMK_MySQL_Test_Cluster.lun PCMK_MySQL_Test_Cluster

SAN> lun show -v /vol/iSCSI_PCMK_MySQL_Test_Cluster/iSCSI_PCMK_MySQL_Test_Cluster.lun
/vol/iSCSI_PCMK_MySQL_Test_Cluster/iSCSI_PCMK_MySQL_Test_Cluster.lun    5.0g (5346689024)    (r/w, online, mapped)
                Serial#: 80AQDCKHG1AJ
                Share: none
                Space Reservation: disabled
                Multiprotocol Type: linux
                Maps: PCMK_MySQL_Test_Cluster=0
                Occupied Size:       0 (0)
                Creation Time: Sat Jan 16 19:31:40 GMT 2016
                Cluster Shared Volume Information: 0x0
                Read-Only: disabled

When the LUN is ready, rescan the iSCSI sessions to pick up the new device:

[ALL]# iscsiadm -m session --rescan
[pcmk01]# multipath -ll
360a980003830354d66244675306b7343 dm-2 NETAPP  ,LUN
size=5.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=2 status=active
  |- 5:0:0:0 sdb 8:16 active ready running
  `- 6:0:0:0 sdc 8:32 active ready running

4. Configure STONITH (aka Node Fencing)

Note that the cluster property stonith-enabled must not be disabled when using the DLM. Clusters with shared data need STONITH to ensure data integrity.

Install a fencing agent suitable for a VMware environment:

[ALL]# yum install -y fence-agents-vmware-soap
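
Before defining the STONITH resource, it is worth checking that the fence agent can talk to vCenter from the command line and list the virtual machines. A hedged example using the same placeholder credentials as below (long option names may vary slightly between fence-agents versions):

[pcmk01]# fence_vmware_soap --ip vcentre --ssl --ssl-insecure \
  --username "vcentre-account" --password "passwd" --action list | grep pcmk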

Populate a file named stonith_cfg with the current raw XML config from the CIB:

[pcmk01]# pcs cluster cib stonith_cfg

Create a new STONITH resource called my_vcentre-fence:

[pcmk01]# pcs -f stonith_cfg stonith create my_vcentre-fence fence_vmware_soap \
 ipaddr=vcentre ipport=443 ssl_insecure=1 inet4_only=1 \
 login="vcentre-account" passwd="passwd" \
 action=reboot \
 pcmk_host_map="pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02;pcmk03-cr:vm-pcmk03" \
 pcmk_host_check=static-list \
 pcmk_host_list="vm-pcmk01,vm-pcmk02,vm-pcmk03" \
 power_wait=3 op monitor interval=90s

Enable STONITH, set its action and timeout, and commit the changes:

[pcmk01]# pcs -f stonith_cfg property set stonith-enabled=true
[pcmk01]# pcs -f stonith_cfg property set stonith-action=reboot
[pcmk01]# pcs -f stonith_cfg property set stonith-timeout=120s
[pcmk01]# pcs cluster cib-push stonith_cfg

Check all currently configured STONITH properties:

[pcmk01]# pcs property list --all|grep stonith
 stonith-action: reboot
 stonith-enabled: true
 stonith-timeout: 120s
 stonith-watchdog-timeout: (null)

We can also check all property defaults:

[pcmk01]# pcs property list --defaults

Show all currently configured STONITH devices:

[pcmk01]# pcs stonith show --full
 Resource: my_vcentre-fence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=vcentre ipport=443 ssl_insecure=1 inet4_only=1 login=vcentre-account passwd=passwd action=reboot pcmk_host_map=pcmk01-cr:vm-pcmk01;pcmk02-cr:vm-pcmk02;pcmk03-cr:vm-pcmk03 pcmk_host_check=static-list pcmk_host_list=vm-pcmk01,vm-pcmk02,vm-pcmk03 power_wait=3
  Operations: monitor interval=90s (my_vcentre-fence-monitor-interval-90s)

Test STONITH by rebooting the third cluster node. Make sure to use the Corosync node name for this:

[pcmk01]# stonith_admin --reboot pcmk03-cr

If we get timeouts, we can increase the monitoring interval:

[pcmk01]# pcs stonith update my_vcentre-fence op monitor interval=120s

Or do a simple cleanup; sometimes it is all that is needed:

[pcmk01]# pcs stonith cleanup

Cleanup tells the cluster to forget the operation history of a stonith device and re-detect its current state. It can be useful to purge knowledge of past failures that have since been resolved.

5. Configure Pacemaker for GFS2

We want to prevent healthy resources from being moved around the cluster. We can specify a different stickiness for every resource, but it is often sufficient to change the default.

[pcmk01]# pcs resource defaults resource-stickiness=200
[pcmk01]# pcs resource defaults
resource-stickiness: 200

Install the GFS2 command-line utilities and the Distributed Lock Manager (DLM) required by cluster filesystems:

[ALL]# yum install gfs2-utils lvm2-cluster

Enable clustered locking for LVM:

[ALL]# lvmconf --enable-cluster

This sets locking_type to 3 on the system and disables use of lvmetad, as it is not yet supported in a clustered environment. Another way of doing this would be to open the /etc/lvm/lvm.conf file and set:

locking_type = 3
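
Either way, the active settings can be confirmed afterwards:

[ALL]# grep -E '^\s*(locking_type|use_lvmetad)' /etc/lvm/lvm.conf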

The DLM needs to run on all nodes, so we’ll start by creating a resource for it (using the ocf:pacemaker:controld resource script), and clone it. Note that a dlm resource is a required dependency for clvmd and GFS2.

[pcmk01]# pcs cluster cib dlm_cfg
[pcmk01]# pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld \
  op monitor interval=120s on-fail=fence clone interleave=true ordered=true

Set up clvmd as a cluster resource.

[pcmk01]# pcs -f dlm_cfg resource create clvmd ocf:heartbeat:clvm \
  op monitor interval=120s on-fail=fence clone interleave=true ordered=true

Set up the clvmd and dlm dependency and start-up order. Create an ordering constraint and a colocation constraint so that clvmd starts after dlm and both resources start on the same node.

[pcmk01]# pcs -f dlm_cfg constraint order start dlm-clone then clvmd-clone
[pcmk01]# pcs -f dlm_cfg constraint colocation add clvmd-clone with dlm-clone

Set the no-quorum-policy of the cluster to freeze so that when quorum is lost, the remaining partition will do nothing until quorum is regained – GFS2 requires quorum to operate.

[pcmk01]# pcs -f dlm_cfg property set no-quorum-policy=freeze

Let us check the configuration:

[pcmk01]# pcs -f dlm_cfg constraint
Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
[pcmk01]# pcs -f dlm_cfg resource show
 Clone Set: dlm-clone [dlm]
     Stopped: [ pcmk01-cr pcmk02-cr pcmk03-cr ]
 Clone Set: clvmd-clone [clvmd]
     Stopped: [ pcmk01-cr pcmk02-cr pcmk03-cr ]

Commit changes:

[pcmk01]# pcs cluster cib-push dlm_cfg
[pcmk01]# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ pcmk01-cr pcmk02-cr pcmk03-cr ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ pcmk01-cr pcmk02-cr pcmk03-cr ]
[pcmk01]# pcs property list no-quorum-policy
Cluster Properties:
 no-quorum-policy: freeze

LVM Configuration

Create LVM Objects

We will create LVM objects from a single cluster node.

[pcmk01]# pvcreate /dev/mapper/360a980003830354d66244675306b7343
[pcmk01]# vgcreate --autobackup=y --clustered=y vg_cluster /dev/mapper/360a980003830354d66244675306b7343
[pcmk01]# lvcreate --size 512M --name lv_storage vg_cluster

There are cases when we might receive the following error:

connect() failed on local socket: No such file or directory
Internal cluster locking initialisation failed.
WARNING: Falling back to local file-based locking.
Volume Groups with the clustered attribute will be inaccessible.

The above indicates that we have cluster locking enabled, but that the cluster LVM daemon (clvmd) is not running. Make sure it’s started via Pacemaker.
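
A quick way to confirm that the dlm and clvmd clones are running on the node:

[pcmk01]# pcs status resources
[pcmk01]# ps -ef | grep -E '[d]lm_controld|[c]lvmd'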

Create Clustered Filesystem

GFS2 requires one journal for each node in the cluster that needs to mount the file system. For example, if we have a 16-node cluster but need to mount the file system from only two nodes, we need only two journals. If we need to mount from a third node, we can always add a journal with the gfs2_jadd command. GFS2 allows journals to be added on the fly.

The default GFS2 journal size is 128MB. The minimum size is 8MB. Larger journals improve performance, although they use more memory than smaller journals. The default size of 128MB means that if we have a 512MB logical volume and a 3-node cluster, creating a GFS2 filesystem will not work; we'll get an error saying there isn't enough free space:

Failed to create resource group index entry: No space left on device

When determining the number of nodes that our system will contain, there is always a trade-off between high availability and performance. With a larger number of nodes, it becomes increasingly difficult to make workloads scale. For that reason, Red Hat does not support using GFS2 for cluster file system deployments greater than 16 nodes. When deciding on the number of journals, one journal is required for each cluster node which is to mount the GFS2 file system.

It is generally recommended to use the default journal size of 128MB. However, since our file system is very small (only 5GB), having a 128MB journal is simply impractical.

It is also recommended that we do not run a file system that is more than 85 percent full, although this figure may vary depending on workload.

In our particular case: 3 nodes x 32MB journals = 96MB + some space for resource groups = around 100MB.

Create a clustered filesystem with three journals, where the journal size is 32MB. Note that the cluster name must match the one in corosync.conf, as only members of this cluster are permitted to use this filesystem.

Acceptable locking protocols are lock_dlm and, if we are using GFS2 as a local filesystem (one node only), lock_nolock.

[pcmk01]# mkfs.gfs2 -j3 -J32 -t gfs_cluster:gfs2_storage -p lock_dlm /dev/vg_cluster/lv_storage
/dev/vg_cluster/lv_storage is a symbolic link to /dev/dm-3
This will destroy any data on /dev/dm-3
Are you sure you want to proceed? [y/n]y
Device:                    /dev/vg_cluster/lv_storage
Block size:                4096
Device size:               0.50 GB (131072 blocks)
Filesystem size:           0.50 GB (131068 blocks)
Journals:                  3
Resource groups:           5
Locking protocol:          "lock_dlm"
Lock table:                "gfs_cluster:gfs2_storage"
UUID:                      00a3fa40-d95c-904a-fd59-9fe3baa2b283

A couple of ways to check the cluster name:

[pcmk01]# pcs property list cluster-name
Cluster Properties:
 cluster-name: gfs_cluster
# grep name /etc/corosync/corosync.conf
  cluster_name: gfs_cluster

Create a mountpoint:

[ALL]# mkdir -p /cluster/storage

Create Pacemaker Filesystem Resource

It is generally recommended to mount GFS2 file systems with the noatime and nodiratime arguments. This allows GFS2 to spend less time updating disk inodes for every access.

[pcmk01]# pcs resource create gfs2_res01 Filesystem device="/dev/vg_cluster/lv_storage" \
  directory="/cluster/storage" fstype="gfs2" options="noatime,nodiratime,rw" \
  op monitor interval=90s on-fail=fence clone interleave=true

We need to add the _netdev option when we use LVM on a GFS2 filesystem on top of storage provided via the iSCSI protocol. To do so, we will simply update the filesystem resource:

[pcmk01]# pcs resource update gfs2_res01 options="noatime,nodiratime,rw,_netdev"

File systems mounted with the _netdev flag are mounted only after the network has been enabled on the system.

We may optionally want to check the GFS2 file system at boot time by setting the run_fsck parameter of the Filesystem resource.

Let’s check for gfs2 mounts:

[pcmk01]# mount|grep gfs2
/dev/mapper/vg_cluster-lv_storage on /cluster/storage type gfs2 (rw,noatime,nodiratime,seclabel,_netdev)

Caveat: Avoid SELinux on GFS2

As per Red Hat documentation, SELinux is highly recommended for security reasons in most situations, but it is not supported for use with GFS2. SELinux stores information using extended attributes about every file system object. Reading, writing, and maintaining these extended attributes is possible but slows GFS2 down considerably.

We must turn SELinux off on a GFS2 file system (not on the whole server!) when we mount the file system, using one of the context options as described in the mount man page.

The default security context should be unlabeled:

[pcmk01]# ls -dZ /cluster/storage
drwxr-xr-x. root root system_u:object_r:unlabeled_t:s0 /cluster/storage/

The xattr part for GFS2 should be loaded:

[pcmk01]# dmesg|grep xattr
[    2.671313] SELinux: initialized (dev dm-0, type ext4), uses xattr
[    4.623928] SELinux: initialized (dev sda1, type ext2), uses xattr
[   26.107473] SELinux: initialized (dev dm-3, type gfs2), uses xattr

We are going to change the security context to public content; note that only one security context can be applied per filesystem.

Files labeled with the public_content_t type can be read by FTP, Apache, Samba and rsync. Files labeled with public_content_rw_t can also be written to (some services, such as Samba, require booleans to be set before they can write!).
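
As an illustration only (assuming Samba were to write to this filesystem), such a boolean could be enabled with:

[pcmk01]# setsebool -P smbd_anon_write on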

Update the filesystem resource:

[pcmk01]# pcs resource update gfs2_res01 options='noatime,nodiratime,rw,_netdev,context="system_u:object_r:public_content_rw_t:s0"'

Mount options and security context should be changed:

[pcmk01]# mount|grep gfs2
/dev/mapper/vg_cluster-lv_storage on /cluster/storage type gfs2 (rw,noatime,nodiratime,context=system_u:object_r:public_content_rw_t:s0,_netdev)
[pcmk01]# ls -dZ /cluster/storage
drwxr-xr-x. root root system_u:object_r:public_content_rw_t:s0 /cluster/storage/

Reboot the system (or fence it from another node!) and check the kernel message buffer; the xattr part should no longer be loaded:

[pcmk01]# reboot
[pcmk01]# dmesg|grep xattr
[    2.424703] SELinux: initialized (dev dm-0, type ext4), uses xattr
[    3.786307] SELinux: initialized (dev sda1, type ext2), uses xattr

We can always check for other SELinux content available if the current one is not suitable:

# semanage fcontext -l|less

Create Pacemaker Resource Ordering

Now, we need to make sure that the cluster LVM daemon (clvmd) is started before any attempt to mount the GFS2 volume, otherwise our logical device /dev/vg_cluster/lv_storage won't be found, as volume groups with the clustered attribute will be inaccessible. GFS2 must start after clvmd and must run on the same node as clvmd.

[pcmk01]# pcs constraint order start clvmd-clone then gfs2_res01-clone
[pcmk01]# pcs constraint colocation add gfs2_res01-clone with clvmd-clone
[pcmk01]# pcs constraint show
Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start gfs2_res01-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  gfs2_res01-clone with clvmd-clone (score:INFINITY)

How to Check GFS2 Filesystem Offline

All nodes must have the GFS2 filesystem unmounted before running fsck.gfs2. Failure to unmount from all nodes in a cluster will likely result in filesystem corruption.

[pcmk01]# pcs resource disable --wait=5 gfs2_res01
[pcmk01]# fsck.gfs2 /dev/vg_cluster/lv_storage
[pcmk01]# pcs resource enable gfs2_res01

How to Grow GFS2 Filesystem Online

The gfs2_grow command is used to expand a GFS2 filesystem after the device upon which the filesystem resides has also been expanded.

Let us expand LVM first:

[pcmk01]# lvextend --size +512M /dev/vg_cluster/lv_storage

We may only run gfs2_grow on a mounted filesystem as an expansion of unmounted filesystems is not supported. We only need to run gfs2_grow on one node in the cluster.

Note that we can run gfs2_grow with the -T flag to get a display of the current state of a mounted GFS2 filesystem.

[pcmk01 ~]# gfs2_grow /dev/vg_cluster/lv_storage
# df -hT /cluster/storage
Filesystem                        Type  Size  Used Avail Use% Mounted on
/dev/mapper/vg_cluster-lv_storage gfs2  1.0G  100M  925M  10% /cluster/storage

How to Add Journals to GFS2

The number of journals can be fetched by gfs2_edit -p jindex. Do not execute this command when the file system is mounted.

[pcmk01]# gfs2_edit -p jindex /dev/vg_cluster/lv_storage|grep journal
   3/3 [fc7745eb] 1/18 (0x1/0x12): File    journal0
   4/4 [8b70757d] 2/8231 (0x2/0x2027): File    journal1
   5/5 [127924c7] 3/16444 (0x3/0x403c): File    journal2

If a GFS2 file system is full, gfs2_jadd will fail, even if the logical volume containing the file system has been extended and is larger than the file system. This is because in a GFS2 file system journals are plain files rather than embedded metadata, so simply extending the underlying logical volume will not provide space for the journals.

# gfs2_jadd -j 1 /dev/vg_cluster/lv_storage
Filesystem: /dev/vg_cluster/lv_storage
Old journals: 3
New journals: 4
[pcmk01]# gfs2_edit -p jindex /dev/vg_cluster/lv_storage|grep journal
   3/3 [fc7745eb] 1/18 (0x1/0x12): File    journal0
   4/4 [8b70757d] 2/8231 (0x2/0x2027): File    journal1
   5/5 [127924c7] 3/16444 (0x3/0x403c): File    journal2
   6/6 [657e1451] 4/131340 (0x4/0x2010c): File    journal3

How to Display, Print or Edit GFS2 or GFS Internal Structures

The gfs2_edit command is a tool used to examine, edit or display internal data structures of a GFS2 or GFS file system.
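
For example, to print the superblock of our filesystem (a read-only operation):

[pcmk01]# gfs2_edit -p sb /dev/vg_cluster/lv_storage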

References

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Clusters_from_Scratch/index.html
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html
http://download.parallels.com/doc/psbm/v5/rtm/Deploying_Clusters_in_Parallels-Based_Systems/30776.htm
http://selinuxproject.org/page/FileStatements
https://www.redhat.com/archives/linux-cluster/2012-January/msg00099.html

30 thoughts on “Active/Active High Availability Pacemaker Cluster with GFS2 and iSCSI Shared Storage on CentOS 7”

  1. Hello, I have created a demo like your post, but when I want to fence one node I get this error: “Command failed: No route to host”.
    Have you any idea why I am getting this error? I have done everything like in your post.

    This is the error log from the /var/log/messages file.

    Jun 17 07:53:45 node1 stonith-ng[3008]: notice: Client stonith_admin.18958.56c576c7 wants to fence (reboot) ‘Node2’ with device ‘(any)’
    Jun 17 07:53:45 node1 stonith-ng[3008]: notice: Initiating remote operation reboot for Node2: 897fe5e8-a7ec-4f7b-8d61-bf6367c4b4b6 (0)
    Jun 17 07:53:45 node1 stonith-ng[3008]: notice: my_vcentre-fence can fence (reboot) Node2 (aka. ‘vmnode2’): static-list
    Jun 17 07:53:45 node1 stonith-ng[3008]: notice: my_vcentre-fence can fence (reboot) Node2 (aka. ‘vmnode2’): static-list
    Jun 17 07:53:50 node1 fence_vmware_soap: Failed: Unable to obtain correct plug status or plug is not available

    Jun 17 07:53:56 node1 fence_vmware_soap: Failed: Unable to obtain correct plug status or plug is not available
    Jun 17 07:53:56 node1 stonith-ng[3008]: error: Operation 'reboot' [18966] (call 2 from stonith_admin.18958) for host 'Node2' with device 'my_vcentre-fence' returned: -201 (Generic Pacemaker error)
    Jun 17 07:53:56 node1 stonith-ng[3008]: warning: my_vcentre-fence:18966 [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
    Jun 17 07:53:56 node1 stonith-ng[3008]: warning: my_vcentre-fence:18966 [ InsecureRequestWarning) ]
    Jun 17 07:53:56 node1 stonith-ng[3008]: warning: my_vcentre-fence:18966 [ Failed: Unable to obtain correct plug status or plug is not available ]
    Jun 17 07:53:56 node1 stonith-ng[3008]: warning: my_vcentre-fence:18966 [ ]
    Jun 17 07:53:56 node1 stonith-ng[3008]: warning: my_vcentre-fence:18966 [ ]
    Jun 17 07:54:19 node1 stonith-ng[3008]: notice: Couldn’t find anyone to fence (reboot) Node2 with any device
    Jun 17 07:54:19 node1 stonith-ng[3008]: error: Operation reboot of Node2 by for [email protected]: No route to host
    Jun 17 07:54:19 node1 crmd[3012]: notice: Peer Node2 was not terminated (reboot) by for node1: No route to host (ref=897fe5e8-a7ec-4f7b-8d61-bf6367c4b4b6) by client stonith_admin.18958

    • Your fencing device isn’t configured properly:

      fence_vmware_soap: Failed: Unable to obtain correct plug status or plug is not available

      Does it fail when calling fence_vmware_soap from the command line? Try it and see if you get a list of nodes available. Ensure the username and password are correct.

    • I had the same issue. In case anyone else stumbles onto this: I resolved it by populating the pcmk_host_list with the VMware UUIDs of the nodes as opposed to names.

  2. Hi, I can get the resources created and the LV created, but the LV remains inactive on the 2nd node (2-node cluster). I can mount and access the LV on the first node but it will not activate on the second. I have tried “lvchange -ay” to no avail.

    • The exact error on “pcs status”:

      Failed Actions:
      * gfs2_res01_start_0 on kvm-node2.coobe.local ‘not installed’ (5): call=72, status=complete, exitreason=’Couldn’t find device [/dev/cluster_vm_group/cluster_vm_volume]. Expected /dev/??? to exist’,
      last-rc-change=’Tue Oct 18 10:03:19 2016′, queued=0ms, exec=60ms

      If I do a “lvs” on node2, the volume and volume group show up, “lvscan” shows the lv as inactive so I’m assuming that’s the cause for the error.

    • OK, one of the cluster nodes (kvm-node2) is unable to see the shared storage assigned to the OS, so I take it the shared storage devices were not shared correctly between the cluster nodes.

  3. You may be on to something.

    fdisk -l | grep Disk on node1:
    Disk /dev/mapper/mpatha: 644.2 GB, 644245094400 bytes, 1258291200 sectors
    Disk /dev/mapper/rhel-home: 21.5 GB, 21466447872 bytes, 41926656 sectors
    Disk /dev/mapper/cluster_vm_group-cluster_vm_volume: 644.2 GB, 644240900096 bytes, 1258283008 sectors

    fdisk -l | grep Disk on node2:
    Disk /dev/mapper/mpatha: 644.2 GB, 644245094400 bytes, 1258291200 sectors
    Disk /dev/mapper/rhel-home: 84.0 GB, 84041269248 bytes, 164143104 sectors

    /dev/mapper/mpatha is where I created the physical volume/volume group from.

    However:

    vgs on node2:
    VG #PV #LV #SN Attr VSize VFree
    cluster_vm_group 1 1 0 wz--nc 600.00g 0
    rhel 1 3 0 wz--n- 136.21g 64.00m

    lvs on node2:

    LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
    cluster_vm_volume cluster_vm_group -wi------- 600.00g
    home rhel -wi-ao---- 78.27g
    root rhel -wi-ao---- 50.00g
    swap rhel -wi-ao---- 7.88g

    So from that standpoint it looks like it can “see” the volume?

    • If your fdisk output is correct, then the second node (node2) cannot see /dev/mapper/cluster_vm_group-cluster_vm_volume. Why is that? What does “multipath -ll” show on both nodes?

    • Node1:
      mpathb (3600143801259b8b40001100000190000) dm-2 HP ,HSV340
      size=800G features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
      |-+- policy=’service-time 0′ prio=50 status=active
      | |- 1:0:2:2 sdh 8:112 active ready running
      | |- 1:0:3:2 sdj 8:144 active ready running
      | |- 2:0:2:2 sdp 8:240 active ready running
      | `- 2:0:3:2 sdr 65:16 active ready running
      `-+- policy=’service-time 0′ prio=10 status=enabled
      |- 1:0:0:2 sdd 8:48 active ready running
      |- 2:0:1:2 sdn 8:208 active ready running
      |- 2:0:0:2 sdl 8:176 active ready running
      `- 1:0:1:2 sdf 8:80 active ready running
      mpatha (3600143801259b8b400011000000d0000) dm-3 HP ,HSV340
      size=600G features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
      |-+- policy=’service-time 0′ prio=50 status=active
      | |- 2:0:1:1 sdm 8:192 active ready running
      | |- 2:0:0:1 sdk 8:160 active ready running
      | |- 1:0:1:1 sde 8:64 active ready running
      | `- 1:0:0:1 sdc 8:32 active ready running
      `-+- policy=’service-time 0′ prio=10 status=enabled
      |- 1:0:3:1 sdi 8:128 active ready running
      |- 1:0:2:1 sdg 8:96 active ready running
      |- 2:0:2:1 sdo 8:224 active ready running
      `- 2:0:3:1 sdq 65:0 active ready running

      Node2:
      mpathb (3600143801259b8b40001100000190000) dm-2 HP ,HSV340
      size=800G features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
      |-+- policy=’service-time 0′ prio=50 status=active
      | |- 1:0:2:2 sdh 8:112 active ready running
      | |- 1:0:3:2 sdj 8:144 active ready running
      | |- 2:0:3:2 sdr 65:16 active ready running
      | `- 2:0:2:2 sdp 8:240 active ready running
      `-+- policy=’service-time 0′ prio=10 status=enabled
      |- 1:0:1:2 sdf 8:80 active ready running
      |- 2:0:0:2 sdl 8:176 active ready running
      |- 2:0:1:2 sdn 8:208 active ready running
      `- 1:0:0:2 sdd 8:48 active ready running
      mpatha (3600143801259b8b400011000000d0000) dm-3 HP ,HSV340
      size=600G features=’1 queue_if_no_path’ hwhandler=’0′ wp=rw
      |-+- policy=’service-time 0′ prio=50 status=active
      | |- 1:0:0:1 sdc 8:32 active ready running
      | |- 1:0:1:1 sde 8:64 active ready running
      | |- 2:0:0:1 sdk 8:160 active ready running
      | `- 2:0:1:1 sdm 8:192 active ready running
      `-+- policy=’service-time 0′ prio=10 status=enabled
      |- 1:0:2:1 sdg 8:96 active ready running
      |- 1:0:3:1 sdi 8:128 active ready running
      |- 2:0:2:1 sdo 8:224 active ready running
      `- 2:0:3:1 sdq 65:0 active ready running

  4. Hi Tomas,

    Thanks for your guide, I have a question: is an iSCSI storage server required for this post? Every guide I saw had one iSCSI storage server and two or more nodes :(
    Please respond as soon as possible!
    Thanks and regards!

    • But your post has no section about the iSCSI server :( Can you update this guide with the installation and configuration of an iSCSI server?

      Thank you so much !

    • iSCSI server setup is a separate subject, and beyond the scope of the article really, with the exception of several references to the NetApp SAN that we use for iSCSI storage. It is expected that you have a working iSCSI server already, apologies if this got you confused. I’ve updated the article to make it clear. If you need to create an iSCSI server from scratch, you may want to take a look at this post about setting up an iSCSI target on RHEL 7.

      Once again, the post is about creating an active/active Pacemaker cluster on CentOS 7, if you’re setting such cluster up, you need to have some shared storage before you start.

    • Hi Tomas,

      I’m confused by the step “NetApp Data ONTAP SAN: Create a LUN on an iSCSI Server”. Can you detail this step?

      Thanks and Regard !

    • Hi Bruce, sharing is caring, thanks very much! I’ve taken a look at your video tutorials, and they have a fair amount of useful info, however, quite a few bits (package installation etc) could’ve been fast-forwarded. Most people who take onto RHCA have RHCE level skills.

  5. Can anybody help me from here?
    Stack: corosync
    Current DC: cloud2.cubexs.net.pk (version 1.1.16-12.el7_4.5-94ff4df) – partition with quorum
    Last updated: Fri Dec 8 17:06:57 2017
    Last change: Fri Dec 8 16:59:36 2017 by root via cibadmin on cloud1.cubexs.net.pk

    2 nodes configured
    5 resources configured

    Online: [ cloud1.cubexs.net.pk cloud2.cubexs.net.pk ]

    Full list of resources:

    VirtIP (ocf::heartbeat:IPaddr2): Started cloud1.cubexs.net.pk
    Httpd (ocf::heartbeat:apache): Stopped
    Master/Slave Set: DrbdDataClone [DrbdData]
    Masters: [ cloud1.cubexs.net.pk ]
    Slaves: [ cloud2.cubexs.net.pk ]
    DrbdFS (ocf::heartbeat:Filesystem): Stopped

    Failed Actions:
    * DrbdFS_start_0 on cloud1.cubexs.net.pk ‘not configured’ (6): call=286, status=complete, exitreason=’none’,
    last-rc-change=’Fri Dec 8 17:05:42 2017′, queued=0ms, exec=340ms

    Daemon Status:
    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: active/disabled

    • Just to clarify, this article is for GFS2, it does not use DRBD. Many things can go wrong while configuring clustering, therefore it’s hard to tell what the exact problem was in your case. It is however evident that the DRBD resource isn’t running, therefore you may want to re-do the configuration and check for any DRBD-related errors.

  6. Hi Tomas,

    I have checked that DRBD is running successfully. When I fail over manually, the DRBD service and the other services work fine, but via Pacemaker DRBD does not fail over automatically. Can you please tell me where the issue is?

  7. Hi Tomas, can you give an example of a working environment for an active/active cluster? Can you deploy LAMP on your cluster?
