PRIMECLUSTER Installation and Administration Guide 4.2 (Linux)
3.1 Installation and Setup of Related Software
If heartbeat monitoring fails because of a node failure, PRIMECLUSTER shutdown facility removes the failed node. If this occurs during crash dump collection, you might not be able to acquire information for troubleshooting.
The cluster high-speed failover function prevents node elimination during crash dump collection and, at the same time, enables the ongoing operations of the failed node to be quickly taken over by another node.
The crash dump collection facility varies depending on the version of RHEL being used.
Version of Red Hat Enterprise Linux | Crash dump collection facility
---|---
RHEL-AS3 / RHEL-ES3 | Netdump
RHEL-AS3 batch correction U05011 / RHEL-ES3 batch correction U05011 | Netdump or Diskdump
RHEL-AS4 batch correction U05111 | Diskdump
As shown in the above figure, when a heartbeat failure occurs, the cluster high-speed failover function sets and references the panic status on the Netdump server. The node that detects the heartbeat error regards the failed node as being Offline, without forcibly powering off the node whose crash dump is being output, and takes over its transactions.
If the Netdump server stops, a crash dump cannot be collected, and the node will be forcibly shut down by the RSB shutdown agent of another node in the cluster system.
If an error (such as a network failure) occurs between the Netdump client and the Netdump server during crash dump collection, the crash dump cannot be collected and the node will be forcibly shut down by the RSB shutdown agent of another node in the cluster system.
If you reset either the Netdump server or the panicked node during crash dump collection, the crash dump will not be collected correctly. Therefore, do not perform a reset during crash dump collection.
The operation of the panicked node after crash dump collection is determined by the Netdump settings.
Netdump cannot be used with Diskdump.
You must prepare another node, independent of the cluster nodes, to be used as the Netdump server. It must be connected to the dedicated LAN used for Netdump. For example, when you build a cluster system configured with four nodes, you must prepare a total of five nodes, one of which will be used as the Netdump server.
To use the Netdump function, you must first set up the Netdump server and the Netdump clients.
Confirming the Netdump function
Confirm that the Netdump server function is available. If not, enable it.
Use the "runlevel(8)" command and the "chkconfig(8)" command to confirm the operation.
Confirm the current run level with the "runlevel(8)" command.
(Example) When the following is given, the current run level is 3.
# /sbin/runlevel
N 3
Confirm whether the Netdump server function is available with the "chkconfig(8)" command.
(Example) When the following is given, the Netdump server function at the current run level 3 is Off.
# /sbin/chkconfig --list netdump-server
netdump-server 0:Off 1:Off 2:Off 3:Off 4:Off 5:Off 6:Off
If the Netdump server function is Off at the current run level, change it to On with the "chkconfig(8)" command.
# /sbin/chkconfig netdump-server on
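If you want to verify the change, the same "chkconfig(8)" command can be used to list the state again. The output below is only a sketch; the run levels at which the service is turned On depend on the chkconfig defaults of your RHEL version.
# /sbin/chkconfig --list netdump-server
netdump-server 0:Off 1:Off 2:On 3:On 4:On 5:On 6:Off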
Confirming the NFS function
The Netdump shutdown agent uses NFS. Confirm that NFS is available. If it is not, enable it.
Use the "runlevel(8)" command and the "chkconfig(8)" command to confirm the operation.
Confirm the current run level with the "runlevel(8)" command.
(Example) When the following is given, the current run level is 3.
# /sbin/runlevel
N 3
Confirm whether the NFS function is available with the "chkconfig(8)" command.
(Example) When the following is given, the NFS function at the current run level 3 is Off.
# /sbin/chkconfig --list nfs
nfs 0:Off 1:Off 2:Off 3:Off 4:Off 5:Off 6:Off
If the NFS function is Off at the current run level, change it to On with the "chkconfig(8)" command.
# /sbin/chkconfig nfs on
Setting to avoid rebooting
By default, Netdump reboots a node after its crash dump has been collected. To prevent the node from rebooting after dump collection, set the following in "/etc/netdump.conf":
noreboot=true
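As a simple check that the setting is in place, you can search for the entry in the file (this assumes the default location "/etc/netdump.conf" shown above):
# grep noreboot /etc/netdump.conf
noreboot=true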
Setting the NFS function
Set up the following in "/etc/exports."
/var/crash/log/netdump_status NodeA(ro,no_root_squash) NodeB(ro,no_root_squash)
In "/var/crash/log/netdump_status," describe all mountable nodes that constitute the cluster system.
Specify the host names of the nodes that constitute the cluster system in NodeA and NodeB.
(Example) When there are three nodes constituting the cluster system, namely, NodeA, NodeB, and NodeC
/var/crash/log/netdump_status NodeA(ro,no_root_squash) NodeB(ro,no_root_squash) NodeC(ro,no_root_squash)
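Optionally, before rebooting, the export definition can be re-read and listed with the "exportfs(8)" command on the Netdump server. This is only a convenience check; the procedure below applies the setting by rebooting the system.
# /usr/sbin/exportfs -ra
# /usr/sbin/exportfs -v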
Rebooting the system
Reboot the system.
# shutdown -r now
Confirming the NFS function
Confirm that NFS is available. If it is not, enable it. This operation must be executed on all the nodes that constitute the cluster system.
Use the "runlevel(8)" command and the "chkconfig(8)" command to confirm the operation.
Confirm the current run level with the "runlevel(8)" command.
(Example) When the following is given, the current run level is 3.
# /sbin/runlevel
N 3
Confirm whether the NFS function is available with the "chkconfig(8)" command.
(Example) When the following is given, the NFS function at the current run level 3 is Off.
# /sbin/chkconfig --list nfs
nfs 0:Off 1:Off 2:Off 3:Off 4:Off 5:Off 6:Off
If the NFS function is Off at the current run level, change it to On with the "chkconfig(8)" command.
# /sbin/chkconfig nfs on
Setting the NFS function
This operation must be executed on all the nodes that constitute the cluster system.
Create the NFS mount point.
Create the mount point (/var/crash/panicinfo) as follows.
# mkdir -m 0444 -p /var/crash/panicinfo
Set /etc/fstab.
Set up the following in "/etc/fstab."
Netdump_server:/var/crash/log/netdump_status /var/crash/panicinfo nfs ro,fg,soft,noac 0 0
Specify the IP address or host name of the Netdump server in place of Netdump_server.
If a host name is specified, register the IP address of the Netdump server in "/etc/hosts."
(Example)
Node0:/var/crash/log/netdump_status /var/crash/panicinfo nfs ro,fg,soft,noac 0 0
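When a host name such as Node0 is used in "/etc/fstab," an entry similar to the following is also needed in "/etc/hosts" on each cluster node. The address 192.168.10.10 is only a placeholder; use the actual IP address of your Netdump server.
192.168.10.10    Node0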
Rebooting the system
Reboot the system.
This operation must be executed on all the nodes that constitute the cluster system.
# shutdown -r now
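After the reboot, you can confirm on each node that the export of the Netdump server is mounted on "/var/crash/panicinfo," for example with the "mount(8)" command. The mount options shown in the output depend on your environment.
# mount | grep /var/crash/panicinfo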
As shown in the above figure, when a heartbeat monitoring failure occurs, the cluster high-speed failover function sets and references the panic status through the RSB or BMC (Baseboard Management Controller). The node that detects the failure can regard the other node as stopped and takes over the ongoing operations without eliminating the node that is collecting the crash dump.
If you reboot the node that is collecting the crash dump, crash dump collection will fail.
After the panicked node finishes collecting the crash dump, its behavior is determined by the Diskdump settings.
Diskdump cannot be used with Netdump.
Configure Diskdump
When using Diskdump, you must configure it.
Check Diskdump
Check whether Diskdump is available. If not, enable it using the "runlevel(8)" and "chkconfig(8)" commands.
Check the current run level using the "runlevel(8)" command.
(Example)
# /sbin/runlevel
N 3
The above example shows that the run level is 3.
Check whether Diskdump is available using the "chkconfig(8)" command.
(Example)
# /sbin/chkconfig --list diskdump
diskdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
The above example shows that Diskdump is off at the current run level 3.
If Diskdump is off, enable it by executing the "chkconfig(8)" command.
# /sbin/chkconfig diskdump on
Then, start it by executing the service command.
# /sbin/service diskdump start
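To verify the result, the "chkconfig(8)" command can be used to list the state again. The output below is only a sketch; the run levels at which the service is turned on depend on the chkconfig defaults of your environment.
# /sbin/chkconfig --list diskdump
diskdump 0:off 1:off 2:on 3:on 4:on 5:on 6:off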
After you have finished configuring the Netdump shutdown agent or the Diskdump shutdown agent, set up the remote service board (RSB), IPMI (Intelligent Platform Management Interface), or BLADE server.
Set the following for the remote service board (RSB):
User ID
Password
IP address
For details, see the operation manual provided with the remote service board and the "ServerView User Guide."
Set the following for the IPMI user:
User ID
Password
IP address
For details, see the "User Guide" provided with the hardware and the "ServerView User Guide."
Set the following for the BLADE server:
Install ServerView
Set SNMP community
Set an IP address of the management blade
For details, see the operation manual provided with the hardware and the "ServerView User Guide."