This section describes the corrective actions to take when a resource failure occurs.
If a failure occurs in a resource, you can use the functions of PRIMECLUSTER and the operating system to detect the failure and identify the faulted resource that caused it.
The labels (a) to (g) used in the descriptions below refer to the entries in the "Failure confirmation features list" table given below.
Failure detection
Normally, the RMS main window (a) is used to monitor the cluster applications.
If a failure occurs in a resource or the system
Failover of the userApplication or a node panic will occur.
In such a case, you can detect the failure by observing the following conditions:
The color of the icons in the RMS main window (a) changes.
A message is output to the MSG main window (c), syslog (f), and the console (g).
If a warning-level failure occurs in the system
If a warning-level failure (for example, insufficient disk space or insufficient swap area) occurs in the system, you can detect the failure by observing the following conditions:
A message is output to syslog (f) and the console (g).
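Alongside watching syslog and the console, the warning-level conditions named above (insufficient disk space and insufficient swap area) can be checked directly. The sketch below is illustrative, not part of PRIMECLUSTER: the 90% threshold is an arbitrary example, and the `/proc/meminfo` swap check assumes a Linux node.

```shell
#!/bin/sh
# Sketch: check for the warning-level conditions mentioned above.
# The 90% threshold is an example value, not a PRIMECLUSTER default.

# Report file systems above 90% usage (portable "df -P" output).
df -P | awk 'NR > 1 { use = $5 + 0
    if (use > 90) printf "WARNING: %s is %d%% full\n", $6, use }'

# Report free swap in kilobytes (Linux; reads /proc/meminfo).
awk '/SwapFree/ { printf "Swap free: %d kB\n", $2 }' /proc/meminfo
```

Running such a check periodically (for example from cron) lets you catch a warning-level condition before it escalates to a resource failure.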
userApplication is not started at the startup of RMS
If RMS fails to start on all the nodes, the userApplication will not start. You can start the userApplication by executing the "clreply" command: it lets you confirm any operator intervention request to which no response has been entered, and start the userApplication by responding to it. For information on the "clreply" command, see the manual pages.
The operator intervention request message will be output to syslog (f) and the console (g). You can start the userApplication by responding to this message.
For further details, see "4.2 Operator Intervention Messages" in "PRIMECLUSTER Messages."
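Unanswered operator intervention requests can be located by searching syslog for the message numbers cited below (1421 to 1423). The sketch assumes the log file is `/var/log/messages`, as described later in this section; respond to each request found with the "clreply" command, whose exact options are given in its manual page.

```shell
#!/bin/sh
# Sketch: list recent operator intervention request messages in syslog.
# /var/log/messages is the log path used elsewhere in this section;
# adjust it if your syslog configuration differs.
grep -E '1421|1422|1423' /var/log/messages | tail -n 5

# Respond to each listed request with clreply; see the clreply manual
# page for the exact invocation (not shown here, as the options depend
# on the message).
```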
Note
If there are multiple operator intervention request messages for which no response has yet been entered, you need to respond to each of them.
In addition, you can use the features described in "Failure confirmation features list" to detect the failure.
Cause identification
You can also use the function that detected the failure and the features listed in "Failure confirmation features list" below to identify the faulted resource that caused the failure.
| | Failure confirmation features | Manual reference |
|---|---|---|
| (a) | RMS main window | |
| (b) | CF main window | |
| (c) | MSG main window (to display this screen, select the msg tab in the Cluster Admin screen) | - |
| (d) | Application log | |
| (e) | switchlog | |
| (f) | syslog | - |
| (g) | Console * | PRIMECLUSTER Messages |
| (h) | GDS GUI | PRIMECLUSTER Global Disk Services Configuration and Administration Guide |
Note
Console
The operator intervention request messages (message numbers: 1421, 1423), incurred when RMS is not started on all the nodes, are displayed only when yes(1) is set for the AutoStartUp attribute of the userApplication. For information on the userApplication attribute, see "Appendix D Attributes" in "PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools Configuration and Administration Guide."
The operator intervention request messages (message numbers: 1422, 1423) and the error resource messages incurred after a resource or system error occurs are displayed only when yes(1) is set for the PersistentFault attribute of the userApplication. For information on the userApplication attribute, see "Appendix D Attributes" in "PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools Configuration and Administration Guide."
The operator intervention request and error resource messages are displayed by using the "clwatchlogd" daemon to monitor switchlog. You need to send the SIGHUP signal to clwatchlogd when you change the value of RELIANT_LOG_PATH that is defined in the "hvenv.local" file. When clwatchlogd receives this signal, it acquires the latest value of RELIANT_LOG_PATH. After you change RELIANT_LOG_PATH, you must restart RMS.
Note
When you check the message of a resource failure, a resource with the "MONITORONLY" attribute may be in the fault state even if the cluster application is in the Offline state. Check whether there are any resources in the fault state. In particular, check that no Fsystem resources are in the fault state.
Take the following steps for failed resources:
Correct the faulted resource
Correct the problem in the failed resource. For details, see "PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools Configuration and Administration Guide."
Note
If you are using an operation management product other than a PRIMECLUSTER product, you may need to take corrective actions prescribed for that product.
For details, see the manual provided with each operation management product.
(Example) Symfoware
Recover the cluster application
At the RMS main window, check the state of the cluster application to which the corrected resource is registered. If the cluster application is in the Faulted state, execute the Fault clear operation.
For details on the Fault clear operation, see "7.2.2.4 Bringing Faulted Cluster Application to available state."
Clear the fault trace of the failed resource
Clear the fault trace of the failed resource. For more information, refer to "7.2.3.3 Clearing Fault Traces of Resources."
The following problems can cause cluster interconnect failures.
Hardware error
Error on LAN card, hub, or cable
Connection error
Network configuration error
Configuration error on IP address, netmask, or routing information, etc.
Contact your system administrator about network configuration errors. The following section describes how to fix hardware-related errors.
If any heartbeat error on the cluster interconnect is detected, either of the following messages will be output to the /var/log/messages file.
"CF: Problem detected on cluster interconnect NIC_NAME to node NODE_NAME: missing heartbeat replies. (CODE)" "CF: Problem detected on cluster interconnect NIC_NAME to node NODE_NAME: ICF route marked down. (CODE)"
"NIC_NAME" indicates the network interface card on which the error is detected.
"NODE_NAME" indicates the CF node name on which the error is detected.
"CODE" indicates the necessary information to determine the cause.
When either of the above messages is output to the file, follow the steps below.
Corrective action
Determining the failed node
Confirm that each device is working properly. You can also use the ping command to determine the failed node and its location.
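The ping check above can be scripted so that every interconnect address is probed in one pass. The addresses in the list are placeholders for your cluster's interconnect IPs, and the `-W` timeout option assumes a Linux `ping`.

```shell
#!/bin/sh
# Sketch: probe each interconnect address to narrow down the failed
# component. Replace the placeholder addresses with your cluster's
# interconnect IPs.
for addr in 192.168.1.1 192.168.1.2; do
    if ping -c 2 -W 2 "$addr" > /dev/null 2>&1; then
        echo "$addr: reachable"
    else
        echo "$addr: NOT reachable - check the card, cable, and hub on this path"
    fi
done
```

An address that does not respond points at the card, cable, or hub on that path; combine the result with a visual check of each device.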
Note
If an error occurs on the entire cluster interconnect (all interconnects for every node), the cluster system forcibly shuts down all the nodes except the one with the highest survival priority.
For details on survival priority, see "5.1.2 Setting up the Shutdown Facility."
If an error occurs on an active node (for example, a LAN card error on a node where an active cluster application resides), you must stop the node before fixing it. To minimize the downtime, be sure to follow the steps below before performing "Step 2. Performing maintenance tasks."
Stopping a node in the "Online" state
Before performing the maintenance task, stop the node on which the "Online" cluster application resides.
Starting the forcefully terminated node
Start the node that was forcefully terminated by the cluster system and bring the cluster application back to the "Online" state. For details on how to start a cluster application, see "7.2.1.1 Starting RMS."
Be sure to check that the node described in Step 1, "Stopping a node in the 'Online' state," is completely stopped before performing this step.
Performing maintenance tasks
After determining the cause of the error, perform the following maintenance task depending on the category of error.
Note
For a LAN card error, the failed node must be stopped to perform the maintenance task.
For an error on cables or hubs, you can perform the maintenance task with the node being active.
When the error was caused by your LAN card or cable
If the cable is unplugged, plug it in properly.
If the cable is plugged in properly, the LAN card might be the cause. Contact field engineers.
When the error was caused by a hub
If the power is off, push the power button.
If the power is on, the hub may have failed. Contact field engineers.
Recovery
To recover from a partial failure of the cluster interconnect, skip to "Step 2. Cluster interconnect recovery" below.
Starting all the nodes
Start all the nodes.
Cluster interconnect recovery
Use the ping command to confirm that the nodes can communicate with each other through the failed cluster interconnect.
After confirming that the cluster interconnect is recovered successfully, clear the "Faulted" state of the cluster application as necessary. For details on the operation, see "7.2.2.4 Bringing Faulted Cluster Application to available state."