7.4.1.1 Failure Detection and Cause Identification if a Failure Occurs

PRIMECLUSTER Installation and Administration Guide 4.2 (Linux for Itanium)


If a failure occurs in a resource, you can use the functions of PRIMECLUSTER and the operating system to detect the failure and identify the faulted resource that caused the failure.

The labels (a) to (h) used below refer to the entries in the "Failure confirmation features list" later in this section.

Failure detection

Normally, the RMS main window (a) is used to monitor the cluster applications.

If a failure occurs in a resource or the system

Failover of the userApplication or node panic will occur.

In such a case, you can detect the failure by observing the following conditions:
- The color of the icons in the RMS main window (a) changes.
- A message is output to the MSG main window (c), Syslog (f), and the console (g).

If a warning-level failure occurs in the system

If a warning-level failure (for example, insufficient disk space or insufficient swap area) occurs in the system, you can detect the failure by observing the following conditions:
- A message is output to Syslog (f) and the console (g).

If RMS is not started on all the nodes, the userApplication will not start. In this case, you can start the userApplication by responding to the operator intervention request with the "clreply" command.
- Executing the "clreply" command lets you check for operator intervention requests that have not yet been answered and start the userApplication by responding to them. For information on the "clreply" command, see the manual pages.
- The operator intervention request message is output to Syslog (f) and the console (g). Responding to the message starts the userApplication.

For further details, see "Operator Intervention Messages."

If there are multiple operator intervention request messages for which no response has yet been entered, you need to respond to each of them.
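When several requests are pending, it can help to pull the relevant lines out of a saved copy of the syslog output first. The following is a minimal sketch, not part of PRIMECLUSTER: it assumes the message number (1421, 1422, or 1423, as listed in the Console note of this section) appears in each line as a standalone four-digit token. The function name and the sample lines are hypothetical, and the real message layout may differ.

```python
import re

# Operator intervention message numbers named in this section.
OPERATOR_INTERVENTION_IDS = {"1421", "1422", "1423"}


def find_intervention_requests(lines):
    """Return (message_number, line) pairs for lines mentioning one of the IDs."""
    hits = []
    for line in lines:
        # Assumption: the number is a standalone four-digit token. An unrelated
        # four-digit value equal to one of the IDs (a PID, for example) would
        # also match, so review the results by eye.
        for token in re.findall(r"\b\d{4}\b", line):
            if token in OPERATOR_INTERVENTION_IDS:
                hits.append((token, line.rstrip("\n")))
                break
    return hits


# Hypothetical sample lines; the tag and message text are placeholders.
sample_lines = [
    "Jan  5 10:00:01 node1 example-tag: 1421 (placeholder request text)",
    "Jan  5 10:00:02 node1 example-tag: unrelated message",
    "Jan  5 10:00:03 node1 example-tag: 1423 (placeholder request text)",
]

for number, text in find_intervention_requests(sample_lines):
    print(number, text)
```

Each line returned corresponds to one request that may still need a response through "clreply".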

In addition, you can use the features described in "Failure confirmation features list" to detect the failure.

Cause identification

You can also use the function that detected the failure and the features listed in "Failure confirmation features list" below to identify the faulted resource that caused the failure.

Failure confirmation features list

Label	Failure confirmation feature	Description	Manual reference
(a)	RMS main window	The RMS tree and the RMS cluster table can be used from this screen.	"RMS Main Window"
(b)	CF main window	The CF tree can be used from this screen.	"CF Main Window"
(c)	MSG main window	The cluster control messages can be viewed in this screen. To display this screen, select the msg tab in the Cluster Admin screen.	-
(d)	Application log	-	"Viewing application logs"
(e)	switchlog	-	"Viewing switchlogs"
(f)	Syslog	-	-
(g)	Console *	Messages that are displayed on the console can be checked. Viewing the "console problem" information on the console can help you identify the cause of the fault.	"Messages"
(h)	GDS GUI	-	"PRIMECLUSTER Global Disk Services Configuration and Administration Guide."
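
The detection conditions described under "Failure detection" can be summarized as a small lookup from the kind of event to the features in this list where it shows up. This is only an illustrative sketch: the event keys and the function name are hypothetical, while the label-to-feature mapping follows the list above.

```python
# Labels and feature names as given in the failure confirmation features list.
FEATURES = {
    "a": "RMS main window",
    "b": "CF main window",
    "c": "MSG main window",
    "d": "Application log",
    "e": "switchlog",
    "f": "Syslog",
    "g": "Console",
    "h": "GDS GUI",
}

# Where each kind of event is reported, per the "Failure detection" text.
# The dictionary keys are hypothetical names chosen for this sketch.
DETECTION_CHANNELS = {
    "resource_or_system_failure": ["a", "c", "f", "g"],
    "warning_level_failure": ["f", "g"],
    "operator_intervention_request": ["f", "g"],
}


def where_to_look(event):
    """Return the labelled features in which the given event is reported."""
    return [f"({key}) {FEATURES[key]}" for key in DETECTION_CHANNELS[event]]


print(where_to_look("resource_or_system_failure"))
```

The remaining features, such as the application log (d) and switchlog (e), are not tied to a specific condition in the text and are used for cause identification.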

* Console

- The operator intervention request messages (message numbers: 1421, 1423) that are issued when RMS is not started on all nodes are displayed only when yes(1) is set for the AutoStartUp attribute of the userApplication.
- The operator intervention request messages (message numbers: 1422, 1423) and the error resource messages that are issued after a resource or system error occurs are displayed only when yes(1) is set for the PersistentFault attribute of the userApplication. For information on these userApplication attributes, see "8 Appendix - Object types" in the "PRIMECLUSTER Reliant Monitor Service (RMS) with Wizard Tools Configuration and Administration Guide."
- The operator intervention request and error resource messages are displayed by the "clwatchlogd" daemon, which monitors switchlog. If you change the value of RELIANT_LOG_PATH defined in the "hvenv.local" file, send the SIGHUP signal to clwatchlogd so that it acquires the latest value of RELIANT_LOG_PATH. After you change RELIANT_LOG_PATH, you must start RMS.
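
Sending SIGHUP to clwatchlogd after a RELIANT_LOG_PATH change could be scripted roughly as follows. This is a sketch under stated assumptions, not a PRIMECLUSTER tool: it assumes a Linux /proc filesystem, assumes the daemon's process name as reported in /proc/<pid>/comm is exactly "clwatchlogd", and must run with sufficient privileges to signal the daemon. Both function names are hypothetical.

```python
import os
import signal


def find_pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose <proc_root>/<pid>/comm equals name (Linux only)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                if handle.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited or is not readable; skip it.
            continue
    return sorted(pids)


def send_sighup(pids, kill=os.kill):
    """Send SIGHUP to each PID and return how many signals were sent.

    The kill parameter exists only so the function can be exercised
    without signalling real processes.
    """
    for pid in pids:
        kill(pid, signal.SIGHUP)
    return len(pids)
```

On a cluster node, calling `send_sighup(find_pids_by_name("clwatchlogd"))` after editing "hvenv.local" would deliver the signal. A return value of 0 means no matching process was found, in which case the process name assumption should be checked.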
