7.4.2 Corrective Action when Patrol Diagnosis Detects a Fault

This section explains the actions to take when the patrol diagnosis facility detects a failure.

Use one of the following methods to identify the faulted hardware:

- Message text output to the CRM main window or syslogd(1M)
  See "Display format 1" in "D.1 Searching for a Message."
- CRM main window
  The CRM main window displays the OFF-FAIL state for the faulted hardware. See "7.1.2 CRM Main Window."
- "clgettree(1)" command
  The "clgettree(1)" command displays the OFF-FAIL state for the faulted hardware. See the manual page for clgettree(1).
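
When working from a shell, the OFF-FAIL entries can be filtered out of the clgettree(1) output. The following is a minimal sketch, not part of the product. The function name `list_off_fail` is hypothetical, and the sketch assumes only that the state is printed as the literal text OFF-FAIL on the same line as the resource it belongs to, and that clgettree is on the PATH.

```shell
# Hypothetical helper: read clgettree(1) output from stdin and print only
# the lines that carry the OFF-FAIL state.
list_off_fail() {
    grep 'OFF-FAIL' || true   # finding no faulted hardware is not an error
}

# Usage on a cluster node (assumes clgettree is on the PATH):
#   clgettree | list_off_fail
```

An empty result suggests that no hardware is currently in the OFF-FAIL state. The exact column layout of the output is not relied upon.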

Correct the faulted hardware according to the operation procedure below.

Operation Procedure:

Stop the node to which the faulted hardware is connected.
Repair the faulted hardware.
Start the node.
Note
When replacing a disk unit that is registered with GDS, follow the GDS disk replacement procedure. For information on GDS disk replacement, see "Disk Unit Error" in the "PRIMECLUSTER Global Disk Services Configuration and Administration Guide."
Check that the faulted hardware has recovered using one of the following methods:
1. Use the CRM main window.
2. Execute the "clgettree(1)" command.
If the check shows that the fault has not been corrected, run the diagnosis operation using one of the following methods:
1. Execute the diagnosis operation for the faulted hardware from the CRM main window.
  Then, use the CRM main window to check whether the fault was corrected. If the fault was corrected, the ON state is displayed.
2. Execute the "clsptl(1M)" command to initiate the diagnosis operation.
  The "clsptl(1M)" command has two functions. One function allows you to specify a faulted hardware unit and diagnoses only the specified device. The other function runs batch diagnosis of all shared disk units or all network interface cards. If faults occur in multiple hardware units, it is convenient to use the batch diagnosis function.
  - Example in which a faulted shared disk unit is specified and diagnosis is executed:
```
# /etc/opt/FJSVcluster/bin/clsptl -u generic -n c1t4d4
```
  - Example in which batch diagnosis is executed for all shared disk units:
```
# /etc/opt/FJSVcluster/bin/clsptl -a DISK
```
  Execute the "clgettree(1)" command to check whether the fault was corrected. If the fault was corrected, the ON state will be displayed for the hardware.
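
The check after diagnosis can also be scripted. The sketch below is illustrative only: the function name `resource_recovered` is hypothetical, and it assumes that clgettree(1) prints the resource name (for example, c1t4d4) and its state as separate whitespace-delimited fields on one line. It also assumes a POSIX awk. Verify the assumptions against the actual output on your system before relying on it.

```shell
# Hypothetical helper: read clgettree(1) output from stdin and succeed
# (exit status 0) only if the named resource appears in the output and
# none of its lines carries the OFF-FAIL state.
resource_recovered() {
    awk -v name="$1" '
        {
            has_name = 0; failed = 0
            for (i = 1; i <= NF; i++) {
                if ($i == name)       has_name = 1
                if ($i == "OFF-FAIL") failed = 1
            }
            if (has_name) { seen = 1; if (failed) bad = 1 }
        }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Usage on a cluster node (assumes clgettree is on the PATH):
#   clgettree | resource_recovered c1t4d4 && echo "c1t4d4 is no longer OFF-FAIL"
```

The function reports failure when the resource name is not found at all, so a mistyped name is not mistaken for a recovered device.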
Bring the Faulted cluster application Online.
Confirm the state of the cluster application to which the recovered hardware is registered, either in the RMS main window or with the "hvdisp(1M)" command.
If the cluster application is Faulted, switch the cluster application from the failed to the active state, either in the RMS main window or with the "hvutil(1M)" command. For information on the procedures related to the CRM main window, see "7.2.2.4 Bringing Faulted Cluster Application to Online State."
If the operator intervention request function is enabled, a message is output to syslogd(1M) and displayed in Cluster Admin when RMS is started. By responding to this message, you can switch the cluster application from the failed state to the active state. For information on the setup procedure for operator intervention requests, see "5.4 Setting Up Fault Resource Identification and Operator Intervention Request."
An example of an operator intervention request is shown below. For details on the messages requesting operator intervention, see "D.7.2 Failed Resource and Operator Intervention Messages (GUI)" and "D.5 Operator Intervention Messages."
```
1422 On the SysNode "node1RMS", the userApplication "app0" is the Faulted state due to a fault in the resource "apl1". 
Do you want to clear fault? (yes/no)
Message number: 1001
```

Note

If "Yes" is set for the "AutoStartUp" attribute, an operator intervention request message will be displayed at node startup. Respond to the operator intervention message after executing step 4 of the procedure.
