
2.1.5 Fault processing

The handling of fault situations is a central aspect of RMS. How RMS reacts to faults differs depending on the state of an application at any particular time. For instance, the reaction to faults that occur in the resource graph of an ongoing application differs from the reaction to faults in the graph of an application that is locally offline.

2.1.5.1 Faults in the online state or request processing

When a detector indicates a fault for an online object whose corresponding userApplication is also online, RMS executes the fault script of the object. An equivalent fault condition occurs if the detector indicates that a previously online object is offline although no request is present.

After the fault script completes, RMS notifies the parents of the fault. The parents also execute their fault scripts and forward the fault message.

orOp objects, which report the logical OR of their children's states, are a special case: they react to the fault message only if none of their children is online. If any child of the orOp is online, RMS terminates fault processing at that point.

If no intermediate orOp object intercepts the fault message, it reaches the userApplication, which then executes its fault script. Processing then follows one of three cases, depending on the combination of the AutoSwitchOver and PreserveState attributes:

AutoSwitchOver includes ResourceFailure

When the AutoSwitchOver attribute includes ResourceFailure, RMS ignores the PreserveState attribute and responds as if only the AutoSwitchOver attribute were set. In this case, the process is as follows:

  1. The userApplication attempts to initiate the switchover procedure. For this purpose, the application on the local node must be set to a defined Offline state. The procedure is the same as that described under offline processing.

  2. When offline processing is successfully completed, an online request is sent to the corresponding userApplication of a remote node (see "2.1.6 Switch processing"). However, unlike the situation with a normal offline request, the userApplication is now in the Faulted state. This prevents the application from returning to this node in the event of another switchover.

If a further fault occurs during offline processing, for example, if RMS cannot deconfigure the resource of an object that was notified of a Faulted state, RMS does not execute a switchover procedure, because it views the resources as being in an undefined state. The userApplication does not initiate any further actions and blocks all external, non-forced requests.

Point

A failure during offline processing that was initiated by a previous fault is called a double fault.
RMS cannot resolve this situation; it requires the intervention of the system administrator. RMS applies the following principle in this case: preventing the possible destruction of data is more important than maintaining the availability of the application.
If the application is important, the HaltFlag attribute can be set in the userApplication during the configuration procedure. This attribute ensures that the local node is shut down immediately if RMS cannot resolve a double fault state, provided there is another node available for the application. The other nodes detect this as a system failure, and RMS transfers the applications running on the failed node to the available node.
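
As an illustration, the following is a minimal sketch of how an administrator might inspect the situation after a double fault before intervening. The application name app is only an example, and the switchlog path shown is the default log location, which may differ on a given installation:

  hvdisp -a                                   # display the state of all RMS objects; look for Faulted entries
  tail -50 /var/opt/SMAWRrms/log/switchlog    # review the messages RMS logged for the failed offline processing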

AutoSwitchOver does not include ResourceFailure and PreserveState=1

In this case, the process is as follows:

  1. The userApplication does not initiate any further activity after the fault script executes.

  2. All objects remain in their current state.

Use the PreserveState attribute if an application can remedy faults in required resources.

AutoSwitchOver does not include ResourceFailure and PreserveState=0 or is not set

In this case, RMS carries out offline processing as a result of the fault, but it does not initiate a switchover after offline processing is complete (successful or not).

Fault during pending switch request

A special case occurs when a switch request causes a fault during offline processing. In this case, RMS carries out a switchover after completing the offline processing that the fault caused (provided that offline processing is successful), even if the AutoSwitchOver attribute is set to No: the system administrator who sent the switch request has evidently requested a switchover at this time. If the pending request is a directed switch request, the target node of the switchover is not necessarily the node with the highest priority; it is the node explicitly specified in the directed switch request.
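
For reference, switch requests are issued with the hvswitch command; the following is a minimal sketch, where app and fuji3RMS are example names:

  hvswitch app              # priority switchover: RMS selects the target node by priority
  hvswitch app fuji3RMS     # directed switchover: the administrator names the target SysNode explicitly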

See

For more information about the AutoSwitchOver and PreserveState attributes, see "Appendix D Attributes".

2.1.5.2 Offline faults

Even if an application is not online on a node, RMS still monitors the objects configured in the application's graph. If a detector indicates a fault in one of these objects, the fault is displayed. However, no processing takes place, the fault script is not executed, and no message is sent to the parent.

In this case, an andOp object can be Offline even though one of its children is Faulted.

This design was chosen on the principle that mandatory dependencies between the objects in a userApplication graph exist only if the userApplication is to run.

2.1.5.3 AutoRecover attribute

An object of type gResource that represents a local file system is one example of an object that can enter the Faulted state for reasons that are easily and automatically remedied. A fault that occurs in the object itself (and not as a result of an input/output fault on an underlying disk) is most likely the result of an erroneously executed umount command. In this case, switching over the entire application would probably not be the best remedy, so normal fault processing is not the best solution.

For such cases, administrators can configure an object's AutoRecover attribute. If a fault then occurs when the object is online, the online script is invoked before the fault script. If the object enters the Online state again within a specific period after the online script has been executed, fault processing does not take place.

RMS evaluates the AutoRecover attribute only when the object itself is the cause of the fault, that is, when the fault was not reported by one of its children. Accordingly, RMS evaluates AutoRecover only for objects with a detector. The AutoRecover attribute is not relevant if a fault occurs during request processing or if the object is in the Offline state.
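
To illustrate, the following is a minimal sketch of the kind of online script such a file-system object might use to reverse an accidental unmount. The mount point /mnt is only an example, and the actual scripts generated by the Wizard Tools look different:

  #!/bin/sh
  # Illustrative online script: remount the file system if it is not mounted.
  MOUNTPOINT=/mnt
  if mount | grep -q " $MOUNTPOINT "; then
      exit 0                # already mounted; report success immediately
  fi
  mount $MOUNTPOINT         # relies on an entry for /mnt in /etc/fstab (or /etc/vfstab)
  exit $?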

2.1.5.4 Fault during offline processing

A fault that occurs during offline processing does not immediately halt offline processing at that object. Instead, the fault condition at that point in the tree is stored, and offline processing continues in the normal manner down to the leaf objects. The fault is then recalled and handled when the success or failure message propagates back up toward the userApplication and reaches the object. This design avoids race conditions that could occur if the fault were processed immediately.

2.1.5.5 Examples of fault processing

The following are examples of fault processing.

Example 4

The scenario for this example is as follows: the userApplication app is online on the node fuji2RMS, its AutoSwitchOver attribute is set to a value other than No, and the local file system represented by the gResource object lfs is unexpectedly unmounted.

Fault processing is as follows:

  1. The lfs object's gResource detector indicates that its object is offline. Because the corresponding userApplication is online and there is no offline request, RMS interprets this offline report as a fault and notifies the parent cmd.

    Point

    Reminder: An unexpected Offline state results in a fault.

  2. The cmd object in this example does not have a fault script, so it goes directly to the Faulted state and reports the fault to its parent andOp1.

  3. andOp1 does not have a fault script either, so it also goes directly to the Faulted state, and reports the fault to the parent app object.

  4. The app object then changes to the Faulted state and starts offline processing in preparation for switchover, since its AutoSwitchOver attribute is set to a value other than No.

  5. In this example, assume that the local file system lfs uses the mount point /mnt, and the offline script of lfs consists of the simple instruction umount /mnt. Because /mnt is no longer mounted, this offline script terminates with an exit status other than 0.

  6. Accordingly, the offline processing that followed the fault fails. A switchover is not possible because the local state remains unclear, and RMS waits for the intervention of the system administrator.

A more complex offline script for lfs could first check whether the file system is still mounted and, if it is not, terminate with an exit status of 0. In this case, RMS could successfully complete offline processing after the fault and switch over to fuji3RMS; all local objects on fuji2RMS would then be offline following successful offline processing, and only app would remain in the Faulted state.
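
The following is a minimal sketch of such a more robust offline script, assuming the /mnt mount point from this example; the actual scripts generated by the Wizard Tools are more elaborate:

  #!/bin/sh
  # Illustrative offline script: succeed if the file system is already unmounted.
  MOUNTPOINT=/mnt
  if ! mount | grep -q " $MOUNTPOINT "; then
      exit 0                # not mounted; the resource is already offline
  fi
  umount $MOUNTPOINT        # otherwise unmount it and return the umount exit status
  exit $?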

Example 5

The scenario is the same as in the previous example, except the AutoRecover attribute is set for the lfs object.

Fault processing is as follows:

  1. The lfs object's gResource detector indicates that its object is offline. Because the corresponding userApplication is online and there is no offline request, RMS interprets this offline report as a fault (see above).

  2. Since the AutoRecover attribute is set, RMS does not immediately report the fault to the parent cmd object. Instead, RMS starts the lfs object's online script to reverse the unmount procedure.

  3. A few seconds later, the lfs object's gResource detector reports that the object is once again online. RMS returns the object to the Online state, and no further fault processing takes place.

Example 6

In this scenario, app receives an online request, but the file system represented by lfs has been corrupted.

Fault processing is as follows:

  1. Online processing starts as a result of the request.

  2. The lfs object starts its online script, which terminates with an exit status other than 0.

  3. The lfs object then initiates fault processing: it starts its fault script (if one is configured), changes to the Faulted state, and notifies its parent of the fault.

  4. The rest of the process proceeds in the same manner as described above.

    Point

    Fault processing in this case would be the same even if the AutoRecover attribute were set. This attribute is only significant if the application is in a stable Online state, that is, the application is online and there is no pending request.

2.1.5.6 Fault clearing

After successful offline processing due to a fault occurrence, the resource objects will be offline, and the userApplication object will be faulted. If offline processing fails as a result of the fault, or if the application's PreserveState attribute is set, at least part of the graph may remain in a state other than Offline, i.e., Online, Standby, or Faulted.

In all of the above states, the userApplication prevents switch requests to this host, because the base monitor assumes that at least some of the resources are not available. After the system administrator has remedied the cause of the fault, one of the following procedures can be used to notify the base monitor so that RMS can resume normal operation:

  1. The following command may be used to clear the faulted state of the userApplication object and the objects in its graph:

    hvutil -c userApplication

    This command attempts to clear the fault by bringing the parent application and its graph into a self-consistent state: if the application object is online, then online processing will be initiated; if the application object is offline, then offline processing will be initiated. (The user is notified about which type of processing will occur and given a chance to abandon the operation.) The fault clears successfully when every branch leading to the application reaches the same online or offline state. If the final state is offline, the system administrator can set the userApplication to the online state with a switch request.

    If the userApplication object is initially online, invoking 'hvutil -c' may not affect every object in the tree. To initiate offline processing for the entire tree, use 'hvutil -f' as described below.

  2. The following command initiates an offline request to the userApplication object:

    hvutil -f userApplication

    This starts offline processing for the application. If the command completes successfully, the application and every object in its graph are switched to the offline state, and the fault is cleared. If required, the system administrator can set the userApplication to the online state with a switch request.
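
    For example, after repairing the underlying problem, the administrator might run a sequence such as the following, where app and fuji2RMS are example names:

    hvutil -f app            # drive app and its entire graph to the Offline state, clearing the fault
    hvdisp -a                # verify that the objects in the graph now report Offline
    hvswitch app fuji2RMS    # bring app back online on the desired node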

2.1.5.7 SysNode faults

RMS handles a fault that occurs in a SysNode in a different manner than faults in any other type of object. A SysNode fault occurs under the following conditions:

When either of these events happens, RMS must first ensure that the remote node is actually down before an automatic switchover occurs. To accomplish this, RMS uses the Shutdown Facility (SF). For more information about the Shutdown Facility and shutdown agents, see "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."

Once the shutdown of the cluster node is verified by the SF, all userApplication objects that were Online on the affected cluster node, and whose AutoSwitchOver attribute includes HostFailure, are priority switched to surviving cluster nodes.

Example 7

The scenario for this example is as follows: the userApplication app is online on the node fuji2RMS, its AutoSwitchOver attribute includes HostFailure, and fuji2RMS suffers a system failure.

The reaction of RMS is as follows:

  1. CF determines that a node failure has occurred and generates a LEFTCLUSTER event.

  2. RMS receives the LEFTCLUSTER event, puts the SysNode object in the Wait state, and sends a kill request to SF.

  3. After SF successfully kills the node, a DOWN event is sent.

  4. RMS receives the DOWN event and marks the SysNode as Faulted.

  5. The fuji2RMS object executes its fault script (assuming that such a script has been configured).

  6. The fuji2RMS object notifies the userApplication objects that fuji2RMS has failed. Since app was online on fuji2RMS when the node failed, and since its AutoSwitchOver attribute includes the HostFailure setting, app is switched over: online processing for app starts on a surviving node.
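
As an illustration, the result of the switchover could be verified from a surviving node, for example:

  hvdisp -a     # app should now be reported Online on a surviving node, and the SysNode fuji2RMS as Faulted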

Operator intervention

If the Shutdown Facility is engaged to kill a node, but the duration of the SysNode object's Wait state exceeds the object's ScriptTimeout limit, RMS records an ERROR message in the switchlog to this effect.

At this point, one cluster node is now in an undefined state, so RMS blocks all further action on all other nodes. This situation is usually resolved only by operator intervention as described in "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide." Upon successful completion of the procedure, CF sends a DOWN event, RMS resolves the blocked state, and normal operation resumes.

For more information about the ScriptTimeout attribute, see "Appendix D Attributes".