The PRIMECLUSTER Shutdown Facility (SF) provides a function that guarantees other nodes are shut down during error processing, such as when contention for user resources could occur in a cluster system.
Note
When CF can confirm that a cluster node has restarted, and can therefore guarantee that the node was shut down beforehand, PRIMECLUSTER SF does not shut the node down again.
PRIMECLUSTER SF is made up of the following major components:
Shutdown Daemon (SD)
The SD monitors the state of cluster machines and provides an interface for gathering status and requesting manual machine shutdown.
One or more Shutdown Agents (SA)
The SA's role is to guarantee the shutdown of a remote cluster node.
MA (asynchronous monitoring)
In addition to the SA, the MA monitors the state of remote cluster nodes and immediately detects failures in those nodes.
The route for forcibly stopping the cluster node is checked regularly (every 10 minutes).
The SA guarantees reliable shutdown of the remote cluster node. The SA used varies depending on the architecture of each cluster node.
The SA provides the following functions:
Forcibly shutting down a failed node
The SA guarantees the shutdown of a failed node.
Checking a connection with the optional hardware (Shutdown Agent testing)
The SA periodically (every ten minutes) checks the proper connection with the optional hardware that shuts down a node.
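Conceptually, each SA implements the same small contract: a forcible-stop operation and a periodic connection test. The sketch below is only an illustration of that contract; the class name, method names, and the 600-second interval constant are assumptions for the example, not part of the actual PRIMECLUSTER implementation.

# Illustrative sketch of the Shutdown Agent contract (names are invented).
import time
from abc import ABC, abstractmethod

class ShutdownAgent(ABC):
    @abstractmethod
    def shutdown(self, node: str) -> bool:
        """Guarantee that the given failed node is forcibly stopped."""

    @abstractmethod
    def test_connection(self, node: str) -> bool:
        """Check the connection to the hardware used to stop the node."""

def periodic_agent_test(agent: ShutdownAgent, nodes, interval_sec: int = 600):
    # The Shutdown Facility re-checks the stop route regularly (every 10 minutes).
    while True:
        for node in nodes:
            if not agent.test_connection(node):
                print(f"WARNING: stop route to {node} is unavailable")
        time.sleep(interval_sec)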
The PRIMECLUSTER Shutdown Facility provides the following Shutdown Agents:
RCI (SA_pprcip, SA_pprcir): Remote Cabinet Interface
This SA uses the RCI, which is one of the hardware units installed in SPARC Enterprise M-series, to stop other nodes with certainty by intentionally triggering a panic or reset in those nodes.
XSCF (SA_xscfp, SA_xscfr, SA_rccu, SA_rccux): eXtended System Control Facility
The SA uses the XSCF, which is one of the hardware units installed in SPARC Enterprise M-series, to stop other nodes with certainty by intentionally triggering a panic or reset in those nodes.
If the XSCF is being used as the console, the Shutdown Facility stops other nodes with certainty by sending the break signal to those nodes.
XSCF SNMP (SA_xscfsnmpg0p, SA_xscfsnmpg1p, SA_xscfsnmpg0r, SA_xscfsnmpg1r, SA_xscfsnmp0r, SA_xscfsnmp1r): eXtended System Control Facility Simple Network Management Protocol
The SA uses the XSCF, which is one of the hardware units installed in SPARC M10, M12 to stop other nodes with certainty by intentionally triggering a panic or reset in those nodes.
ALOM (SA_sunF): Advanced Lights Out Management
The SA uses ALOM of SPARC Enterprise T1000, T2000 to stop other nodes with certainty by sending the break signal to those nodes.
ILOM (SA_ilomp, SA_ilomr): Integrated Lights Out Manager
The SA uses ILOM of SPARC Enterprise T5120, T5220, T5140, T5240, T5440, SPARC T3, T4, T5, T7, S7 series to stop other nodes with certainty by intentionally triggering a panic or reset in those nodes.
KZONE (SA_kzonep, SA_kzoner, SA_kzchkhost): Oracle Solaris Kernel Zones
If Oracle Solaris Kernel Zones are used with SPARC M10, M12 and SPARC T4, T5, T7, S7 series, this SA stops other nodes (Kernel Zones) with certainty by intentionally triggering a panic or reset in those nodes.
The status of the global zone host is also checked, so that when the global zone host is stopped, the other node (Kernel Zone) is also determined to be stopped. The global zone host itself is not forcibly stopped.
BLADE (SA_blade)
This SA, which can be used in the PRIMERGY blade server, uses the SNMP command to stop other nodes with certainty by shutting them down.
IPMI (SA_ipmi): Intelligent Platform Management Interface
This SA uses the IPMI to operate iRMC (integrated Remote Management Controller), which is one of the hardware modules installed in PRIMERGY, and stop other nodes with certainty by shutting them down.
kdump (SA_lkcd)
The SA uses kdump in PRIMERGY or the PRIMERGY blade server to stop other nodes with certainty by intentionally triggering a panic.
MMB (SA_mmbp, SA_mmbr): Management Board
This SA uses the MMB, which is one of the hardware units installed in PRIMEQUEST 2000, to forcibly stop other nodes with certainty by intentionally triggering a panic or reset in those nodes.
iRMC (SA_irmcp, SA_irmcr, SA_irmcf)
This SA uses iRMC / MMB, which are the hardware units installed in PRIMEQUEST 3000, to forcibly stop other nodes with certainty by intentionally triggering a panic or reset in those nodes, or by shutting off their power.
Note
This SA is not available in PRIMERGY iRMC.
ICMP (SA_icmp)
The SA uses network paths to check the state of other nodes. If no response is received from a node on any of the specified network paths, the SA determines that the node is shut down.
Other nodes are not forcibly shut down.
The figure below shows an example of state confirmation by SA_icmp if one node (Node 2) goes down in a cluster system with two nodes.
If no response is received from Node 2 through all specified network paths, SA_icmp determines that Node 2 is shut down.
Figure 2.3 State confirmation by SA_icmp if the other node goes down
The figure below shows an example of state confirmation by SA_icmp if the cluster interconnect fails in a cluster system with two nodes.
If Node 1 receives a response from Node 2 on any of the specified network paths, SA_icmp determines that Node 2 is running.
In this case, Node 2 is not forcibly shut down by SA_icmp.
Figure 2.4 State confirmation by SA_icmp if the cluster interconnect fails
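The decision rule illustrated by the two figures above can be summarized as: a node is judged to be down only when no response is received on every specified network path. The following is a minimal sketch of that rule, assuming Linux-style ping options and hypothetical addresses; the real SA_icmp is configured through the Shutdown Facility, not through a script like this.

# Illustrative sketch of the SA_icmp decision rule (addresses are hypothetical).
import subprocess

def path_responds(address: str, timeout_sec: int = 2) -> bool:
    # One ICMP echo request per path; a non-zero exit status means no response.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_sec), address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def node_is_down(paths) -> bool:
    # Judged to be shut down only if NO specified path responds (Figure 2.3).
    # A response on ANY path means the node is treated as running (Figure 2.4).
    return not any(path_responds(addr) for addr in paths)

node2_paths = ["192.168.1.2", "10.0.0.2"]   # hypothetical administrative/public LAN addresses
print("Node 2 is shut down" if node_is_down(node2_paths) else "Node 2 is running")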
VMCHKHOST (SA_vmchkhost)
When the cluster system is installed on the host OS of a KVM virtual machine, the SA checks the status of guest OSes in cooperation with the cluster system of the host OS.
Other nodes are not forcibly shut down.
libvirt (SA_libvirtgp, SA_libvirtgr)
When using a KVM virtual machine on PRIMERGY, the PRIMERGY blade server, or the PRIMEQUEST 3000/2000 series, the SA stops other nodes with certainty by intentionally triggering a panic or reset in those nodes.
VMware vCenter Server functional cooperation (SA_vwvmr)
Cooperating with VMware vCenter Server, the SA stops other nodes (guest OSes) with certainty by intentionally powering them off.
FUJITSU Cloud Service OSS API (SA_vmk5r)
This shutdown agent enables reliable node shutdown by intentionally shutting down or powering off the remote node (instance) using the FUJITSU Cloud Service OSS API.
OpenStack API (SA_vmosr)
This shutdown agent enables reliable node shutdown by intentionally restarting the remote node (instance) using the OpenStack API.
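As an illustration of this approach, the sketch below issues a hard reboot of an instance through the OpenStack compute (Nova) API. The endpoint, token, and server ID are placeholders, and the actual SA_vmosr is configured through the Shutdown Facility and may use a different request sequence.

# Illustrative sketch: forcibly restarting an instance via the OpenStack compute API.
import requests

COMPUTE_ENDPOINT = "https://compute.example.com/v2.1"   # placeholder endpoint
AUTH_TOKEN = "gAAAA..."                                  # placeholder Keystone token
SERVER_ID = "0c1d2e3f-0000-0000-0000-000000000000"       # placeholder instance ID

def hard_reboot_instance():
    # POST /servers/{server_id}/action with a "reboot" body requests a hard restart.
    resp = requests.post(
        f"{COMPUTE_ENDPOINT}/servers/{SERVER_ID}/action",
        headers={"X-Auth-Token": AUTH_TOKEN},
        json={"reboot": {"type": "HARD"}},
        timeout=30)
    resp.raise_for_status()   # Nova returns 202 Accepted on success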
The Monitoring Agent (MA) has the capability to monitor the state of a system and promptly detect a failure such as system panic and shutdown. This function is provided by taking advantage of the hardware features that detect the state transition and inform the upper-level modules.
Without the MA, a node failure is detected only when the cluster heartbeat times out after its periodic interval. The MA allows the PRIMECLUSTER system to detect a node failure quickly.
The MA provides the following functions:
Monitoring a node state
The MA monitors the state of the remote node using the hardware features. It notifies the Shutdown Facility (SF) of a failure in the event of an unexpected system panic or shutoff. Even when heartbeat responses between cluster nodes are temporarily interrupted because of an overloaded system, the MA recognizes the correct node state.
Forcibly shutting down a failed node
The MA provides a function to forcibly shut down a failed node, acting as a Shutdown Agent (SA).
Checking a connection with the optional hardware (Shutdown Agent testing)
The MA provides a function as the SA (Shutdown Agent). It periodically (every ten minutes) checks the proper connection with the optional hardware that monitors a node state or shuts down a node.
PRIMECLUSTER SF provides the following Monitoring Agents:
RCI monitoring agent
The MA monitors the node state and detects a node failure by using the SCF/RCI mounted on SPARC Enterprise M-series. The System Control Facility (SCF), which is implemented on a hardware platform, monitors the hardware state and notifies the upper-level modules. The MA assures node elimination and prevents access to the shared disk.
Console monitoring agent
The console monitoring agent monitors the messages output to the console of each node using XSCF/ILOM. If an error message indicating a node failure is output to the console of one node, the node monitoring it detects the message and notifies SF of the node failure. Normally, the console monitoring agent forms a monitoring loop in which each node monitors another node, for example, node A monitors node B, node B monitors node C, and node C monitors node A. If one node goes down because of a failure, another node takes over the monitoring role of the failed node.
The console monitoring agent also ensures node elimination by sending a break signal to the failed node.
The figure below shows how the monitoring feature is taken over in a cluster system with three nodes if one node goes down. The arrow indicates that a node monitors another node.
Figure 2.5 MA normal operation
When a failure occurs, and Node 2 is DOWN, the following actions occur:
Node 1 begins to monitor Node 3.
The following message is output to the /var/adm/messages file of Node 1:
FJSVcluster: INFO: DEV: 3044: The console monitoring agent took over monitoring (node: targetnode)
The figure below shows how Node 1 added Node 3 as the monitored node when Node 2 went down.
Figure 2.6 MA operation in the event of node failure
Note
If the monitoring function is taken over while the console monitoring agent is stopped, the stopped console monitoring agent is resumed.
When Node 2 recovers from the failure and starts, the following actions occur:
The original monitoring mode is restored.
The following message is output to the /var/adm/messages file of Node 1:
FJSVcluster: INFO: DEV: 3045: The console monitoring agent cancelled to monitor (node: targetnode)
The figure below shows how Node 2 returns to monitoring Node 3 once it has been restored to the cluster.
Figure 2.7 Node recovery
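The takeover behaviour shown in Figures 2.5 to 2.7 amounts to keeping a monitoring ring over the nodes that are currently up: each live node monitors the next live node, so the target of a failed node is re-assigned to its monitor, and the original assignment returns when the node recovers. The sketch below illustrates only this assignment rule; the node names and the helper function are invented for the example.

# Illustrative sketch of the console monitoring ring (names are invented).
def monitoring_ring(nodes, up):
    """Each live node monitors the next live node, wrapping around."""
    live = [n for n in nodes if up[n]]
    return {live[i]: live[(i + 1) % len(live)] for i in range(len(live))}

nodes = ["Node1", "Node2", "Node3"]
up = {"Node1": True, "Node2": True, "Node3": True}
print(monitoring_ring(nodes, up))
# {'Node1': 'Node2', 'Node2': 'Node3', 'Node3': 'Node1'}  -- normal operation (Figure 2.5)

up["Node2"] = False                     # Node 2 goes DOWN
print(monitoring_ring(nodes, up))
# {'Node1': 'Node3', 'Node3': 'Node1'}  -- Node 1 takes over monitoring Node 3 (Figure 2.6)

up["Node2"] = True                      # Node 2 recovers
print(monitoring_ring(nodes, up))
# The original monitoring mode is restored (Figure 2.7)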
The following are possible messages that might be found in the /var/adm/messages file:
FJSVcluster: INFO: DEV: 3042: The RCI monitoring agent has been started
Indicates that the RCI monitoring agent is enabled.
FJSVcluster: INFO: DEV: 3043: The RCI monitoring agent has been stopped.
Indicates that the monitoring feature is disabled.
FJSVcluster: INFO: DEV: 3040: The console monitoring agent has been started (node:monitored node name)
Indicates that the monitoring feature of the console monitoring agent is enabled.
FJSVcluster: INFO: DEV: 3041: The console monitoring agent has been stopped (node:monitored node name)
Indicates that the monitoring feature of the console monitoring agent is disabled. While this monitoring feature is disabled, the function that forcibly brings the node DOWN might not work.
Note
The console monitoring agent monitors the console messages of the remote node, so it cannot recognize the node state in the event of an unexpected shutdown. In such a case, the node goes into the LEFTCLUSTER state, and you need to mark the remote node as DOWN. For how to mark a node as DOWN, see "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."
SNMP asynchronous monitoring
This function monitors the node state by using the eXtended System Control Facility (XSCF) installed in the SPARC M10, M12.
The function can ascertain node failures by having the XSCF report the node state to the software using SNMP (Simple Network Management Protocol).
This function can intentionally trigger a panic or a reset in other nodes to forcibly stop those nodes with certainty and prevent contention over user resources.
MMB asynchronous monitoring
This function uses the MMB, which is one of the hardware units installed in PRIMEQUEST 2000, to monitor nodes. The function can ascertain node failures by having the MMB, which is one of the standard units installed in the hardware, report the node state to the software.
This function can intentionally trigger a panic or a reset in other nodes to forcibly stop those nodes with certainty and prevent contention over user resources.
iRMC asynchronous monitoring
This function uses the iRMC and MMB, which are the hardware units installed in PRIMEQUEST 3000, to monitor nodes. The function can ascertain node failures by having the iRMC and MMB, the standard units installed in the hardware, report the node state to the software. This function can intentionally trigger a panic or reset in other nodes, or shut off their power, to forcibly stop those nodes with certainty and prevent contention over user resources.
Note
This SA is not available in PRIMERGY iRMC.
Note
Node state monitoring of the RCI asynchronous monitoring function operates from when message (a) shown below is output until message (b) is output.
The messages for the console asynchronous monitoring function are messages (c) and (d).
The messages for the SNMP asynchronous monitoring function are messages (e) and (f).
The messages for the MMB asynchronous monitoring function are messages (g) and (h).
The messages for the iRMC asynchronous monitoring function are messages (i) and (j).
When node state monitoring is disabled, the function that forcibly stops nodes may not operate normally.
(a) FJSVcluster: INFO: DEV: 3042: The RCI monitoring agent has been started.
(b) FJSVcluster: INFO: DEV: 3043: The RCI monitoring agent has been stopped.
(c) FJSVcluster: INFO: DEV: 3040: The console monitoring agent has been started (node:monitored node name).
(d) FJSVcluster: INFO: DEV: 3041: The console monitoring agent has been stopped (node:monitored node name).
(e) FJSVcluster: INFO: DEV: 3110: The SNMP monitoring agent has been started.
(f) FJSVcluster: INFO: DEV: 3111: The SNMP monitoring agent has been stopped.
(g) FJSVcluster: INFO: DEV: 3080: The MMB monitoring agent has been started.
(h) FJSVcluster: INFO: DEV: 3081: The MMB monitoring agent has been stopped.
(i) FJSVcluster: INFO: DEV: 3120: The iRMC asynchronous monitoring agent has been started.
(j) FJSVcluster: INFO: DEV: 3121: The iRMC asynchronous monitoring agent has been stopped.