C.3.4 Corrective action when the sfcfrmd daemon is not started

This section describes corrective actions to take when the sfcfrmd daemon is not started at node startup, when the node enters multi-user mode, or when CF is started from the GUI.

To keep file system access consistent, startup of the sfcfrmd daemon is suspended until a quorum exists.

If activation of the daemon is suspended, the following message is output:

WARNING: sfcfsrm:5001: Starting the sfcfrmd daemon was suspended because quorum does not exist

Normally, no corrective action is needed to activate the sfcfrmd daemon, because it is started as soon as a quorum exists.
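
To check whether startup was suspended, the message ID can be searched for in the system log. The following is a minimal sketch only: the function name sfcfrmd_suspended is hypothetical, and the default log path /var/log/messages is an assumption that depends on the syslog configuration of the node.

```shell
#!/bin/sh
# Sketch: report whether the sfcfsrm:5001 suspension warning was logged.
# Assumption: the warning is written to the file given as $1
# (hypothetical default: /var/log/messages).
sfcfrmd_suspended() {
    logfile="${1:-/var/log/messages}"
    # Match on the message ID only, so the exact message wording does not matter.
    grep -q 'sfcfsrm:5001' "$logfile"
}

# Example use (not executed here):
#   sfcfrmd_suspended /var/log/messages && echo "sfcfrmd startup was suspended"
```

The function returns 0 when the warning is found, so it can be used directly in an if statement.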

In the following cases, a quorum does not exist, so corrective action should be taken to start operation of the GFS Shared File System.

A cluster partition error exists.
All the cluster nodes or CFs were stopped, and then only some of the nodes or CFs were started and are operating because of a failure.

If GFS cannot be operated because the sfcfrmd daemon is not activated, take the following steps:

Procedure 1. Check the state of all the cluster nodes.

Connect to all the operating nodes and check whether the same state is displayed on each of them, using the cftool(1M) command or the Cluster Admin GUI.

# cftool -n <Enter>
Node  Number State       Os      Cpu
sunny 1      UP          Linux   Pentium
monny 2      UP          Linux   Pentium

If the node states displayed differ among the operating nodes, a cluster partition error exists.

See

For details about cftool(1M), see "Node details" of "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."
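
The comparison in Procedure 1 can be sketched as a small script. It assumes that the output of "cftool -n" from each operating node has already been saved to one file per node (how the files are collected is left open), and that each file contains only the command output in the column layout shown above. The function names node_states and same_cluster_view are hypothetical.

```shell
#!/bin/sh
# Sketch: compare saved "cftool -n" outputs, one file per node.

# Print "node state" pairs from one saved output, skipping the header line.
node_states() {
    awk 'NR > 1 && NF >= 3 { print $1, $3 }' "$1" | sort
}

# Return 0 if every file shows the same node states as the first file,
# 1 otherwise (a possible cluster partition).
same_cluster_view() {
    ref=$(node_states "$1")
    shift
    for f in "$@"; do
        if [ "$(node_states "$f")" != "$ref" ]; then
            echo "view in $f differs from the first file"
            return 1
        fi
    done
    return 0
}

# Example use (file names are placeholders):
#   same_cluster_view cftool.sunny cftool.monny || echo "possible cluster partition"
```

Only the Node and State columns are compared, so differences in column spacing do not cause a false mismatch.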

Procedure 2. If a cluster partition error occurs, take the following corrective action:

When a LEFTCLUSTER node exists
If Shutdown Facility (SF) is running properly on all the cluster nodes, it resolves the cluster partition error, so no corrective action is necessary. If SF is not running properly, or if forced shutdown of the node through SF fails, you need to recover the node manually. Take corrective action according to "Caused by a cluster partition" of "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."
When a LEFTCLUSTER node does not exist
- Perform a CF or node restart
  Take corrective action according to "Join-related problems" of "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."
- Activate all the CFs
  After stopping all the CFs, resolve the cluster partition error, then restart CF according to "12.6 How to start up CF by the GUI when a GFS Shared File System is used."
- Activate all nodes
  After stopping all nodes with shutdown(8), resolve the cluster partition error, then restart all nodes.
  Note
  If startup of a node is suspended, complete the startup of the node using the operations below before executing shutdown(8) on that node.
  1. Confirm the process ID of the sfcfsrm script by ps(1).
    [RHEL6]
    The first column of the output below shows the process ID of the sfcfsrm script.
    # /bin/ps -e | /bin/grep sfcfsrm <Enter>
    18550 ? 00:00:00 S57sfcfsrm
    [RHEL7]
    The second column of the output below shows the process ID of the sfcfsrm script.
    # /bin/ps -ef | /bin/grep sfcfsrm | /bin/grep -v grep <Enter>
    root 18550 1 0 02:35 ? 00:00:00 /bin/bash -c /opt/FJSVsfcfs/lib/systemd/sfcfsrm start
  2. Stop the sfcfsrm script by kill(1).
    # /bin/kill -9 18550 <Enter>
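
The process ID lookup in step 1 of the Note can be sketched as follows. The function name sfcfsrm_pid and its "rhel6"/"rhel7" argument are hypothetical; the column positions are taken from the two ps(1) examples above. The sketch only prints candidate process IDs and does not call kill(1), so the result can be checked before step 2.

```shell
#!/bin/sh
# Sketch: print the PID of the sfcfsrm script from ps output read on stdin.
# $1 selects the layout: "rhel6" (ps -e, PID in column 1)
#                     or "rhel7" (ps -ef, PID in column 2).
sfcfsrm_pid() {
    case "$1" in
        rhel6) col=1 ;;
        rhel7) col=2 ;;
        *) echo "usage: sfcfsrm_pid rhel6|rhel7" >&2; return 2 ;;
    esac
    # The [s] bracket keeps this awk process from matching its own
    # command line when the input comes from "ps -ef".
    awk -v col="$col" '/[s]fcfsrm/ { print $col }'
}

# Example use (not executed here):
#   /bin/ps -e  | sfcfsrm_pid rhel6
#   /bin/ps -ef | sfcfsrm_pid rhel7
```

If more than one line is printed, confirm which process is the sfcfsrm script before stopping it.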
Information
If you want to resume operation immediately with a part of the nodes, take the following action:
1. Decide which nodes will be used in the cluster after performing "Procedure 1" (above).
  Typically, you will choose the group of nodes that contains the largest number of nodes, or the one where the most important hardware is connected or the most important application runs. In addition, the nodes that will be used must include at least one MDS node so that the GFS Shared File System can be used.
2. Stop all nodes that will not be used with shutdown(8).
  If startup of a node is suspended, complete the startup of the node before executing shutdown(8). The procedure is shown in [Note] (above).
3. Check the state of all the nodes that will be used using "Procedure 1" again. The state of all the nodes must be the same.
4. Forcibly restart the sfcfrmd daemon that has been suspended by executing the sfcfrmstart(8) command on all the nodes that will be used.
  An example of sfcfrmstart(8) is shown in "Procedure 3" (below).
See
For details about shutdown(8), ps(1), grep(1) and kill(1), see the online manual page.
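
The first decision in Procedure 2 is whether any node is in the LEFTCLUSTER state. As a minimal sketch, the State column of "cftool -n" output can be filtered as shown below. The function name leftcluster_nodes is hypothetical, and the sketch assumes the column layout shown in Procedure 1, with the state in the third column.

```shell
#!/bin/sh
# Sketch: list nodes whose State column is LEFTCLUSTER.
# Reads "cftool -n" output on stdin; the first line is treated as the header.
leftcluster_nodes() {
    awk 'NR > 1 && $3 == "LEFTCLUSTER" { print $1 }'
}

# Example use (not executed here):
#   cftool -n | leftcluster_nodes
```

Empty output means that no LEFTCLUSTER node was reported on the node where the command was run.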

Procedure 3. If a cluster partition error does not exist, forcibly restart the sfcfrmd daemon that has been suspended by executing the sfcfrmstart(8) command on all the nodes where operation will be resumed.

# sfcfrmstart -f <Enter>