When problems occur, RMS prints out meaningful error messages that will assist you in troubleshooting the cause. If no message is available, the following information may help you diagnose and correct some unusual problems:
RMS dies immediately after being started.
At startup, the RMS base monitor exchanges its configuration checksum with the other base monitors on remote nodes. If the checksum of the starting base monitor matches the checksums from the remote nodes, the startup process continues. If the checksums do not match, then the RMS base monitor shuts down if all of the following conditions are true:
The base monitor has encountered a different checksum from a remote monitor within the initial startup period (see "HV_CHECKSUM_INTERVAL").
There are no applications on this node that are online, waiting, busy, or locked.
This base monitor has not encountered any online remote base monitors.
Otherwise, the base monitor keeps running, but all remote monitors whose checksums do not match the local configuration checksum are considered to be offline. Therefore, no message exchange is possible with these monitors, and no automatic or manual switchover will be possible between the local monitor and these remote monitors.
When differing checksums are encountered, messages explaining the situation are written to the switchlog.
Note
Configuration checksum differences often result when global environment variables are changed manually but inconsistently on different nodes.
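A quick way to spot such an inconsistency is to compare the RMS environment files between nodes. The following is only a rough sketch, assuming the default RELIANT_PATH of /opt/SMAW/SMAWRrms, that any manual changes were made in hvenv.local, working ssh access between nodes, and hypothetical node names fuji2 and fuji3:

    # Compare RMS environment settings between two nodes (bash process
    # substitution; node names and path are assumptions, adjust to your site)
    diff <(ssh fuji2 cat /opt/SMAW/SMAWRrms/bin/hvenv.local) \
         <(ssh fuji3 cat /opt/SMAW/SMAWRrms/bin/hvenv.local)

Any difference in global environment variable settings is a candidate cause of a checksum mismatch.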
Action:
To verify that a configuration checksum difference is not the cause of the problem, ensure that all nodes have been updated with the proper configuration by using the following procedure (a consolidated command sketch appears after the steps):
Stop RMS on all nodes in the cluster.
Determine which configuration to run. Use 'hvdisp -a' or 'hvdisp -T SysNode' on each node to verify the name of the configuration file. (The hvdisp command does not require root privilege.)
A configuration may have the same name but different contents on two or more nodes if one of the following has occurred:
A previous RMS configuration distribution failed.
RMS Wizard Tools were used independently on multiple nodes in the cluster.
Activate the correct configuration with the same tool (Wizard Tools) that was used to create it. For the correct procedure, see the section "3.4 Activating a configuration".
Alternatively, redistribute the existing <configname>.us file with the following method:
In the RMS Wizard Tools, use Configuration Push.
Make sure the activation is successful so that all the nodes are updated.
Start RMS on all of the nodes in the cluster.
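For reference, a consolidated sketch of the procedure using standard RMS commands (hvshut stops RMS, hvcm starts it). Note that hvdisp queries the running base monitor, so if in doubt, record the configuration name before stopping RMS:

    # 1. Note the active configuration on each node while RMS is running
    hvdisp -a              # or: hvdisp -T SysNode
    # 2. Stop RMS on all nodes in the cluster
    hvshut -a
    # 3. Activate or redistribute the correct configuration with the
    #    Wizard Tools (see section "3.4"), then verify it succeeded
    # 4. Start RMS on all nodes
    hvcm -a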
RMS hangs after startup (processes are running, but hvdisp hangs).
This problem might occur if the local node is in the CF state LEFTCLUSTER from the point of view of one or more of the other nodes in the cluster.
Action:
Verify the problem by using 'cftool -n' on all cluster nodes to check for a possible LEFTCLUSTER state.
Use 'cftool -k' to clear the LEFTCLUSTER state. RMS will continue to run as soon as the node has joined the cluster. No restart should be necessary.
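A minimal sketch of the check-and-clear sequence:

    # On every cluster node, check the CF state column for LEFTCLUSTER
    cftool -n
    # After confirming that the affected node is really down, clear its
    # LEFTCLUSTER state from a surviving node
    cftool -k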
RMS loops (or even dies) shortly after being started.
This problem could occur if the CIP configuration file /etc/cip.cf contains netmask entries. These entries are useless because CIP does not evaluate them. From the RMS point of view, they cannot be distinguished from IP addresses, which have the same format, so RMS invokes gethostbyaddr() on them. This normally does no harm, but in some unusual cases the OS may become confused.
Action:
Verify the problem by checking if netmask entries are present in /etc/cip.cf.
Remove the netmask entries, and restart RMS.
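Netmask values commonly begin with 255, so a rough check such as the following may help locate the offending entries. The pattern is an assumption and may need adjusting to your addressing scheme:

    # Flag tokens in /etc/cip.cf that look like typical netmask values
    grep -nE '(^|[[:space:]])255\.[0-9]+\.[0-9]+\.[0-9]+' /etc/cip.cf

Any flagged line that is meant as a netmask rather than a CIP address should be removed.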
RMS detects a failure of another node, but the node is not killed.
This problem could occur if a SysNode is in the Wait state.
Action:
Check the CF state by using cftool -n.
If the CF state is LEFTCLUSTER, manually stop the LEFTCLUSTER node, and then clear the LEFTCLUSTER state by using cftool -k.
Once the CF state changes from LEFTCLUSTER to DOWN, execute "hvdisp -T SysNode" to check the state of all SysNode objects.
If any SysNode is in the Wait state, execute "hvutil -u SysNode" to clear the Wait state (see the sketch after the Note below).
Note
Before executing the cftool -k command or the hvutil -u command, you must manually stop the node that is in the Wait state. Executing these commands may trigger failover of applications; therefore, if you execute them without first stopping the node in the Wait state, data corruption may result.
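Putting the steps above together, and heeding the Note, the sequence looks like the following sketch; fuji2RMS is a placeholder for the actual SysNode name:

    # 1. Check the CF state of all nodes
    cftool -n
    # 2. If a node is LEFTCLUSTER: manually stop that node first,
    #    then clear the LEFTCLUSTER state
    cftool -k
    # 3. Once the CF state is DOWN, inspect the SysNode objects
    hvdisp -T SysNode
    # 4. Clear a remaining Wait state (fuji2RMS is a placeholder name)
    hvutil -u fuji2RMS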
The RMS base monitor detects a loss of detector heartbeat, but there is no indication as to the reason for the loss.
In this case, the base monitor collects diagnostic information as follows:
Invokes truss(1) or strace(1) to trace the detector process
Turns on full RMS and detector logging with the -l0 (lowercase "L", zero) option
Gathers system and user times for the process
The truss(1)/strace(1) invocation and logging levels will be terminated after the number of seconds specified in the ScriptTimeout attribute. All information is stored in the switchlog file.
Note that user-specified operations such as detector tracing will continue on each node, even if the node appears to have left the cluster. In extreme cases, turning on high-detail trace levels may affect the performance of a node and contribute to delays in its base monitor heartbeat.
Action:
See the switchlog for the diagnostic information.
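If you need to reproduce similar data by hand, for example when the automatic collection window is too short, a rough sketch on Linux might look like the following; <detector> is a placeholder for the site-specific detector process name, not an actual RMS process name:

    # Trace a detector process for 30 seconds and record its CPU times
    pid=$(pgrep -f '<detector>')           # <detector> is a placeholder
    strace -f -p "$pid" -o /tmp/detector.trace &
    trace_pid=$!
    sleep 30
    kill "$trace_pid"
    ps -o pid,time,etime,comm -p "$pid"    # user/system and elapsed times

As with the built-in tracing, be aware that tracing can slow the detector and the node.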