8.1 Beginning the process

Start the troubleshooting process by gathering information to help identify the causes of problems. You can use the CF log viewer facility from the Cluster Admin GUI, look for messages on the console, or look for messages in the /var/log/messages file. You can use the cftool(1M) command to check states and configuration information.

To use the CF log viewer, click on the Tools pull-down menu and select View Syslog messages (refer to Section "4.8 Using PRIMECLUSTER log viewer" for more details). The log messages are displayed. You can search the logs using a date/time filter or scan for messages based on severity levels. To search based on date/time, use the date/time filter and press the Filter button. To search based on severity levels, click on the Severity button and select the desired severity level. You can also search the log by keyword. To detach the CF log viewer window, click on the Detach button; click on the Attach button to attach it again.

Collect information as follows:

Look for messages on the console that contain the identifier CF.
Look for messages in /var/log/messages. You might have to look in multiple files (/var/log/messages.N).
Use cftool as follows:
- cftool -l: Check local node state
- cftool -d: Check device configuration
- cftool -n: Check cluster node states
- cftool -r: Check the route status
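
The log-gathering steps above can be sketched as a small script. This is a minimal illustration, not part of PRIMECLUSTER: the function names candidate_log_files and cf_lines are hypothetical, and matching on the literal string "CF:" is an assumption based on the sample messages in this section.

```python
import glob


def candidate_log_files(base="/var/log/messages"):
    """Return the current log file followed by rotated copies (base.N)."""
    suffix_start = len(base) + 1
    rotated = [p for p in glob.glob(base + ".*") if p[suffix_start:].isdigit()]
    rotated.sort(key=lambda p: int(p[suffix_start:]))
    return [base] + rotated


def cf_lines(lines):
    """Keep only the lines that contain the CF identifier (assumed to be 'CF:')."""
    return [line.rstrip("\n") for line in lines if "CF:" in line]


if __name__ == "__main__":
    for path in candidate_log_files():
        try:
            with open(path, errors="replace") as log:
                for line in cf_lines(log):
                    print(f"{path}: {line}")
        except OSError as exc:
            # Reading the system log usually requires root privileges.
            print(f"skipping {path}: {exc}")
```

Messages from other drivers and system software do not carry the CF identifier, so a filter like this narrows the search but does not replace reading the full log and the console.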

Error log messages from CF are always placed in the /var/log/messages file; some messages may be replicated on the console. Other device drivers and system software may only print errors on the console. To have a complete understanding of the errors on a system, both console and error log messages should be examined. The section "4.5 Error Messages" in "PRIMECLUSTER Messages" contains messages that can be found in the /var/log/messages file. This list of messages gives a description of the cause of the error. This information is a good starting point for further diagnosis.

All parts of the system write error messages to this file or to the console, so it is important to look at all of the messages, not just those from the PRIMECLUSTER suite. The following is an example of a CF error message from the /var/log/messages file:

Aug 26 13:31:05 fuji2 kernel:  LOG3.0429320    1080024   100014    0    1.0 CF: Giving UP Mastering (Cluster already Running)

The parts of this message are as follows:

The first 80 bytes are the log3 prefix:

Aug 26 13:31:05 fuji2 kernel:  LOG3.0429320   1080024   100014    0    1.0         cf:elmlog

This part of the message is a standard prefix on each CF message in the log file that gives the date and time, the node name, and log3-specific information. Only the date, time, and node name are important in this context. The remainder is the error message from CF as follows:

CF: Giving UP Mastering (Cluster already Running).

This message is output when the node detects a joined server and enters an existing cluster instead of forming a new one. Refer to "Chapter 5 CF Messages" in "PRIMECLUSTER Messages" for details of the message.
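
The breakdown of the message above can be illustrated with a short parsing sketch. It assumes that the first four whitespace-separated fields are the month, day, time, and node name, and that the CF text starts at the first occurrence of "CF:". The function name split_cf_message is hypothetical. The sketch searches for "CF:" rather than cutting at byte 80, because column widths may not survive copying a line out of the log.

```python
def split_cf_message(line):
    """Split a syslog line into the fields that matter for CF diagnosis."""
    fields = line.split()
    month, day, time, node = fields[:4]
    start = line.find("CF:")
    # Everything from "CF:" onward is the error message from CF itself.
    text = line[start:].strip() if start != -1 else ""
    return {"date": f"{month} {day}", "time": time, "node": node, "message": text}


SAMPLE = (
    "Aug 26 13:31:05 fuji2 kernel:  LOG3.0429320    1080024   100014"
    "    0    1.0 CF: Giving UP Mastering (Cluster already Running)"
)

parts = split_cf_message(SAMPLE)
print(parts["date"], parts["time"], parts["node"])
print(parts["message"])
```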

Several options for the command cftool(1M) are available as sources for information. The following is an example:

root@fuji2> cftool -l
Node     Number   State  Os        Cpu       Flags
fuji2    2        UP     Linux     Pentium   0

This shows that the local node has joined a cluster as node number 2 and is currently UP. This is the normal state when the cluster is operational. Another possible response is as follows:

root@fuji2> cftool -l
Node     Number   State        Os   Cpu    Flags
fuji2    --       COMINGUP     --   --

This indicates that the CF driver is loaded and that the node is attempting to join a cluster. If the node stays in this state for more than a few minutes, something is wrong and the /var/log/messages file needs to be examined. In this case, the file shows the following:

root@fuji2> tail /var/log/messages
Aug 28 10:38:25 fuji2 kernel:  CF: (TRACE): Load: Complete.
Aug 28 10:38:25 fuji2 kernel:  CF: (TRACE): JoinServer: Startup.
Aug 28 10:38:25 fuji2 kernel:  CF: Giving UP Mastering (Cluster already Running).
Aug 28 10:38:25 fuji2 kernel: CF: fuji2: busy: local node not DOWN: retrying.

These messages show that another node (fuji4) still holds this node in the LEFTCLUSTER state. To resolve this condition, see "5.1 Description of the LEFTCLUSTER state", which describes the LEFTCLUSTER state and gives instructions for resolving it.
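
As a rough illustration of the check described above, the following sketch reads the State column from captured cftool -l output. It assumes the output has exactly the header and single data row shown in the examples in this section; the function name local_node_state is hypothetical.

```python
UP_OUTPUT = """\
Node     Number   State  Os        Cpu       Flags
fuji2    2        UP     Linux     Pentium   0
"""

COMINGUP_OUTPUT = """\
Node     Number   State        Os   Cpu    Flags
fuji2    --       COMINGUP     --   --
"""


def local_node_state(output):
    """Return the State value from the first data row of `cftool -l` output."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    row = lines[1].split()
    # The State column keeps its position even when later columns are empty.
    return row[header.index("State")]


for text in (UP_OUTPUT, COMINGUP_OUTPUT):
    state = local_node_state(text)
    if state == "COMINGUP":
        print(state, "- if this persists, examine /var/log/messages")
    else:
        print(state)
```

A script cannot tell from a single sample whether the node is stuck. The state has to be checked again after a few minutes before concluding that the join has stalled.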

The next option to cftool(1M) shows the device states as follows:

root@fuji2> cftool -d
Number  Device  Type  Speed    Mtu   State  Configured  Address
1       eth0    4     100      1432   UP     YES         00.03.47.c2.a8.82
2       eth1    4     100      1432   UP     YES         00.02.b3.88.09.f1
3       eth2    4     100      1432   UP     NO          00.02.b3.88.09.ea

Here we can see that there are two interconnects configured for the cluster (the lines with YES in the Configured column). This information shows the names of the devices and the device numbers for use in further troubleshooting steps.
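
The following sketch shows one way to pick out the configured interconnects from captured cftool -d output. It assumes the whitespace-separated column layout shown above; the function names parse_table and configured_devices are hypothetical.

```python
DEVICE_OUTPUT = """\
Number  Device  Type  Speed    Mtu   State  Configured  Address
1       eth0    4     100      1432   UP     YES         00.03.47.c2.a8.82
2       eth1    4     100      1432   UP     YES         00.02.b3.88.09.f1
3       eth2    4     100      1432   UP     NO          00.02.b3.88.09.ea
"""


def parse_table(output):
    """Turn whitespace-separated command output into a list of row dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def configured_devices(output):
    """Return (device number, device name) for each configured interconnect."""
    return [
        (row["Number"], row["Device"])
        for row in parse_table(output)
        if row.get("Configured") == "YES"
    ]


print(configured_devices(DEVICE_OUTPUT))
```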

The cftool -n command displays the states of all the nodes in the cluster. The node must be a member of a cluster and UP in the cftool -l output before this command will succeed:

root@fuji2> cftool -n
Node        Number   State       Os       Cpu
fuji2       1        UP         Linux    Pentium 
fuji3       2        UP         Linux    Pentium

This indicates that the cluster consists of two nodes, fuji2 and fuji3, both of which are UP. If the node has not joined a cluster, the command will wait until the join succeeds.
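
A check of the node states can be sketched as follows. It works on captured cftool -n output and assumes the column layout shown above; the function names node_states and nodes_not_up are hypothetical.

```python
NODE_OUTPUT = """\
Node        Number   State       Os       Cpu
fuji2       1        UP         Linux    Pentium
fuji3       2        UP         Linux    Pentium
"""


def node_states(output):
    """Map each node name to its State value from `cftool -n` output."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    rows = [dict(zip(header, ln.split())) for ln in lines[1:]]
    return {row["Node"]: row["State"] for row in rows}


def nodes_not_up(output):
    """Return the names of nodes whose state is anything other than UP."""
    return [name for name, state in node_states(output).items() if state != "UP"]


print(node_states(NODE_OUTPUT))
print(nodes_not_up(NODE_OUTPUT))
```

Because cftool -n waits when the local node has not yet joined a cluster, a script that runs the command itself would need a timeout around the call.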

cftool -r lists the routes and the current status of the routes as follows:

root@fuji2> cftool -r
Node        Number  Srcdev  Dstdev  Type  State  Destaddr
fuji2       1       1       4       4     UP     00.03.47.c2.a8.82
fuji2       1       1       5       5     UP     00.03.47.c2.a8.cc
fuji3       2       2       4       4     UP     00.03.47.d1.af.ec
fuji3       2       2       5       5     UP     00.03.47.d1.af.ef

This shows that all of the routes are UP. If a route shows a DOWN state, the earlier step of examining the error log should have found an error message associated with the device. At a minimum, the CF error noting that the route is down should appear in the error log. If there is no associated error from the device driver, follow the diagnosis steps covered below.
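
Scanning the route table for DOWN entries can be sketched as follows. The first sample is the output shown above. The second sample is hypothetical: it is the same table with one route changed to DOWN for illustration only. The function name down_routes is also hypothetical.

```python
ROUTE_OUTPUT = """\
Node        Number  Srcdev  Dstdev  Type  State  Destaddr
fuji2       1       1       4       4     UP     00.03.47.c2.a8.82
fuji2       1       1       5       5     UP     00.03.47.c2.a8.cc
fuji3       2       2       4       4     UP     00.03.47.d1.af.ec
fuji3       2       2       5       5     UP     00.03.47.d1.af.ef
"""

# Hypothetical variant: the last route is edited to DOWN for this example.
ROUTE_OUTPUT_WITH_DOWN = ROUTE_OUTPUT.replace(
    "5     UP     00.03.47.d1.af.ef", "5     DOWN   00.03.47.d1.af.ef"
)


def down_routes(output):
    """Return (node, source device, destination device) for each DOWN route."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = [dict(zip(header, ln.split())) for ln in lines[1:]]
    return [
        (row["Node"], row["Srcdev"], row["Dstdev"])
        for row in rows
        if row["State"] == "DOWN"
    ]


print(down_routes(ROUTE_OUTPUT))
print(down_routes(ROUTE_OUTPUT_WITH_DOWN))
```

Each entry returned gives the device numbers to look for when searching the error log for the matching device message.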

The last route to a node is never marked DOWN; it stays in the UP state so that the software can continue to try to access the node. If a node has left the cluster or gone down, there will still be an entry for the node in the route table and one of the routes will still show as UP. Only the cftool -n output shows the state of the nodes, as the following example shows:

root@fuji2> cftool -r
Node        Number  Srcdev  Dstdev  Type  State  Destaddr
fuji3       2       3       2       4     UP     00.03.47.d1.af.ec
fuji2       1       3       3       4     UP     00.03.47.c2.a8.82

root@fuji2> cftool -n
Node        Number  State         Os       Cpu
fuji3       1       LEFTCLUSTER   Linux    Pentium 
fuji2       2       UP            Linux    Pentium
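
Because an UP route does not imply an UP node, a script should cross-check the two outputs. The following sketch does this for the captured outputs in the example above. It assumes the column layouts shown in this section; the function names parse_table and nodes_with_misleading_routes are hypothetical.

```python
ROUTES = """\
Node        Number  Srcdev  Dstdev  Type  State  Destaddr
fuji3       2       3       2       4     UP     00.03.47.d1.af.ec
fuji2       1       3       3       4     UP     00.03.47.c2.a8.82
"""

NODES = """\
Node        Number  State         Os       Cpu
fuji3       1       LEFTCLUSTER   Linux    Pentium
fuji2       2       UP            Linux    Pentium
"""


def parse_table(output):
    """Turn whitespace-separated command output into a list of row dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def nodes_with_misleading_routes(route_output, node_output):
    """Return (node, state) for nodes that have an UP route but are not UP."""
    state_of = {row["Node"]: row["State"] for row in parse_table(node_output)}
    flagged = []
    for route in parse_table(route_output):
        state = state_of.get(route["Node"])
        entry = (route["Node"], state)
        if route["State"] == "UP" and state not in (None, "UP") and entry not in flagged:
            flagged.append(entry)
    return flagged


print(nodes_with_misleading_routes(ROUTES, NODES))
```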