9.1 Beginning the process

Start the troubleshooting process by gathering information to help identify the causes of problems. You can use the CF log viewer facility from the Cluster Admin GUI, look for messages on the console, or look for messages in the /var/adm/messages file. You can use the cftool(1M) command to check states and configuration information.

To use the CF log viewer, click on the Tools pull-down menu and select View Syslog messages. The log messages are displayed. You can search the logs using a date/time filter or scan for messages based on severity levels. To search based on date/time, use the date/time filter and press the Filter button. To search based on severity levels, click on the Severity button and select the desired severity level. You can also search the log by keyword. To detach the CF log viewer window, click on the Detach button; click on the Attach button to attach it again.

Collect information as follows:

Look for messages on the console that contain the identifier CF.
Look for messages in /var/adm/messages. You might have to look in multiple files (/var/adm/messages.N).
Use cftool as follows:
- cftool -l: Check local node state
- cftool -d: Check device configuration
- cftool -n: Check cluster node states
- cftool -r: Check the route status
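
The log search in the first two steps can be scripted. The following is a minimal Python sketch and not part of PRIMECLUSTER. The function name find_cf_messages is hypothetical, and the sketch assumes that the rotated logs are plain-text files whose names start with "messages" and that CF entries contain the string "CF:", as in the examples in this section.

```python
import glob
import os


def find_cf_messages(log_dir="/var/adm", identifier="CF:"):
    """Return (file name, line) pairs for log lines that contain the identifier.

    Assumption: the current and rotated logs are plain-text files named
    messages, messages.0, messages.1, ... inside log_dir.
    """
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "messages*"))):
        # errors="replace" keeps the scan going if a file has stray bytes.
        with open(path, errors="replace") as log:
            for line in log:
                if identifier in line:
                    hits.append((os.path.basename(path), line.rstrip("\n")))
    return hits
```

Calling find_cf_messages() with no arguments scans /var/adm. Pass another directory to scan copies of the logs collected from a different node.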

Error log messages from CF are always placed in the /var/adm/messages file; some messages may be replicated on the console. Other device drivers and system software may only print errors on the console. To have a complete understanding of the errors on a system, both console and error log messages should be examined. "4.5 Error Messages" in "PRIMECLUSTER Messages" contains messages that can be found in the /var/adm/messages file. This list of messages gives a description of the cause of the error. This information is a good starting point for further diagnosis.

All parts of the system write error messages to this file or to the console, so it is important to look at all of the messages, not just those from the PRIMECLUSTER suite. The following is an example of a CF error message from the /var/adm/messages file:

Nov  9 08:51:45 fuji2 unix:  LOG3.0973788705 1080024   1008 4    0    1.0         cf:ens CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 0 0 0 2 0 0 0 5 0 0 0 5)

The first 80 bytes are the log3 prefix as in the following:

Nov  9 08:51:45 fuji2 unix:  LOG3.0973788705 1080024   1008 4    0    1.0         cf:ens

This part of the message is a standard prefix on each CF message in the log file that gives the date and time, the node name, and log3 specific information. Only the date, time, and node name are important in this context. The remainder is the error message from CF as in the following:

CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 0 0 0 2 0 0 0 5 0 0 0 5)

This message is from the cf:ens service (that is, the Cluster Foundation, Event Notification Service) and the error is CF: Icf Error. This error is described in "5.1.4 Error Messages" in "PRIMECLUSTER Messages" as signifying a missing heartbeat and/or a route down. This points us toward examining the cluster interconnect further. A larger excerpt of the /var/adm/messages file is as follows:

fuji2# tail /var/adm/messages
Nov  9 08:51:45 fuji2 unix: SUNW,pci-gem1: Link Down - cable problem?
Nov  9 08:51:45 fuji2 unix: SUNW,pci-gem0: Link Down - cable problem?
Nov  9 08:51:45 fuji2 unix:  LOG3.0973788705 1080024   1008 4    0    1.0         cf:ens          CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 0 0 0 2 0 0 0 5 0 0 0 5)
Nov  9 08:51:46 fuji2 unix: SUNW,pci-gem0: Link Down - cable problem?
Nov  9 08:51:48 fuji2 last message repeated 1 time
Nov  9 08:51:48 fuji2 unix:  LOG3.0973788708 1080024   1008 4    0    1.0         cf:ens          CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 0 0 0 2 0 0 0 4 0 0 0 4)
Nov  9 08:51:50 fuji2 unix: SUNW,pci-gem0: Link Down - cable problem?
Nov  9 08:51:52 fuji2 last message repeated 1 time
Nov  9 08:51:53 fuji2 unix:  LOG3.0973788713 1080024   1008 4    0    1.0         cf:ens          CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 0 0 0 2 0 0 0 4 0 0 0 4)
Nov  9 08:51:53 fuji2 unix:  LOG3.0973788713 1080024   1015 5    0    1.0         cf:ens          CF: Node fuji2 Left Cluster POKE. (0 0 2)
Nov  9 08:51:53 fuji2 unix: Current Nodee Status = 0

Here we see that there are error messages from the Ethernet controller indicating that the link is down, possibly because of a cable problem. This is the clue we need to solve this problem; the Ethernet used for the interconnect has failed for some reason. The investigation in this case should shift to the cables and hubs to ensure that they are all powered up and securely connected.
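
The manual correlation above, reading the non-CF messages logged around a CF error, can be sketched in Python. This is an illustration only, under several assumptions: the names parse_line and messages_near_cf_errors and the window_seconds parameter are hypothetical, the regular expressions are inferred from the sample lines in this section, and treating any CF message that contains the word "Error" as an error is a heuristic.

```python
import re
from datetime import datetime, timedelta

# Syslog header: "Mmm dd hh:mm:ss node rest-of-line".
HEADER = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")
# CF entries carry a LOG3 prefix, then a "cf:<service>" tag, then the message.
CF_ENTRY = re.compile(r"LOG3\.\S+\s+.*?(cf:\S+)\s+(CF: .*)$")


def parse_line(line, year=2000):
    """Split one log line into (time, node, service, text).

    service is None for lines that are not CF entries. The log carries no
    year, so a placeholder year is supplied only to allow time arithmetic.
    """
    match = HEADER.match(line)
    if match is None:
        return None
    stamp, node, rest = match.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    cf = CF_ENTRY.search(rest)
    if cf:
        return when, node, cf.group(1), cf.group(2)
    return when, node, None, rest


def messages_near_cf_errors(lines, window_seconds=2):
    """Return non-CF messages logged within window_seconds of a CF error."""
    parsed = [p for p in map(parse_line, lines) if p]
    # Heuristic: a CF entry whose text contains "Error" counts as an error.
    error_times = [p[0] for p in parsed if p[2] and "Error" in p[3]]
    window = timedelta(seconds=window_seconds)
    return [
        p[3]
        for p in parsed
        if p[2] is None and any(abs(p[0] - t) <= window for t in error_times)
    ]
```

Applied to the excerpt above, this would surface the Link Down messages from the Ethernet driver next to the Icf Error entries, which is the same clue found by reading the file.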

Several options for the command cftool are listed above as sources for information. Some examples are as follows:

fuji2# cftool -l
Node    Number State       Os      Cpu
fuji2   2      UP          Solaris Sparc

This shows that the local node has joined a cluster as node number 2 and is currently UP. This is the normal state when the cluster is operational. Another possible response is as follows:

fuji2# cftool -l
Node    Number State     Os
fuji2   --     COMINGUP  --

This indicates that the CF driver is loaded and that the node is attempting to join a cluster. If the node stays in this state for more than a few minutes, then something is wrong and we need to examine the /var/adm/messages file. In this case, we see the following:

fuji2# tail /var/adm/messages
May 30 17:36:39 fuji2 unix: pseudo-device: fcp0
May 30 17:36:39 fuji2 unix: fcp0 is /pseudo/fcp@0
May 30 17:36:53 fuji2 unix:  LOG3.0991269413 1080024   1007 5    0    1.0         cf:eventlog     CF: (TRACE): JoinServer: Startup.
May 30 17:36:53 fuji2 unix:  LOG3.0991269413 1080024   1009 5    0    1.0         cf:eventlog     CF: Giving UP Mastering (Cluster already Running).
May 30 17:36:53 fuji2 unix:  LOG3.0991269413 1080024   1006 4    0    1.0         cf:eventlog     CF: fuji4: busy: local node not DOWN: retrying.

We see that this node is in the LEFTCLUSTER state on another node (fuji4). To resolve this condition, see "Chapter 5 LEFTCLUSTER state" for the description and the instructions for resolving the state.
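
The two checks above, reading the State column of cftool -l and looking for the "busy: local node not DOWN" message, can be sketched as follows. This is a hedged Python illustration: the function names local_state and nodes_named_busy are hypothetical, and the message pattern is taken only from the sample log line in this section.

```python
import re

# Pattern inferred from the sample line "CF: fuji4: busy: local node not DOWN".
BUSY = re.compile(r"CF: (\S+): busy: local node not DOWN")


def local_state(cftool_l_output):
    """Return the State column from `cftool -l` output, or None if absent."""
    rows = [ln.split() for ln in cftool_l_output.splitlines() if ln.strip()]
    if len(rows) < 2 or "State" not in rows[0]:
        return None
    header, values = rows[0], rows[1]
    idx = header.index("State")
    return values[idx] if idx < len(values) else None


def nodes_named_busy(log_lines):
    """Return the node names found in 'busy: local node not DOWN' messages."""
    names = []
    for line in log_lines:
        match = BUSY.search(line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names
```

If local_state returns COMINGUP for more than a few minutes and nodes_named_busy returns a name, that node is the one to check for the LEFTCLUSTER state.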

The next option to cftool shows the device states as follows:

fuji2# cftool -d
Number Device    Type Speed    Mtu      State Configured Address
1      /dev/hme0 4    100      1432     UP    YES        00.80.17.28.21.a6
2      /dev/hme3 4    100      1432     UP    YES        08.00.20.ae.33.ef
3      /dev/hme4 4    100      1432     UP    YES        08.00.20.b7.75.8f
4      /dev/ge0  4    1000     1432     UP    YES        08.00.20.b2.1b.a2
5      /dev/ge1  4    1000     1432     UP    YES        08.00.20.b2.1b.b5

Here we can see the interconnects configured for the cluster (the lines with YES in the Configured column). This information shows the names of the devices and the device numbers for use in further troubleshooting steps.
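
To keep the device numbers at hand for later steps, the table can be reduced to the configured interconnects. The following Python sketch assumes the whitespace-separated column layout shown above; the function name configured_devices is hypothetical.

```python
def configured_devices(cftool_d_output):
    """Map device number to device name for rows with YES in Configured."""
    lines = [ln for ln in cftool_d_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    devices = {}
    for line in lines[1:]:
        # Pair each value with its column name from the header row.
        row = dict(zip(header, line.split()))
        if row.get("Configured") == "YES":
            devices[int(row["Number"])] = row["Device"]
    return devices
```

For the output above, the result maps 1 through 5 to /dev/hme0, /dev/hme3, /dev/hme4, /dev/ge0, and /dev/ge1.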

The cftool -n command displays the states of all the nodes in the cluster. The node must be a member of a cluster and UP in the cftool -l output before this command will succeed, as shown in the following:

fuji2# cftool -n
Node    Number State       Os       Cpu
fuji2   1      UP          Solaris  Sparc
fuji3   2      UP          Solaris  Sparc

This indicates that the cluster consists of two nodes, fuji2 and fuji3, both of which are UP. If the node has not joined a cluster, the command will wait until the join succeeds.

cftool -r lists the routes and the current status of the routes as shown in the following example:

fuji2# cftool -r
Node    Number Srcdev Dstdev Type State Destaddr
fuji2    1      4      4      4    UP    08.00.20.b2.1b.cc
fuji2    1      5      5      4    UP    08.00.20.b2.1b.94
fuji3    2      4      4      4    UP    08.00.20.b2.1b.a2
fuji3    2      5      5      4    UP    08.00.20.b2.1b.b5

This shows that all of the routes are UP. If a route shows a DOWN state, then the step above where we examined the error log should have found an error message associated with the device. At a minimum, the CF error noting that the route is down should appear in the error log. If there is no associated error from the device driver, then the diagnosis steps are covered below.
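
Picking out the DOWN routes from a long route table can be sketched in Python as follows. The function name down_routes is hypothetical, and the sketch assumes the column layout shown above. The returned source device numbers can then be looked up in the cftool -d output, on the assumption that Srcdev refers to the Number column there.

```python
def down_routes(cftool_r_output):
    """Return (node, srcdev, dstdev) for every route whose State is DOWN."""
    lines = [ln for ln in cftool_r_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    result = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if row.get("State") == "DOWN":
            result.append((row["Node"], int(row["Srcdev"]), int(row["Dstdev"])))
    return result
```

An empty list means that every listed route is UP, as in the example above.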

The last route to a node is never marked DOWN; it stays in the UP state so that the software can continue to try to access the node. If a node has left the cluster or gone down, there will still be an entry for the node in the route table and one of the routes will still show as UP. Only the cftool -n output shows the state of the nodes, as shown in the following:

fuji2# cftool -r
Node    Number Srcdev Dstdev Type State Destaddr
fuji2   2      3      2      4    UP    08.00.20.bd.5e.a1
fuji3   1      3      3      4    UP    08.00.20.bd.60.e4

fuji2# cftool -n
Node    Number State        Os      Cpu
fuji2   2      UP           Solaris Sparc
fuji3   1      LEFTCLUSTER  Solaris Sparc
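
Because an UP route does not imply an UP node, the two outputs should be read together. The following Python sketch cross-checks them. The names parse_table and nodes_up_in_routes_only are hypothetical, and the parsing assumes the whitespace-separated layout shown above.

```python
def parse_table(output):
    """Turn whitespace-separated command output into a list of row dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def nodes_up_in_routes_only(cftool_r_output, cftool_n_output):
    """Return {node: state} for nodes with an UP route but a non-UP state."""
    states = {row["Node"]: row["State"] for row in parse_table(cftool_n_output)}
    routed = {
        row["Node"]
        for row in parse_table(cftool_r_output)
        if row.get("State") == "UP"
    }
    return {
        node: states[node]
        for node in sorted(routed)
        if node in states and states[node] != "UP"
    }
```

For the two outputs above, the result contains only fuji3 with the state LEFTCLUSTER, even though its route is still listed as UP.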