Join problems occur when a node attempts to become part of a cluster. The problems covered here apply to a node that has previously joined a cluster successfully. If this is the first time the node is joining a cluster, see the Software Release Guide PRIMECLUSTER and the verification section of the Installation Guide for PRIMECLUSTER, which cover initial startup issues. If the node has previously been part of the cluster and now fails to rejoin it, the following initial steps help identify the problem.
First, look in the error log and at the console messages for any clue to the problem. Have the Ethernet drivers reported any errors? Any other unusual errors? If there are errors in other parts of the system, the first step is to correct those errors. Once the other errors are corrected, or if there were no errors in other parts of the system, proceed as follows.
Is the CF device driver loaded? The device driver puts a message in the log file when it loads and the cftool -l command will indicate the state of the driver. The logfile message looks as follows:
CF: (TRACE): JoinServer: Startup.
cftool -l prints the state of the node as in the following:
root@fuji2> cftool -l
Node Number State Os
fuji2 -- COMINGUP --
This indicates that the driver is loaded and that the node is trying to join a cluster. If the log message above does not appear in the log file or the cftool -l command fails, then the device driver is not loading. If neither the /var/log/messages file nor the console indicates why the CF device driver is not loading, the CF kernel binaries or commands may be corrupted, and you might need to uninstall and reinstall CF. The device driver must be loaded before any further steps can be taken.
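The check above can be scripted. The following is a minimal sketch, assuming that cftool -l always prints one header line followed by one line for the local node, as in the transcript shown earlier. The function name parse_cftool_l is hypothetical and is not part of CF.

```python
from typing import Dict


def parse_cftool_l(output: str) -> Dict[str, str]:
    """Map the header names of a `cftool -l` listing to the local node's values.

    Assumes the two-line layout shown in this section (header, then one node line).
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # Drop a pasted shell prompt line such as "root@fuji2> cftool -l".
    lines = [ln for ln in lines if "cftool" not in ln]
    if len(lines) < 2:
        raise ValueError("expected a header line and one node line")
    header = lines[0].split()
    values = lines[1].split()
    return dict(zip(header, values))


sample = """root@fuji2> cftool -l
Node Number State Os
fuji2 -- COMINGUP --"""

info = parse_cftool_l(sample)
print(info["Node"], info["State"])
```

If the command fails or prints nothing, the function raises ValueError, which corresponds to the case in which the driver is not loaded.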
After the CF device driver is loaded, it attempts to join a cluster as indicated by the following message:
CF: (TRACE): JoinServer: Startup
The join server will attempt to contact another node on the configured interconnects. If one or more other nodes have already started a cluster, this node will attempt to join that cluster. The following message in the error log indicates that this has occurred:
CF: Giving UP Mastering (Cluster already Running).
If this message does not appear in the error log, then the node did not see any other node communicating on the configured interconnects and it will start a cluster of its own. The following two messages will indicate that a node has formed its own cluster as follows:
CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)
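The three log messages described above are enough to tell which path the join attempt took. The sketch below classifies a log excerpt by searching for those message fragments. The function name and the four result labels are hypothetical and chosen only for this illustration.

```python
def classify_join(log_text: str) -> str:
    """Classify a CF join attempt from the log messages quoted in this section."""
    if "JoinServer: Startup" not in log_text:
        # The driver never logged its startup message.
        return "driver-not-started"
    if "Giving UP Mastering" in log_text:
        # Another node was seen on the configured interconnects.
        return "found-existing-cluster"
    if "Created Cluster" in log_text:
        # No other node was seen, so the node started its own cluster.
        return "formed-own-cluster"
    return "still-joining"


excerpt = "\n".join([
    "CF: (TRACE): JoinServer: Startup.",
    "CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)",
    "CF: Node fuji2 Joined Cluster FUJI. (#0000 1)",
])
print(classify_join(excerpt))
```

In practice the text would come from the error log, for example the portion of /var/log/messages written since the last boot.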
At this point, we have verified that the CF device driver is loading and the node is attempting to join a cluster. In the following list, problems are described with corrective actions. Find the problem description that most closely matches the symptoms of the node being investigated and follow the steps outlined there.
The following are typical join problems.
Problem
The node does not join an existing cluster; it forms a cluster of its own.
Diagnosis
The error log shows the following messages:
CF: (TRACE): JoinServer: Startup.
CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)
This indicates that the CF devices are all operating normally and suggests that the problem is occurring some place in the interconnect. The first step is to determine if the node can see the other nodes in the cluster over the interconnect. Use cftool(1M) to send an echo request to all the nodes of the cluster:
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.a8.82 FUJI fuji2 2 6
3 3 00.03.47.d1.af.ec FUJI fuji3 1 6
This shows that node fuji3 sees node fuji2 using interconnect device 3 (Localdev) on fuji3 and device 2 (Srcdev) on fuji2. If cftool -e shows only the node itself, continue on in this section. If some or all of the expected cluster nodes appear in the list, attempt to rejoin the cluster by unloading the CF driver and then reloading it as follows:
root@fuji2> cfconfig -u
root@fuji2> cfconfig -l
Note
There is no output from either of these commands, only error messages in the error log.
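To decide which branch applies, it helps to list the nodes other than the local one that answered the echo request. The following sketch parses the seven-column cftool -e listing shown above. The names EchoReply, parse_cftool_e, and other_nodes are hypothetical, and the column layout is assumed to match the transcript in this section.

```python
from typing import List, NamedTuple


class EchoReply(NamedTuple):
    localdev: int
    srcdev: int
    address: str
    cluster: str
    node: str
    number: int
    joinstate: int


def parse_cftool_e(output: str) -> List[EchoReply]:
    """Parse the data rows of a `cftool -e` listing, skipping prompt and header lines."""
    replies = []
    for line in output.splitlines():
        f = line.split()
        # Data rows have exactly seven fields and start with a device number.
        if len(f) != 7 or not f[0].isdigit():
            continue
        replies.append(EchoReply(int(f[0]), int(f[1]), f[2], f[3], f[4], int(f[5]), int(f[6])))
    return replies


def other_nodes(output: str, local_node: str) -> List[str]:
    """Return the sorted names of responding nodes other than the local node."""
    return sorted({r.node for r in parse_cftool_e(output)} - {local_node})


sample = """root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.a8.82 FUJI fuji2 2 6
3 3 00.03.47.d1.af.ec FUJI fuji3 1 6"""
print(other_nodes(sample, "fuji2"))
```

An empty list means that only the node itself responded, which points to the interconnect rather than to the CF driver.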
Problem
The node does not join the cluster, and some or all of the nodes respond to cftool -e.
Diagnosis
At this point, we know that the CF device is loading properly and that this node can communicate with at least one other node in the cluster. We should suspect at this point that the interconnect is missing messages. One way to test this hypothesis is to repeatedly send echo requests and see if the result changes over time, for example:
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6
root@fuji2> cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6
Notice that node fuji5 does not show up in every echo request. This indicates that the connection to node fuji5 is experiencing errors. Because only this node exhibits the symptom, we focus on it. First, we examine the node to see whether its network interfaces report any errors. We log on to fuji5 and use the netstat(8) or ip(8) command, which report information about the network interfaces in Linux, to check the interface statistics and error counts.
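The comparison of repeated echo requests can be automated. The sketch below counts how many runs each node answered and reports the nodes that were missing from at least one run. The function names are hypothetical, and the input is assumed to be a list of captured cftool -e outputs in the format shown above.

```python
from collections import Counter
from typing import Dict, List, Set


def nodes_in_reply(output: str) -> Set[str]:
    """Return the node names listed in one `cftool -e` output."""
    nodes = set()
    for line in output.splitlines():
        fields = line.split()
        # Data rows have seven fields; the node name is the fifth.
        if len(fields) == 7 and fields[0].isdigit():
            nodes.add(fields[4])
    return nodes


def intermittent_nodes(runs: List[str]) -> Dict[str, int]:
    """Map each node that missed at least one run to the number of runs it answered."""
    answered = Counter()
    for output in runs:
        answered.update(nodes_in_reply(output))
    return {node: count for node, count in answered.items() if count < len(runs)}


# The six runs shown above: fuji5 is absent from the first and third.
base = (
    "Localdev Srcdev Address Cluster Node Number Joinstate\n"
    "3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6\n"
    "3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6\n"
    "3 3 00.03.47.d1.af.ec FUJI fuji4 1 6\n"
)
with_fuji5 = base + "3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6\n"
runs = [base, with_fuji5, base, with_fuji5, with_fuji5, with_fuji5]
print(intermittent_nodes(runs))
```

On a live node, each element of runs could be captured with subprocess.run(["cftool", "-e"], capture_output=True, text=True).stdout. A node that never answers at all does not appear in the result, so the expected node list still has to be checked separately.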
Further resolution of the problem consists of trying each of the following steps:
Ensure that the Ethernet cable is securely inserted at each end.
Run cftool -e repeatedly and check the output of netstat -I or ip -s link. If the results of cftool(1M) are always the same and the input errors are gone or greatly reduced, the problem is solved.
Replace the Ethernet cable.
Try a different port in the Ethernet hub or switch or replace the hub or switch, or temporarily use a cross-connect cable.
Replace the Ethernet adapter in the node.
If none of these steps resolves the problem, then field engineers will have to further diagnose the problem.
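To judge whether a step reduced the input errors, the interface counters can be sampled before and after a series of echo requests. The sketch below reads the counters that Linux exposes under /sys/class/net/<interface>/statistics instead of parsing netstat or ip output. The function names and the choice of four counters are assumptions made for this illustration, and eth1 is only an example interface name.

```python
from pathlib import Path
from typing import Dict

# Counters read from sysfs; other files in the same directory could be added.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_counters(iface: str, sysfs_root: str = "/sys/class/net") -> Dict[str, int]:
    """Read the selected error counters for one network interface."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text().strip()) for name in COUNTERS}


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}


# Example with fixed numbers; on a node, use read_counters("eth1") twice instead.
before = {"rx_errors": 5, "tx_errors": 0, "rx_dropped": 2, "tx_dropped": 0}
after = {"rx_errors": 12, "tx_errors": 0, "rx_dropped": 2, "tx_dropped": 0}
print(counter_deltas(before, after))
```

Deltas that stay at or near zero across many echo requests, together with consistent cftool -e results, suggest that the corrective step worked.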
Problem
The following console message appears on node fuji3 while node fuji2 is trying to join the cluster with node fuji3:
Aug 30 21:31:35 fuji3 kernel: CF: Local node is missing a route from node: fuji2.
Aug 30 21:31:35 fuji3 kernel: CF: missing route on local device: eth1.
Aug 30 21:31:35 fuji3 kernel: CF: Node fuji2 Joined Cluster FUJI. (#0000 3)
Diagnosis
Look in /var/log/messages on node fuji2.
Same message as on console.
No console messages on node fuji3.
Look in /var/log/messages on node fuji3.
fuji3:cftool -d
Number Device Type Speed Mtu State Configured Address
1 eth0 4 100 1432 UP YES 00.03.47.c2.a8.82
2 eth1 4 100 1432 UP YES 00.02.b3.88.09.f1
3 eth2 4 100 1432 UP NO 00.02.b3.88.09.ea
fuji2:cftool -d
Number Device Type Speed Mtu State Configured Address
1 eth0 4 100 1432 UP YES 00.03.47.c2.a8.3c
2 eth1 4 100 1432 UP NO 00.02.b3.88.b8.89
3 eth2 4 100 1432 UP NO 00.02.b3.88.b7.46
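The two cftool -d listings can be compared mechanically. The sketch below extracts the devices whose Configured column is YES on each node and reports the differences. It assumes the eight-column layout shown above and that the same device name is attached to the same interconnect on both nodes, which need not hold on every installation. The function names are hypothetical.

```python
from typing import Dict, List, Set


def configured_devices(output: str) -> Set[str]:
    """Return the device names marked Configured=YES in a `cftool -d` listing."""
    devices = set()
    for line in output.splitlines():
        f = line.split()
        # Data rows: Number Device Type Speed Mtu State Configured Address
        if len(f) == 8 and f[0].isdigit() and f[6] == "YES":
            devices.add(f[1])
    return devices


def unmatched_devices(local_output: str, remote_output: str) -> Dict[str, List[str]]:
    """List devices configured on only one of the two nodes."""
    local = configured_devices(local_output)
    remote = configured_devices(remote_output)
    return {"only_local": sorted(local - remote), "only_remote": sorted(remote - local)}


fuji3 = """Number Device Type Speed Mtu State Configured Address
1 eth0 4 100 1432 UP YES 00.03.47.c2.a8.82
2 eth1 4 100 1432 UP YES 00.02.b3.88.09.f1
3 eth2 4 100 1432 UP NO 00.02.b3.88.09.ea"""
fuji2 = """Number Device Type Speed Mtu State Configured Address
1 eth0 4 100 1432 UP YES 00.03.47.c2.a8.3c
2 eth1 4 100 1432 UP NO 00.02.b3.88.b8.89
3 eth2 4 100 1432 UP NO 00.02.b3.88.b7.46"""
print(unmatched_devices(fuji3, fuji2))
```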
Problem
eth1 is not configured on node fuji2.
Diagnosis
Look in /var/log/messages on node fuji3.
Aug 27 16:05:59 fuji3 kernel: e100: eth1 NIC Link is Down
Aug 27 16:06:08 fuji3 kernel: CF: Icf Error: (service err_type route_src route_dst). (#0000 0 2 1 1)
Aug 27 16:06:08 fuji3 kernel: CF: (TRACE): CFSF failure detected: no SFopen: passed to ENS: fuji2. (#0000 1)
Aug 27 16:06:08 fuji3 kernel: CF: Node fuji2 Left Cluster FUJI. (#00001)
Problem
The eth1 device or interconnect temporarily failed. The cause could be the NIC on either of the cluster nodes, a cable, or a hub problem.
Node in LEFTCLUSTER state
Node fuji2 panicked and has rebooted. The following console message appears on node fuji2:
Aug 28 10:38:25 fuji2 kernel: CF: fuji2: busy: local node not DOWN: retrying
Diagnosis
Look in /var/log/messages on node fuji2.
Aug 28 10:38: fuji2 kernel: CF: (TRACE): JoinServer: Startup.
Aug 28 10:38:25 fuji2 kernel: CF: Giving UP Mastering (Cluster already Running).
Aug 28 10:38:25 fuji2 kernel: CF: fuji3: busy: local node not DOWN: retrying
Last message repeats.
No new messages on console or in /var/log/messages on fuji3.
fuji3:cftool -n
Node Number State Os Cpu
fuji2 1 LEFTCLUSTER Linux Pentium
fuji3 2 UP Linux Pentium
Problem
Node fuji2 has left the cluster and has not been declared DOWN.
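Finding nodes in the LEFTCLUSTER state can be scripted by filtering the cftool -n listing. The sketch below only reports such nodes; it deliberately does not declare any node down, because that decision must remain with the administrator. The function name is hypothetical, and the column layout is assumed to match the listing above.

```python
from typing import List, Tuple


def nodes_in_state(output: str, state: str) -> List[Tuple[str, int]]:
    """Return (node name, node number) pairs whose State column equals `state`."""
    matches = []
    for line in output.splitlines():
        f = line.split()
        # Data rows: Node Number State Os Cpu (the header has no numeric second field).
        if len(f) >= 3 and f[1].isdigit() and f[2] == state:
            matches.append((f[0], int(f[1])))
    return matches


sample = """fuji3:cftool -n
Node Number State Os Cpu
fuji2 1 LEFTCLUSTER Linux Pentium
fuji3 2 UP Linux Pentium"""
print(nodes_in_state(sample, "LEFTCLUSTER"))
```

The node number returned here is the value that cftool -k asks for when a node is declared down.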
Fix
cftool -k
This option will declare a node down. Declaring an operational node down can result in catastrophic consequences, including loss of data in the worst case. If you do not wish to declare a node down, quit this program now.
Enter node number: 1
Enter name for node #1: fuji2
cftool(down): declaring node #1 (fuji2) down
cftool(down): node fuji2 is down
The following console messages then appear on node fuji3:
Aug 28 10:47:39 fuji5 kernel: CF: FUJI: fuji2 is Down. (#0000 2)
Aug 28 10:49:09 fuji5 kernel: CF: Node fuji2 Joined Cluster FUJI. (#0000 2)
The following console message appears on node fuji2: