PRIMECLUSTER  Cluster Foundation Configuration and Administration Guide 4.3
FUJITSU Software

8.2.1 Join-related problems

Join problems occur when a node attempts to become part of a cluster. The problems covered here are for a node that has previously joined a cluster successfully. If this is the first time that the node is joining a cluster, see the Software Release Guide PRIMECLUSTER and the verification section of the Installation Guide for PRIMECLUSTER, which cover the issues of initial startup. If this node has previously been part of the cluster and is now failing to rejoin it, the following initial steps help identify the problem.

8.2.1.1 Identifying join-related problems

First, look in the error log and at the console messages for any clue to the problem. Have the Ethernet drivers reported any errors? Any other unusual errors? If there are errors in other parts of the system, the first step is to correct those errors. Once the other errors are corrected, or if there were no errors in other parts of the system, proceed as follows.

Is the CF device driver loaded? The device driver puts a message in the log file when it loads and the cftool -l command will indicate the state of the driver. The logfile message looks as follows:

CF: (TRACE): JoinServer: Startup.

cftool -l prints the state of the node as in the following:

root@fuji2> cftool -l
Node        Number  State      Os
fuji2       --      COMINGUP   --

This indicates that the driver is loaded and that the node is trying to join a cluster. If the error log message above does not appear in the log file or the cftool -l command fails, then the device driver is not loading. If there is no indication in the /var/log/messages file or on the console why the CF device driver is not loading, the CF kernel binaries or commands might be corrupted, and you might need to uninstall and reinstall CF. Before any further steps can be taken, the device driver must be loaded.
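The cftool -l check described above can be scripted. The following is a minimal Python sketch, not part of PRIMECLUSTER: the function name parse_cftool_l is hypothetical, and it assumes that the output keeps the header/row layout shown in the example above. It only parses text that has already been captured from the command.

```python
def parse_cftool_l(output):
    """Parse captured `cftool -l` output into a dict, or return None.

    Assumes the layout shown in this section: a header line starting with
    "Node Number State" followed by one data line for the local node.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    for i, cols in enumerate(rows):
        if cols[:3] == ["Node", "Number", "State"] and i + 1 < len(rows):
            header = [name.lower() for name in cols]
            return dict(zip(header, rows[i + 1]))
    return None  # no table found: the command failed or printed nothing


# Sample text copied from the example above.
SAMPLE_L = """\
Node        Number  State      Os
fuji2       --      COMINGUP   --
"""

info = parse_cftool_l(SAMPLE_L)
if info is None:
    print("no node table: the CF driver may not be loaded")
else:
    print("node", info["node"], "is in state", info["state"])
```

A result of None corresponds to the case in which cftool -l fails, which this section treats as a sign that the device driver is not loading.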

After the CF device driver is loaded, it attempts to join a cluster as indicated by the following message:

CF: (TRACE): JoinServer: Startup.

The join server will attempt to contact another node on the configured interconnects. If one or more other nodes have already started a cluster, this node will attempt to join that cluster. The following message in the error log indicates that this has occurred:

CF: Giving UP Mastering (Cluster already Running).

If this message does not appear in the error log, then the node did not see any other node communicating on the configured interconnects and it will start a cluster of its own. The following two messages indicate that a node has formed its own cluster:

CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)
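The message sequence described above can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function name classify_join and the four result labels are hypothetical, and the matching relies only on the message texts quoted in this section. It looks at the messages that follow the most recent JoinServer startup.

```python
def classify_join(log_lines):
    """Classify a join attempt from CF messages in the error log.

    Returns one of four illustrative labels (not CF terminology):
    "no-startup-message", "found-existing-cluster",
    "formed-own-cluster", or "still-joining".
    """
    last_start = None
    for index, line in enumerate(log_lines):
        if "JoinServer: Startup" in line:
            last_start = index
    if last_start is None:
        # Without the startup message the driver is probably not loaded.
        return "no-startup-message"
    tail = log_lines[last_start + 1:]
    if any("Giving UP Mastering" in line for line in tail):
        return "found-existing-cluster"
    if any("Created Cluster" in line for line in tail):
        return "formed-own-cluster"
    return "still-joining"


# Messages copied from the examples in this section.
OWN_CLUSTER_LOG = [
    "CF: (TRACE): JoinServer: Startup.",
    "CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)",
    "CF: Node fuji2 Joined Cluster FUJI. (#0000 1)",
]
print(classify_join(OWN_CLUSTER_LOG))
```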

At this point, we have verified that the CF device driver is loading and the node is attempting to join a cluster. In the following list, problems are described with corrective actions. Find the problem description that most closely matches the symptoms of the node being investigated and follow the steps outlined there.

The following are typical join problems.

Problem

The node does not join an existing cluster; it forms a cluster of its own.

Diagnosis

The error log shows the following messages:

CF: (TRACE): JoinServer: Startup.
CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)

This indicates that the CF devices are all operating normally and suggests that the problem is occurring somewhere in the interconnect. The first step is to determine whether the node can see the other nodes in the cluster over the interconnect. Use cftool(1M) to send an echo request to all the nodes of the cluster:

root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.a8.82  FUJI     fuji2  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji3  1      6

This shows that node fuji3 sees node fuji2 using interconnect device 3 (Localdev) on fuji3 and device 2 (Srcdev) on fuji2. If cftool -e shows only the node itself, continue on in this section. If some or all of the expected cluster nodes appear in the list, attempt to rejoin the cluster by unloading the CF driver and then reloading it as follows:

root@fuji2> cfconfig -u
root@fuji2> cfconfig -l

Note

There is no output from either of these commands, only error messages in the error log.

Problem

The node does not join the cluster and some or all of the nodes respond to cftool -e.

Diagnosis

At this point, we know that the CF device is loading properly and that this node can communicate with at least one other node in the cluster. We should suspect at this point that the interconnect is missing messages. One way to test this hypothesis is to repeatedly send echo requests and see if the result changes over time, for example:

root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
3        3      00.03.47.d1.ae.f9  FUJI     fuji5  1      6
root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
3        3      00.03.47.d1.ae.f9  FUJI     fuji5  1      6
root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
3        3      00.03.47.d1.ae.f9  FUJI     fuji5  1      6
root@fuji2> cftool -e
Localdev Srcdev Address            Cluster  Node   Number Joinstate
3        2      00.03.47.c2.aa.f9  FUJI     fuji2  3      6
3        2      00.03.47.c2.a8.82  FUJI     fuji3  2      6
3        3      00.03.47.d1.af.ec  FUJI     fuji4  1      6
3        3      00.03.47.d1.ae.f9  FUJI     fuji5  1      6

Notice that node fuji5 does not show up in every echo request. This indicates that the connection to node fuji5 is having errors. Because only this node is exhibiting the symptoms, we focus on that node. First, we need to examine the node to see whether its Ethernet interfaces report any errors. We log on to fuji5 and use the netstat(8) or ip(8) command to display the network interface information and errors.
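Comparing repeated echo results by eye becomes tedious with more nodes. The following is a minimal Python sketch, not part of PRIMECLUSTER: the function names nodes_in_echo and intermittent_nodes are hypothetical, and the parsing assumes the seven-column cftool -e layout shown above. It reports the nodes that answered some echo requests but not all of them.

```python
def nodes_in_echo(output):
    """Return the set of node names listed in one captured `cftool -e` output."""
    nodes = set()
    for line in output.splitlines():
        cols = line.split()
        # Data rows have seven columns; skip the header row and prompt lines.
        if len(cols) == 7 and cols[0] != "Localdev":
            nodes.add(cols[4])
    return nodes


def intermittent_nodes(outputs):
    """Return nodes that appear in at least one output but not in all of them."""
    seen = [nodes_in_echo(text) for text in outputs]
    if not seen:
        return set()
    ever = set().union(*seen)
    always = set.intersection(*seen)
    return ever - always


# Two of the runs shown above, copied as captured text.
RUN_WITHOUT_FUJI5 = """\
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 00.03.47.c2.aa.f9 FUJI fuji2 3 6
3 2 00.03.47.c2.a8.82 FUJI fuji3 2 6
3 3 00.03.47.d1.af.ec FUJI fuji4 1 6
"""
RUN_WITH_FUJI5 = RUN_WITHOUT_FUJI5 + "3 3 00.03.47.d1.ae.f9 FUJI fuji5 1 6\n"

print(sorted(intermittent_nodes([RUN_WITHOUT_FUJI5, RUN_WITH_FUJI5])))
```

A node that is reported here is a candidate for the interface checks described in this section; a node that never answers does not appear in the result and has to be investigated separately.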

The netstat(8) or ip(8) command in Linux reports information about the network interfaces.

Further resolution of the problem consists of trying each of the following steps:

If none of these steps resolves the problem, then field engineers will have to further diagnose the problem.

Problem

The following console message appears on node fuji3 while node fuji2 is trying to join the cluster with node fuji3:

Aug 30 21:31:35 fuji3 kernel: CF: Local node is missing a route from node: fuji2.
Aug 30 21:31:35 fuji3 kernel: CF: missing route on local device: eth1.
Aug 30 21:31:35 fuji3 kernel: CF: Node fuji2 Joined Cluster FUJI. (#0000 3)
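The node and device named in these messages can be extracted from a log automatically. The following is a minimal Python sketch under stated assumptions: the function name missing_routes is hypothetical, the patterns are derived only from the two message texts shown above, and the device message is assumed to follow the node message as it does in this example.

```python
import re

# Patterns built from the message texts quoted in this section.
ROUTE_NODE = re.compile(r"CF: Local node is missing a route from node: (\S+?)\.?$")
ROUTE_DEV = re.compile(r"CF: missing route on local device: (\S+?)\.?$")


def missing_routes(log_lines):
    """Return (remote_node, local_device) pairs found in the log lines."""
    pairs = []
    node = None
    for raw in log_lines:
        line = raw.strip()
        match = ROUTE_NODE.search(line)
        if match:
            node = match.group(1)
            continue
        match = ROUTE_DEV.search(line)
        if match and node is not None:
            pairs.append((node, match.group(1)))
            node = None
    return pairs


# Console messages copied from the example above.
CONSOLE = [
    "Aug 30 21:31:35 fuji3 kernel: CF: Local node is missing a route from node: fuji2.",
    "Aug 30 21:31:35 fuji3 kernel: CF: missing route on local device: eth1.",
    "Aug 30 21:31:35 fuji3 kernel: CF: Node fuji2 Joined Cluster FUJI. (#0000 3)",
]
print(missing_routes(CONSOLE))
```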

Diagnosis

Look in /var/log/messages on node fuji2.

Same message as on console.

No console messages on node fuji3.

Look in /var/log/messages on node fuji3.

fuji3:cftool -d
Number  Device  Type  Speed    Mtu      State  Configured  Address
1       eth0    4     100      1432     UP     YES        00.03.47.c2.a8.82
2       eth1    4     100      1432     UP     YES        00.02.b3.88.09.f1
3       eth2    4     100      1432     UP     NO         00.02.b3.88.09.ea
fuji2:cftool -d
Number  Device  Type  Speed    Mtu      State  Configured  Address
1       eth0    4     100      1432     UP     YES        00.03.47.c2.a8.3c
2       eth1    4     100      1432     UP     NO         00.02.b3.88.b8.89
3       eth2    4     100      1432     UP     NO         00.02.b3.88.b7.46

Problem

eth1 is not configured on node fuji2.
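The two cftool -d listings can also be compared mechanically. The following is a minimal Python sketch, not part of PRIMECLUSTER: the function names configured_devices and config_mismatches are hypothetical, the parsing assumes the eight-column layout shown above, and matching by device name assumes that the same device name is attached to the same interconnect on both nodes, which may not hold at every site.

```python
def configured_devices(output):
    """Map device name to True/False from the Configured column of `cftool -d`."""
    result = {}
    for line in output.splitlines():
        cols = line.split()
        # Data rows: Number Device Type Speed Mtu State Configured Address
        if len(cols) == 8 and cols[0].isdigit():
            result[cols[1]] = (cols[6] == "YES")
    return result


def config_mismatches(first, second):
    """Return device names whose Configured value differs between two nodes."""
    shared = first.keys() & second.keys()
    return sorted(dev for dev in shared if first[dev] != second[dev])


# Listings copied from the example above.
FUJI3_D = """\
Number  Device  Type  Speed    Mtu      State  Configured  Address
1       eth0    4     100      1432     UP     YES        00.03.47.c2.a8.82
2       eth1    4     100      1432     UP     YES        00.02.b3.88.09.f1
3       eth2    4     100      1432     UP     NO         00.02.b3.88.09.ea
"""
FUJI2_D = """\
Number  Device  Type  Speed    Mtu      State  Configured  Address
1       eth0    4     100      1432     UP     YES        00.03.47.c2.a8.3c
2       eth1    4     100      1432     UP     NO         00.02.b3.88.b8.89
3       eth2    4     100      1432     UP     NO         00.02.b3.88.b7.46
"""

print(config_mismatches(configured_devices(FUJI3_D), configured_devices(FUJI2_D)))
```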

Diagnosis

Look in /var/log/messages on node fuji3.

Aug 27 16:05:59 fuji3 kernel: e100: eth1 NIC Link is Down
Aug 27 16:06:08 fuji3 kernel: CF: Icf Error: (service err_type route_src route_dst). (#0000 0 2 1 1)
Aug 27 16:06:08 fuji3 kernel: CF: (TRACE): CFSF failure detected: no SFopen: passed  to ENS: fuji2. (#0000 1)
Aug 27 16:06:08 fuji3 kernel: CF: Node fuji2 Left Cluster FUJI. (#0000 1)

Problem

The eth1 device or interconnect temporarily failed. The cause could be the NIC on either of the cluster nodes, a cable, or a hub problem.
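The relationship between a link failure and a node leaving the cluster can be looked for in a log with a short script. The following is a minimal Python sketch under stated assumptions: the function name link_failures_before_leave is hypothetical, and the link-down pattern is modeled on the single e100 driver message shown above, so other NIC drivers may word the message differently.

```python
import re

# Modeled on "kernel: e100: eth1 NIC Link is Down"; other drivers may differ.
LINK_DOWN = re.compile(r"kernel: \S+ (\S+) NIC Link is Down")
LEFT_CLUSTER = re.compile(r"CF: Node (\S+) Left Cluster")


def link_failures_before_leave(log_lines):
    """Pair each "Left Cluster" message with the link-down devices seen before it."""
    events = []
    down_devices = []
    for line in log_lines:
        match = LINK_DOWN.search(line)
        if match:
            down_devices.append(match.group(1))
            continue
        match = LEFT_CLUSTER.search(line)
        if match:
            events.append((match.group(1), list(down_devices)))
            down_devices.clear()
    return events


# Messages copied from the example above.
FUJI3_LOG = [
    "Aug 27 16:05:59 fuji3 kernel: e100: eth1 NIC Link is Down",
    "Aug 27 16:06:08 fuji3 kernel: CF: Icf Error: (service err_type route_src route_dst). (#0000 0 2 1 1)",
    "Aug 27 16:06:08 fuji3 kernel: CF: (TRACE): CFSF failure detected: no SFopen: passed  to ENS: fuji2. (#0000 1)",
    "Aug 27 16:06:08 fuji3 kernel: CF: Node fuji2 Left Cluster FUJI. (#00001)",
]
print(link_failures_before_leave(FUJI3_LOG))
```

An entry with an empty device list means that the node left the cluster without a preceding link-down message in the examined lines, which points away from a simple link failure.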

Node in LEFTCLUSTER state

Node fuji2 panicked and has rebooted. The following console message appears on node fuji2:

Aug 28 10:38:25 fuji2  kernel: CF: fuji2: busy: local node not DOWN: retrying

Diagnosis

Look in /var/log/messages on node fuji2.

Aug 28 10:38:   fuji2  kernel: CF: (TRACE): JoinServer: Startup.
Aug 28 10:38:25 fuji2  kernel: CF: Giving UP Mastering (Cluster already Running).
Aug 28 10:38:25 fuji2  kernel: CF: fuji3: busy: local node not DOWN: retrying

Last message repeats.

No new messages on console or in /var/log/messages on fuji3.

fuji3:cftool -n
Node        Number     State         Os       Cpu
fuji2       1          LEFTCLUSTER   Linux    Pentium 
fuji3       2          UP            Linux    Pentium

Problem

Node fuji2 has left the cluster and has not been declared DOWN.
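Nodes in the LEFTCLUSTER state can be listed from captured cftool -n output. The following is a minimal Python sketch, not part of PRIMECLUSTER: the function name nodes_in_state is hypothetical, and the parsing assumes the column layout shown above. The sketch only reports nodes; deciding whether a node may be declared down remains with the administrator.

```python
def nodes_in_state(output, state):
    """Return (node_name, node_number) pairs whose State column equals `state`."""
    found = []
    for line in output.splitlines():
        cols = line.split()
        # Data rows: Node Number State Os Cpu; the header has no numeric Number.
        if len(cols) >= 3 and cols[1].isdigit() and cols[2] == state:
            found.append((cols[0], int(cols[1])))
    return found


# Listing copied from the example above.
FUJI3_N = """\
Node        Number     State         Os       Cpu
fuji2       1          LEFTCLUSTER   Linux    Pentium
fuji3       2          UP            Linux    Pentium
"""

for name, number in nodes_in_state(FUJI3_N, "LEFTCLUSTER"):
    print("node", name, "number", number, "is in LEFTCLUSTER state")
```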

Fix

cftool -k

This option will declare a node down. Declaring an operational node down can result in catastrophic consequences, including loss of data in the worst case. If you do not wish to declare a node down, quit this program now.

Enter node number: 1
Enter name for node #1: fuji2
cftool(down): declaring node #1 (fuji2) down
cftool(down): node fuji2 is down

The following console messages then appear on node fuji3:

Aug 28 10:47:39 fuji5 kernel: CF: FUJI: fuji2  is Down. (#0000 2)
Aug 28 10:49:09 fuji5 kernel: CF: Node fuji2  Joined Cluster FUJI. (#0000 2)

The following console message appears on node fuji2: