Join problems occur when a node is attempting to become a part of a cluster. The problems covered here are for a node that has previously successfully joined a cluster. If this is the first time that a node is joining a cluster, the PRIMECLUSTER installation manual section on verification covers the issues of initial startup. If this node has previously been a part of the cluster and is now failing to rejoin the cluster, here are some initial steps in identifying the problem.
First, look in the error log and at the console messages for any clue to the problem. Have the Ethernet drivers reported any errors? Any other unusual errors? If there are errors in other parts of the system, the first step is to correct those errors. Once the other errors are corrected, or if there were no errors in other parts of the system, proceed as follows.
Is the CF device driver loaded? The device driver puts a message in the log file when it loads, and the cftool -l command indicates the state of the driver. The log file message looks as follows:
CF: (TRACE): JoinServer: Startup.
cftool -l prints the state of the node as follows:
fuji2# cftool -l
Node Number State Os
fuji2 -- COMINGUP --
This indicates that the driver is loaded and the node is trying to join a cluster. If the log file message above does not appear or the cftool -l command fails, the device driver is not loading. If neither the /var/adm/messages file nor the console indicates why the CF device driver is not loading, the CF kernel binaries or commands might be corrupted, and you might need to uninstall and reinstall CF. The device driver must be loaded before any further steps can be taken.
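The log check described above can be sketched as a small shell function. This is only an illustration: cf_driver_started is a hypothetical helper name, not a PRIMECLUSTER command, and it matches only the "JoinServer: Startup" text because the surrounding prefix varies between log lines.

```shell
# Sketch: report whether the CF startup trace appears in log text.
# cf_driver_started is a hypothetical helper, not part of PRIMECLUSTER.
# It reads log lines on standard input.
cf_driver_started() {
    # Output is redirected instead of using grep -q, which older
    # grep implementations do not accept.
    if grep 'JoinServer: Startup' >/dev/null; then
        echo "startup message found"
    else
        echo "startup message missing"
        return 1
    fi
}

# Example use (file name taken from this section):
#   cf_driver_started < /var/adm/messages
#   cftool -l
```

If the function reports that the message is missing, confirm the result with cftool -l before concluding that the driver did not load.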
After the CF device driver is loaded, it attempts to join a cluster as indicated by the message "CF: (TRACE): JoinServer: Startup." The join server will attempt to contact another node on the configured interconnects. If one or more other nodes have already started a cluster, this node will attempt to join that cluster. The following message in the error log indicates that this has occurred:
CF: Giving UP Mastering (Cluster already Running).
If this message does not appear in the error log, then the node did not see any other node communicating on the configured interconnects and it will start a cluster of its own. The following two messages will indicate that a node has formed its own cluster:
CF: Local Node fuji2 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)
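As a rough aid, the two outcomes described above can be told apart by scanning the log for the corresponding messages. The sketch below assumes the message texts quoted in this section; cf_join_outcome is a hypothetical helper name, and the last matching message in the input decides the result.

```shell
# Sketch: classify the most recent join outcome found in log text.
# cf_join_outcome is a hypothetical helper, not part of PRIMECLUSTER.
# It reads log lines on standard input.
cf_join_outcome() {
    awk '
        /Giving UP Mastering/ { state = "found a running cluster" }
        /Created Cluster/     { state = "formed its own cluster" }
        END {
            if (state == "") state = "no join outcome logged"
            print state
        }'
}

# Example use (file name taken from this section):
#   cf_join_outcome < /var/adm/messages
```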
At this point, we have verified that the CF device driver is loading and the node is attempting to join a cluster. In the following list, problems are described with corrective actions. Find the problem description that most closely matches the symptoms of the node being investigated and follow the steps outlined there.
Information
Note that the log3 prefix is stripped from all of the error message text displayed below. Messages in the error log will appear as follows:
Mar 10 09:47:55 fuji2 unix: LOG3.0952710475 1080024 1014 4 0 1.0 cf:ens CF: Local node is missing a route from node: fuji3
However, they are shown here as follows:
CF: Local node is missing a route from node: fuji3
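To read the log in the same shortened form, the prefix can be stripped with sed. This is a minimal sketch that assumes every CF message carries the "cf:ens" tag seen in the example above; cf_strip_log3 is a hypothetical helper name.

```shell
# Sketch: remove everything up to and including the "cf:ens" tag.
# cf_strip_log3 is a hypothetical helper, not part of PRIMECLUSTER.
# Lines without the tag pass through unchanged.
cf_strip_log3() {
    sed 's/^.* cf:ens //'
}

# Example use (file name taken from this section):
#   cf_strip_log3 < /var/adm/messages
```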
Join problems
Problem:
The node does not join an existing cluster; it forms a cluster of its own.
Diagnosis:
The error log shows the following messages:
CF: (TRACE): JoinServer: Startup.
CF: Local Node fuji4 Created Cluster FUJI. (#0000 1)
CF: Node fuji2 Joined Cluster FUJI. (#0000 1)
This indicates that the CF devices are all operating normally and suggests that the problem is occurring some place in the interconnect. The first step is to determine if the node can see the other nodes in the cluster over the interconnect. Use cftool to send an echo request to all the nodes of the cluster:
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
This shows that node fuji3 sees node fuji2 using interconnect device 3 (Localdev) on fuji3 and device 2 (Srcdev) on fuji2. If cftool -e shows only the node itself, look under the Interconnect Problems heading for the problem "The node only sees itself on the configured interconnects." If some or all of the expected cluster nodes appear in the list, attempt to rejoin the cluster by unloading the CF driver and then reloading it as follows:
fuji2# cfconfig -u
fuji2# cfconfig -l
Note
There is no output from either of these commands, only error messages in the error log.
If this attempt to join the cluster succeeds, then look under the Problem: "The node intermittently fails to join the cluster." If the node did not join the cluster then proceed with the problem below "The node does not join the cluster and some or all nodes respond to cftool -e."
Problem:
The node does not join the cluster and some or all nodes respond to cftool -e.
Diagnosis:
At this point, we know that the CF device is loading properly and that this node can communicate with at least one other node in the cluster. We should now suspect that the interconnect is losing messages. One way to test this hypothesis is to send echo requests repeatedly and see whether the result changes over time, as in the following example:
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
3 3 08.00.20.bd.60.e4 FUJI fuji4 1 6
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
3 3 08.00.20.bd.60.e4 FUJI fuji4 1 6
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
3 3 08.00.20.bd.60.e4 FUJI fuji4 1 6
fuji2# cftool -e
Localdev Srcdev Address Cluster Node Number Joinstate
3 2 08.00.20.ae.33.ef FUJI fuji1 3 6
3 2 08.00.20.bd.5e.a1 FUJI fuji2 2 6
3 3 08.00.20.bd.60.ff FUJI fuji3 1 6
3 3 08.00.20.bd.60.e4 FUJI fuji4 1 6
Notice that node fuji4 does not appear in every echo result. This indicates that the connection to node fuji4 is having errors. Because only this node is exhibiting the symptoms, we focus on that node. First, we need to examine the node to see if the Ethernet utilities on that node show any errors. If we log on to fuji4 and look at the network devices, we see the following:
Number Device Type Speed Mtu State Configured Address
1 /dev/hme0 4 100 1432 UP NO 00.80.17.28.2c.fb
2 /dev/hme1 4 100 1432 UP NO 00.80.17.28.2d.b8
3 /dev/hme2 4 100 1432 UP YES 08.00.20.bd.60.e4
The netstat(1M) utility in Solaris reports information about the network interfaces. The first attempt will show the following:
fuji4# netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 65 0 65 0 0 0
hme0 1500 fuji4 fuji4 764055 8 9175 0 0 0
hme1 1500 fuji4-priva fuji4-priva 2279991 0 2156309 0 7318 0
Notice that the hme2 interface is not shown in this report. This is because Solaris does not report on interconnects that are not configured for TCP/IP. To temporarily make Solaris report on the hme2 interface, enter the ifconfig plumb command as follows:
fuji4# ifconfig hme2 plumb
Repeat the command as follows:
fuji4# netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 65 0 65 0 0 0
hme0 1500 fuji4 fuji4 765105 8 9380 0 0 0
hme1 1500 fuji4-priva fuji4-priva 2282613 0 2158931 0 7319 0
hme2 1500 default 0.0.0.0 752 100 417 0 0 0
Here we can see that the hme2 interface has 100 input errors (Ierrs) out of 752 input packets (Ipkts). This means that roughly one in seven packets (about 13 percent) had an error; this rate is too high for PRIMECLUSTER to use successfully. This also explains why fuji4 sometimes responded to the echo request from fuji2 and sometimes did not.
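The error rate worked out above can also be computed directly from the netstat -i output. The following sketch assumes the column layout shown in this section (Ipkts in the fifth column, Ierrs in the sixth); cf_ierr_rate is a hypothetical helper name.

```shell
# Sketch: print the input error rate per interface as a percentage.
# cf_ierr_rate is a hypothetical helper, not part of PRIMECLUSTER.
# It reads netstat -i output on standard input and assumes the
# column order shown in this section.
cf_ierr_rate() {
    awk '$1 != "Name" && $5 + 0 > 0 {
        printf "%s %.1f%%\n", $1, 100 * $6 / $5
    }'
}

# Example use:
#   netstat -i | cf_ierr_rate
```

For the hme2 line above, this prints a rate of 13.3%, which matches the figure of roughly one in seven packets.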
Point
It is always safe to plumb the interconnect. This will not interfere with the operation of PRIMECLUSTER.
To resolve these errors further, we can look at the undocumented -k option to the Solaris netstat command as follows:
fuji4# netstat -k hme2
hme2:
ipackets 245295 ierrors 2183 opackets 250486 oerrors 0 collisions 0
defer 0 framing 830 crc 1353 sqe 0 code_violations 38 len_errors 0
ifspeed 100 buff 0 oflo 0 uflo 0 missed 0 tx_late_collisions 0
retry_error 0 first_collisions 0 nocarrier 0 inits 15 nocanput 0
allocbfail 0 runt 0 jabber 0 babble 0 tmd_error 0 tx_late_error 0
rx_late_error 0 slv_parity_error 0 tx_parity_error 0 rx_parity_error 0
slv_error_ack 0 tx_error_ack 0 rx_error_ack 0 tx_tag_error 0
rx_tag_error 0 eop_error 0 no_tmds 0 no_tbufs 0 no_rbufs 0
rx_late_collisions 0 rbytes 22563388 obytes 22729418 multircv 0 multixmt 0
brdcstrcv 472 brdcstxmt 36 norcvbuf 0 noxmtbuf 0 phy_failures 0
Most of this information is only useful to specialists for problem resolution. The two statistics that are of interest here are the framing and crc errors. These two error types add up to exactly the number reported in ierrors. Further resolution of this problem consists of trying each of the following steps:
Ensure the Ethernet cable is securely inserted at each end.
Run cftool -e repeatedly and check the netstat -i output. If the cftool results are always the same and the input errors are gone or greatly reduced, the problem is solved.
Replace the Ethernet cable.
Try a different port in the Ethernet hub or switch or replace the hub or switch, or temporarily use a cross-connect cable.
Replace the Ethernet adapter in the node.
If none of these steps resolves the problem, then your support personnel will have to further diagnose the problem.
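The repeated echo test used in this diagnosis can be tallied with a short script rather than compared by eye. This is a sketch under the assumption that the node name is the fifth column of the cftool -e output, as in the examples above; cf_echo_tally is a hypothetical helper name, and the one-second pause in the example is an arbitrary choice.

```shell
# Sketch: count how often each node answers across several echo runs.
# cf_echo_tally is a hypothetical helper, not part of PRIMECLUSTER.
# It reads the concatenated output of repeated cftool -e runs on
# standard input and prints "node count" pairs, sorted by node name.
cf_echo_tally() {
    awk '$1 != "Localdev" && NF >= 7 { seen[$5]++ }
         END { for (n in seen) print n, seen[n] }' | sort
}

# Example use (six runs, one second apart):
#   for i in 1 2 3 4 5 6; do cftool -e; sleep 1; done | cf_echo_tally
```

A node whose count is lower than the number of runs answered only intermittently, as fuji4 did in the example above.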
Problem:
The following console message appears on node fuji2 while node fuji3 is trying to join the cluster with node fuji2:
Mar 10 09:47:55 fuji2 unix: LOG3.0952710475 1080024 1014 4 0 1.0 cf:ens CF: Local node is missing a route from node: fuji3
Mar 10 09:47:55 fuji2 unix: LOG3.0952710475 1080024 1014 4 0 1.0 cf:ens CF: missing route on local device: /dev/hme3
Mar 10 09:47:55 fuji2 unix: LOG3.0952710475 1080024 1014 4 0 1.0 cf:ens CF: Node fuji3 Joined Cluster FUJI. (0 1 0)
Diagnosis:
Look in /var/adm/messages on node fuji2.
Same message as on console.
No console messages on node fuji3.
Look in /var/adm/messages on node fuji3:
fuji2# cftool -d
Number Device Type Speed Mtu State Configured Address
1 /dev/hme0 4 100 1432 UP NO 08.00.06.0d.9f.c5
2 /dev/hme1 4 100 1432 UP YES 00.a0.c9.f0.15.c3
3 /dev/hme2 4 100 1432 UP YES 00.a0.c9.f0.14.fe
4 /dev/hme3 4 100 1432 UP NO 00.a0.c9.f0.14.fd
fuji3# cftool -d
Number Device Type Speed Mtu State Configured Address
1 /dev/hme0 4 100 1432 UP NO 08.00.06.0d.9f.c5
2 /dev/hme1 4 100 1432 UP YES 00.a0.c9.f0.15.c3
3 /dev/hme2 4 100 1432 UP YES 00.a0.c9.f0.14.fe
4 /dev/hme3 4 100 1432 UP YES 00.a0.c9.f0.14.fd
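Comparing the two listings is easier when only the configured devices are shown. The sketch below assumes the cftool -d column layout shown above (device in the second column, Configured in the seventh); cf_configured_devices is a hypothetical helper name.

```shell
# Sketch: list the devices that cftool -d reports as configured.
# cf_configured_devices is a hypothetical helper, not part of
# PRIMECLUSTER. It reads cftool -d output on standard input.
cf_configured_devices() {
    awk '$7 == "YES" { print $2 }'
}

# Example use, run on each node in turn:
#   cftool -d | cf_configured_devices
```

Run on both nodes, the two lists should name the same interconnect devices; a device that appears on only one node is a candidate for the missing route.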
Problem:
/dev/hme3 is not configured on node fuji2.
Mar 10 11:00:28 fuji2 unix: WARNING: hme3: no MII link detected
Mar 10 11:00:53 fuji2 unix: NOTICE: hme3: 100 Mbps full-duplex link up
Diagnosis:
Look in /var/adm/messages on node fuji2:
Mar 10 11:00:28 fuji2 unix: WARNING: hme3: no MII link detected
Mar 10 11:00:31 fuji2 unix: LOG3.0952714831 1080024 1008 4 0 1.0 cf:ens CF: Icf Error: (service err_type route_src route_dst). (0 0 0 0 0 2 0 0 0 3 0 0 0 3 0 0 0)
Mar 10 11:00:53 fuji2 unix: NOTICE: hme3: 100 Mbps full-duplex link up
Mar 10 11:01:11 fuji2 unix: LOG3.0952714871 1080024 1007 5 0 1.0 cf:ens CF (TRACE): Icf: Route UP: node src dest. (0 2 0 0 0 3 0 0 0 3 0 0 0)
Problem:
The hme3 device or interconnect temporarily failed. It could be the NIC on either of the cluster nodes or a cable or hub problem.
Node in LEFTCLUSTER state
If SF is not configured and node fuji2 has panicked and rebooted, the following console message appears on node fuji2:
Mar 10 11:23:41 fuji2 unix: LOG3.0952716221 1080024 1012 4 0 1.0 cf:ens CF: fuji2: busy: local node not down: retrying.
Diagnosis:
Look in /var/adm/messages on node fuji2:
Mar 10 11:23:41 fuji2 unix: LOG3.0952716221 1080024 1007 5 0 1.0 cf:ens CF (TRACE): JoinServer: Startup.
Mar 10 11:23:41 fuji2 unix: LOG3.0952716221 1080024 1009 5 0 1.0 cf:ens CF: Giving UP Mastering (Cluster already Running).
Mar 10 11:23:41 fuji2 unix: LOG3.0952716221 1080024 1012 4 0 1.0 cf:ens CF: Join postponed, server fuji3 is busy.
...last message repeats.
No new messages on console or in /var/adm/messages on fuji2:
fuji2# cftool -n
Node Number State Os Cpu
fuji2 1 LEFTCLUSTER Solaris Sparc
fuji3 2 UP Solaris Sparc
Identified problem:
Node fuji2 has left the cluster and has not been declared DOWN.
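The state check that leads to this conclusion can be sketched as a filter over the cftool -n output. It assumes the column layout shown above (node name first, state third); cf_leftcluster_nodes is a hypothetical helper name. The function only reports nodes and does not declare anything down.

```shell
# Sketch: list the nodes that cftool -n reports as LEFTCLUSTER.
# cf_leftcluster_nodes is a hypothetical helper, not part of
# PRIMECLUSTER. It reads cftool -n output on standard input and
# makes no changes to the cluster.
cf_leftcluster_nodes() {
    awk '$3 == "LEFTCLUSTER" { print $1 }'
}

# Example use:
#   cftool -n | cf_leftcluster_nodes
```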
Fix:
To fix this problem, enter the following command:
# cftool -k
This option will declare a node down. Declaring an operational node down can result in catastrophic consequences, including loss of data in the worst case. If you do not wish to declare a node down, quit this program now.
Enter node number: 1
Enter name for node #1: fuji2
cftool(down): declaring node #1 (fuji2) down
cftool(down): node fuji2 is down
The following console messages then appear on node fuji2:
Mar 10 11:34:21 fuji2 unix: LOG3.0952716861 1080024 1005 5 0 1.0 cf:ens CF: MYCLUSTER: fuji2 is Down. (0 1 0)
Mar 10 11:34:29 fuji2 unix: LOG3.0952716869 1080024 1004 5 0 1.0 cf:ens CF: Node fuji2 Joined Cluster MYCLUSTER. (0 1 0)
The following console message appears on node fuji2:
Mar 10 11:32:37 fuji2 unix: LOG3.0952716757 1080024 1004 5 0 1.0 cf:ens CF: Node fuji2 Joined Cluster MYCLUSTER. (0 1 0)