H.3.3 Failure occurs during operation (Common to both Single and Cluster system)

Phenomenon:

Even though the network devices have no errors, the following messages are output and HUB monitoring ends abnormally.

hanet: ERROR: 87000: polling status changed: primary polling failed. (hme0,target=192.13.71.20)
hanet: ERROR: 87100: polling status changed: secondary polling failed. (hme1,target=192.13.71.21)

Cause and how to deal with:

In NIC switching mode, transmission may not be possible immediately after an interface is activated, either because it takes time to establish the data link at the Ethernet level or because of the STP (Spanning Tree Protocol) transfer delay timer. The data link is generally established within several seconds after the interface is activated, but when STP is in use, it may take 30 to 50 seconds before transmission becomes possible because of the STP transfer delay timer.
Therefore, if the link up completion waiting time (default value: 60 seconds) has been shortened with the hanetpoll on command, ping monitoring fails and switching occurs.
In such a case, use the hanetpoll on command to extend the link up waiting time (default value: 60 seconds) according to the transfer delay time.
On a HUB where STP is running, it can take up to twice the transfer delay time (normally 30 seconds) after link up before the next connection becomes possible. The standard link up latency when STP is operating can be derived from the expression below.
To check the STP transfer delay time, see the manual of the HUB you are using.

link up latency > STP transfer delay time x 2 + monitoring period x number of monitoring
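
As a worked example of the expression above, the following sketch computes the smallest whole number of seconds that satisfies it. The function name and the sample values (15-second transfer delay, 5-second monitoring period, 5 monitoring attempts) are illustrative assumptions, not values taken from this manual; use the figures from your HUB manual and your own monitoring settings.

```python
import math


def min_link_up_wait(stp_delay_s: float, period_s: float, attempts: int) -> int:
    """Smallest whole number of seconds strictly greater than
    STP transfer delay x 2 + monitoring period x number of monitoring."""
    bound = stp_delay_s * 2 + period_s * attempts
    return math.floor(bound) + 1


# Illustrative values only (assumptions, not taken from this manual):
# 15 s transfer delay, 5 s monitoring period, 5 monitoring attempts.
print(min_link_up_wait(15, 5, 5))  # 15*2 + 5*5 = 55, so this prints 56
```

If the result exceeds the link up waiting time currently in effect, the waiting time is a candidate for extension as described above.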

Note

To use ping monitoring on a system that runs a firewall, configure the firewall so that ping can pass through it. Otherwise, ping monitoring fails.
The firewall settings must be the same for both the primary and secondary interfaces.

H.3.3.2 Takes time to execute an operation command or to activate a userApplication

Phenomenon:

It takes time to execute an operation command of the Redundant Line Control function.
It takes time to activate a userApplication or to switch nodes in cluster operation.

Cause and how to deal with:

If a host name or IP address specified in the virtual interface information, the monitoring destination information, and so on is not described in the /etc/inet/hosts file, or if "files" is not specified first as the address resolution source in /etc/nsswitch.conf, the internally executed name-address conversion can take time. As a result, it takes time to execute a command or for the cluster state to change. Check that all IP addresses and host names used by the Redundant Line Control function are described in /etc/inet/hosts, and that /etc/inet/hosts is referenced first for name-address conversion.
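
The two checks described above can be sketched as follows. This is a minimal illustration that works on the text of the two files; the function names are hypothetical, matching is exact (case-sensitive), and the host name in the usage example is a placeholder.

```python
def missing_from_hosts(hosts_text: str, names: list[str]) -> list[str]:
    """Return the entries of `names` (IP addresses or host names) that do not
    appear on any non-comment line of the hosts file content."""
    known = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        known.update(fields)  # the address plus every name or alias on the line
    return [n for n in names if n not in known]


def files_is_first(nsswitch_text: str, database: str = "hosts") -> bool:
    """Return True if `files` is the first source listed for `database`."""
    for line in nsswitch_text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry.startswith(database + ":"):
            sources = entry.split(":", 1)[1].split()
            return bool(sources) and sources[0] == "files"
    return False  # no entry for the database was found


# Usage with placeholder content ("hostA" is a hypothetical host name).
hosts = "127.0.0.1 localhost\n192.13.71.20 hostA  # HUB\n"
print(missing_from_hosts(hosts, ["192.13.71.20", "192.13.71.21"]))
print(files_is_first("hosts: dns files\n"))
```

In practice the two strings would be read from /etc/inet/hosts and /etc/nsswitch.conf. A non-empty list from the first function, or False from the second, points to the condition described above.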

H.3.3.3 TCP connection is not divided in GS/SURE linkage mode

Phenomenon:

Even though TCP communication using a virtual IP is being performed in GS/SURE linkage mode, the number of connections is not shown when the dsphanet command is used to display how the connections are divided.

# /opt/FJSVhanet/usr/sbin/dsphanet -c
 Name   IFname Connection
+------+------+----------+
 sha0   sha2        -
        sha1        -
 sha10  sha12       -
        sha11       -

Cause and how to deal with:

To divide TCP connections in GS/SURE linkage mode, it is necessary to define the information of the other system with the hanetobserv command. Protocols other than TCP are not divided. UDP and ICMP are sent according to the route information.
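
To spot this condition in a script, the dsphanet -c output shown above can be scanned for rows whose Connection column is "-". The sketch below is based only on the sample output in this section; the function name is hypothetical, and it assumes that a row with divided connections shows something other than "-" in the last column.

```python
def undivided_interfaces(output: str) -> list[str]:
    """Return the IFname of every row whose Connection column is '-'."""
    result = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the +------+ ruler.
        if len(fields) < 2 or fields[0] == "Name" or fields[0].startswith("+"):
            continue
        # Rows hold (Name, IFname, Connection) or (IFname, Connection),
        # so the last two fields are always IFname and Connection.
        ifname, connection = fields[-2], fields[-1]
        if connection == "-":
            result.append(ifname)
    return result


sample = """\
 Name   IFname Connection
+------+------+----------+
 sha0   sha2        -
        sha1        -
 sha10  sha12       -
        sha11       -
"""
print(undivided_interfaces(sample))  # ['sha2', 'sha1', 'sha12', 'sha11']
```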

H.3.3.4 A virtual driver hang up was detected by the Self-Check function

Phenomenon:

A virtual driver hang up was detected by the Self-Check function.

Cause and how to deal with:

The Self-Check function starts a process for monitoring. If this process cannot run for more than 60 seconds because of high system load, a driver hang up may be detected by mistake.

After the hang up detection message has been output, if the status is displayed normally when using the dsphanet command, the driver is not hung up.

Mistaken hang up detection can be prevented by extending the driver hang up detection time.

Follow the procedure below to extend the hang up detection time.

Edit the settings file and set the detection time.
/etc/opt/FJSVhanet/config/mond.conf
drv_resp 120 <- Add a parameter and set a value.
drv_resp: virtual driver hang up detection time (in seconds)
A value from 1 to 3600 can be specified.

If this line is not added, the default value of 60 is used.
The virtual driver hang up detection cannot be disabled.
Reboot the system.
# /usr/sbin/shutdown -y -g0 -i6
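
The edit in the first step of this procedure can be sketched as a small helper that validates the value and adds or replaces the drv_resp line. It works on the file content as a string; the function name is hypothetical, and the sketch assumes the one "name value" entry per line layout shown above.

```python
def set_drv_resp(conf_text: str, seconds: int) -> str:
    """Return the mond.conf content with drv_resp set to `seconds`."""
    if not 1 <= seconds <= 3600:
        raise ValueError("drv_resp must be between 1 and 3600 seconds")
    new_line = f"drv_resp {seconds}"
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        if line.split()[:1] == ["drv_resp"]:
            lines[i] = new_line  # replace an existing setting
            break
    else:
        lines.append(new_line)  # no existing setting: add the parameter
    return "\n".join(lines) + "\n"


print(set_drv_resp("", 120), end="")  # prints: drv_resp 120
```

The helper only prepares the file content. The new value still requires the system reboot described in this procedure.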

H.3.3.5 ping command to HUB monitoring destination hangs

Phenomenon:

The following error message is output to syslog and NICs are switched.

ERROR: 93100: hangup of ping command has been detected. (target=HUB monitoring destination IP address)

Cause and how to deal with:

This phenomenon occurs when the ping command that is executed by ping monitoring of the HUB monitoring function does not complete within 30 seconds.

The likely cause is a defective NIC or a temporarily high load on the OS.

If the cause is a temporarily high load on the OS, edit the following file to extend the hang up detection time of the ping command.

File: /etc/opt/FJSVhanet/config/ctld.param

#
#  HA-Net Configuration File
#
#       Each entry is of the form:
#
#       <param> <value or string>
#

observ_msg                      0       # suppress observe message
transition_mode                 0       # resource status transition mode
logicalif_takeover_type         1       # takeover Zone and RAC interface
ping_hang_detect_time          90       <- Add a parameter and set a value.

ping_hang_detect_time: ping command hang up detection time (in seconds)

A value from 5 to 300 can be specified.

Setting 0 disables the ping hang detection function.

If this line is not added, the default value of 30 is used.
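
The rules above (5 to 300, 0 to disable, 30 when the line is absent) can be checked before the file is put into use. The following sketch reads the effective value from the file content; the function name is hypothetical, and it assumes the "<param> <value>" layout with "#" comments shown in the listing.

```python
def ping_hang_detect_time(param_text: str) -> int:
    """Return the effective ping hang detection time in seconds.
    30 is returned when the parameter is absent; 0 means disabled."""
    value = 30  # default when the line is not present
    for line in param_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "ping_hang_detect_time":
            value = int(fields[1])
    if value != 0 and not 5 <= value <= 300:
        raise ValueError("ping_hang_detect_time must be 0 or from 5 to 300")
    return value


sample = """\
observ_msg                      0       # suppress observe message
ping_hang_detect_time          90
"""
print(ping_hang_detect_time(sample))  # 90
```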