PRIMECLUSTER runs some of its processes in the real-time scheduling class. If you add or remove a system board by using DR, the following message appears:
Dec 25 21:12:41 Real time processes[pid= 4038 4218 4216 4286 4286 4286 4286 4046 4220 4134 4134 4134 4134 4134 4214 4221 4228 4287 4256 4291 4290 4288 4289 5350 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946 12946] is running. Do you continue DR ? [YES]/[NO]
Check that the displayed process IDs belong to the PRIMECLUSTER daemons, and then enter "yes" to continue with DR. (For a multithreaded process, the same process ID is displayed repeatedly.)
The real-time processes started by PRIMECLUSTER are as follows:
PRIMECLUSTER: rcsd, bm, hvdet_* (Note 1)
Note 1:
The detector is started from the bm process as a real-time process, depending on the application configuration. A process whose parent process is bm and whose name starts with "hvdet_" is a PRIMECLUSTER detector.
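To check whether a process ID shown in the message belongs to one of these daemons, you can display its parent process ID and scheduling class with the ps command, for example as follows (a sketch; 4038 is one of the process IDs from the sample message above):
# ps -o pid,ppid,class,comm -p 4038
A PRIMECLUSTER detector shows "RT" in the class column, a command name that starts with "hvdet_", and the process ID of bm as its parent.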
If you use DR frequently, or if you want to automate the DR process, you can suppress the message by creating an associated script and executing the "dr_conf" command. For details, see the "DR Users' Guide".
RMS operation might be suspended while DR is in use. If this occurs, the following warning message might be output:
(SYS, 88): WARNING: No heartbeat from cluster host node0RMS within the last 10 seconds. This may be a temporary problem caused by high system load. RMS will react if this problem persists for 590 seconds more.
This message indicates that the RMS heartbeat monitoring is suspended. If this appears during DR use, you do not have to take any action.
If a system board is added, replaced, or moved between partitions with DR while the cluster system is heavily loaded by cluster applications, the PRIMECLUSTER node monitoring facility may be affected, and a node may be eliminated from the cluster.
To continue operation, stop the node monitoring facility with the following steps before using DR.
Check the RMS configuration file name by executing the "hvdisp -n" command on any node where RMS is running. In the example below, the RMS configuration file name is "config.us".
# hvdisp -n
/opt/SMAW/SMAWRrms/build/config.us
#
Stop RMS by executing the "hvshut" command on all the nodes. Answer "yes"; only RMS stops, and the cluster applications keep running.
# hvshut -L
WARNING
-------
The '-L' option of the hvshut command will shut down the RMS
software without bringing down any of the applications.
In this situation, it would be possible to bring up the same
application on another node in the cluster which *may* cause
data corruption.
Do you wish to proceed ? (yes = shut down RMS / no = leave RMS running).
yes
NOTICE: User has been warned of 'hvshut -L' and has elected to proceed.
Add the following line to the "/opt/SMAW/SMAWRrms/bin/hvenv.local" file on all the nodes so that RMS is not started automatically.
export HV_RCSTART=0
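One way to add the line and confirm it is shown below (a sketch; it assumes hvenv.local does not already contain an HV_RCSTART entry, in which case edit the existing line with a text editor instead):
# echo 'export HV_RCSTART=0' >> /opt/SMAW/SMAWRrms/bin/hvenv.local
# grep HV_RCSTART /opt/SMAW/SMAWRrms/bin/hvenv.local
export HV_RCSTART=0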
Stop SF by executing the "sdtool" command on all the nodes.
# sdtool -e
(SMAWsf, 30, 11) : RCSD returned a successful exit code for this command
Change the timeout value of CF heartbeat monitoring on all the nodes as follows:
Add the following line to the "/etc/default/cluster.config" file on all the nodes to set the CF heartbeat timeout to 600 seconds.
CLUSTER_TIMEOUT "600"
Execute the following command on all the nodes.
# cfset -r
Check that the CF timeout value has taken effect.
# cfset -g CLUSTER_TIMEOUT
From cfset configuration in CF module:
Value for key: CLUSTER_TIMEOUT --->600
#
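If remote command execution between the nodes is allowed, you can also check the value on every node from a single node, for example (a sketch; node0 and node1 are the node names used in the cftool example later in this section, and root ssh access between the nodes is an assumption):
# for n in node0 node1; do ssh $n cfset -g CLUSTER_TIMEOUT; done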
Use DR.
Return the CF heartbeat timeout to the default value on all the nodes as follows:
Change the CLUSTER_TIMEOUT value in /etc/default/cluster.config from 600 back to 10.
Before change:
CLUSTER_TIMEOUT "600"
After change:
CLUSTER_TIMEOUT "10"
Execute the following command on all the nodes.
# cfset -r
Check that the CF timeout value has been returned to the default by executing the following command on all the nodes.
# cfset -g CLUSTER_TIMEOUT
From cfset configuration in CF module:
Value for key: CLUSTER_TIMEOUT --->10
#
Start SF by executing the "sdtool" command on all the nodes.
# sdtool -b
Check that SF is running.
Select the [Tools]-[Shutdown Facility]-[Show Status] menu from the CF main window of Cluster Admin, then check the "Test State" field on each node.
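As an alternative to the GUI, the shutdown facility status can also be displayed from the command line with the "sdtool -s" command, and the "Test State" column checked there in the same way (a sketch; the exact output columns depend on the PRIMECLUSTER version):
# sdtool -s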
Start RMS by executing the "hvcm" command on all the nodes. Specify the RMS configuration file name confirmed in step 1 with the "-c" option. For example, if the name is "/opt/SMAW/SMAWRrms/build/config.us", specify "config".
# hvcm -c config
Starting Reliant Monitor Services now
Make sure that RMS is running on all the nodes. Check that the icon indicating the state of each node is green (Online) in the RMS main window of Cluster Admin.
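The RMS state can also be checked from the command line on each node with the "hvdisp" command (a sketch; check that the node and userApplication objects are reported as Online):
# hvdisp -a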
Remove the following line from "/opt/SMAW/SMAWRrms/bin/hvenv.local" on all the nodes so that automatic RMS startup is enabled again.
export HV_RCSTART=0
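You can confirm that the line has been removed as follows (a sketch; no output from grep means the entry is gone):
# grep HV_RCSTART /opt/SMAW/SMAWRrms/bin/hvenv.local
#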
Note
Be sure to verify the cluster system with the above steps during cluster configuration.
If a node failure such as a panic or reset occurs during steps 3 through 7, or if the state of a node becomes LEFTCLUSTER due to the CF timeout, the cluster applications must be started on a standby node.
When a node ends abnormally (panic or reset) or hangs up, shut down the node forcibly. After that, wait until the failed node becomes LEFTCLUSTER. You need to wait at least as long as the timeout value that was changed in step 4 above.
Check that the failed node is not running, and then perform the following procedure:
Return the timeout value of PRIMECLUSTER CF by using the above step 6.
Start PRIMECLUSTER SF by using the above step 7.
If the state of the failed node does not become DOWN, execute the "sdtool -k <CF node name of the failed node>" command so that the state of the node becomes DOWN.
# cftool -n
Node   Number   State         Os        Cpu
node0  1        UP            Solaris   Sparc
node1  2        LEFTCLUSTER   Solaris   Sparc
# sdtool -k node1
LOG3.013944205091080028 20 6 30 4.3A20 SMAWsf : RCSD returned a successful exit code for this command(sdtool -k node1)
# cftool -n
Node   Number   State   Os        Cpu
node0  1        UP      Solaris   Sparc
node1  2        DOWN    Solaris   Sparc
#
If the failed node remains in the UP state, the "sdtool -k" command fails. In that case, wait until the failed node becomes LEFTCLUSTER.
Start PRIMECLUSTER RMS by using the above steps 6 through 9.
For the operational and standby cluster applications, execute the "hvswitch -f" command to start the cluster applications forcibly.
# hvswitch -f <userApplication>
The use of the -f (force) flag could cause your data to be corrupted and could cause your node to be killed. Do not continue if the result of this forced command is not clear.
The use of force flag of hvswitch overrides the RMS internal security mechanism. In particular RMS does no longer prevent resources, which have been marked as "ClusterExclusive", from coming Online on more than one host in the cluster.
It is recommended to double check the state of all affected resources before continuing.
IMPORTANT: This command may kill nodes on which RMS is not running in order to reduce the risk of data corruption!
Ensure that RMS is running on all other nodes. Or shut down OS of the node on which RMS is not running.
Do you wish to proceed ? (default: no) [yes, no]:yes
Remove the following line from the /opt/SMAW/SMAWRrms/bin/hvenv.local file.
export HV_RCSTART=0
If the node becomes LEFTCLUSTER, take the following steps to clear LEFTCLUSTER.
See "6.2.3 Caused by a cluster partition" in "PRIMECLUSTER CF Configuration and Administration Guide" to clear LEFTCLUSTER manually.
Return the timeout value of PRIMECLUSTER CF by using the above step 6.
Start PRIMECLUSTER SF by using the above step 7.
Start PRIMECLUSTER RMS by using the above step 9.
For the operational and standby cluster applications, execute the "hvswitch -f" command to start the cluster applications forcibly.
# hvswitch -f <userApplication>
The use of the -f (force) flag could cause your data to be corrupted and could cause your node to be killed. Do not continue if the result of this forced command is not clear.
The use of force flag of hvswitch overrides the RMS internal security mechanism. In particular RMS does no longer prevent resources, which have been marked as "ClusterExclusive", from coming Online on more than one host in the cluster.
It is recommended to double check the state of all affected resources before continuing.
IMPORTANT: This command may kill nodes on which RMS is not running in order to reduce the risk of data corruption!
Ensure that RMS is running on all other nodes. Or shut down OS of the node on which RMS is not running.
Do you wish to proceed ? (default: no) [yes, no]:yes
Remove the following line from the /opt/SMAW/SMAWRrms/bin/hvenv.local file.
export HV_RCSTART=0
PRIMECLUSTER does not support unmanned DR operation with time scheduling.