Performing DR suspends the system. This may affect the PRIMECLUSTER node monitoring facility, and the node may be forcibly stopped as a result. Stop the node monitoring facility of the cluster before performing DR.
In an Oracle VM Server for SPARC environment, perform the operations before and after DR on all guest domains and control domains on which PRIMECLUSTER is configured.
Note
If DR is performed while operations are suspended, the system will be stopped. To avoid stopping the system, fail the operations over (or back) so that they continue running on a node where DR is not performed, and then perform DR on the standby node (a command sketch follows this note).
Estimate in advance how long the system will be suspended due to the DR operation.
In a cluster system, unattended DR operations such as time-scheduled DR are not supported.
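For example, an operational cluster application can be switched to the node on which DR will not be performed by using the "hvswitch" command before starting DR. The application name "app1" and the SysNode name "node1RMS" below are placeholders; replace them with the names used in your configuration.
# hvswitch app1 node1RMS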
Take the following steps to perform DR.
Check the device name for the cluster interconnect.
When removing or replacing a device used for the cluster interconnect, check that devices other than the ones to be removed or replaced exist and that their state is "UP."
# cftool -d
Number Device    Type Speed Mtu  State Configured Address
1      /dev/igb1 4    100   1432 UP    YES        00.00.0e.25.1a.38
2      /dev/igb7 4    100   1432 UP    YES        00.00.0e.25.1a.38
When removing or replacing a device used for the cluster interconnect, remove the device from the cluster interconnect configuration.
# cfrecon -d <device name>
When performing hot swap in a configuration where GDS is used, see "2.2.1 Detaching the disk" to detach the disk from the GDS class.
When performing hot swap in a configuration where GLS is used, see "3.2 Replacement of the system board using the DR" to disconnect the NIC from the multiplex configuration.
When removing, replacing, or moving the system board, perform in advance the steps up to detaching the system board from the physical partition. If the domain must be restarted because the system board is removed, replaced, or moved, restart the domain before the PRIMECLUSTER monitoring facilities are stopped or changed in the following steps. For details on the procedures for removing, replacing, or moving the system board, see "Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 Domain Configuration Guide."
Check the PRIMECLUSTER configuration file name by executing the "hvdisp -n" command on any of the nodes where RMS is running. In the example below, the RMS configuration file name is "config.us."
# hvdisp -n
/opt/SMAW/SMAWRrms/build/config.us
#
Stop PRIMECLUSTER RMS by executing the "hvshut" command on all the nodes. When you answer "yes", PRIMECLUSTER RMS stops; however, the applications defined in the cluster applications remain running.
# hvshut -L
WARNING
-------
The '-L' option of the hvshut command will shut down the RMS
software without bringing down any of the applications.
In this situation, it would be possible to bring up the same
application on another node in the cluster which *may* cause
data corruption.
Do you wish to proceed ? (yes = shut down RMS / no = leave RMS running).
yes
NOTICE: User has been warned of 'hvshut -L' and has elected to proceed.
Add the following line to the "/opt/SMAW/SMAWRrms/bin/hvenv.local" file on all the nodes.
export HV_RCSTART=0
This setting is necessary to prevent RMS from starting automatically right after the OS startup.
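As an illustration only (not part of the formal procedure), the line can be appended and then verified as follows on each node; adjust the commands to your environment.
# echo "export HV_RCSTART=0" >> /opt/SMAW/SMAWRrms/bin/hvenv.local
# grep HV_RCSTART /opt/SMAW/SMAWRrms/bin/hvenv.local
export HV_RCSTART=0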
Stop PRIMECLUSTER SF by executing the "sdtool" command on all the nodes as follows.
# sdtool -e
(SMAWsf, 30, 11) : RCSD returned a successful exit code for this command
Change the timeout value of PRIMECLUSTER CF heartbeat monitoring. Perform the following operation on all the nodes:
Check the currently set timeout value. This value is used later to restore the setting.
# cfset -g CLUSTER_TIMEOUT
From cfset configuration in CF module:
Value for key: CLUSTER_TIMEOUT --->10
#
When the following message is displayed, the timeout value is 10 seconds (default value).
# cfset -g CLUSTER_TIMEOUT
cfset: No matching key found in CF Module
#
Add the following setting to /etc/default/cluster.config.
CLUSTER_TIMEOUT "timeout"
timeout (seconds) = time during which the system is suspended due to DR + time required for the DR operation
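For example, if the system is expected to be suspended for about 180 seconds due to DR and the DR operation itself is expected to take about 420 seconds, a value of at least 600 would be set. These figures are only an illustration; use the values estimated for your environment.
CLUSTER_TIMEOUT "600"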
Execute the following command.
# cfset -r
Check whether the new CF timeout value is in effect.
# cfset -g CLUSTER_TIMEOUT
From cfset configuration in CF module:
Value for key: CLUSTER_TIMEOUT --->timeout
#
Add, remove, replace, or move the system board.
To add the system board, perform the steps up to adding the system board to the physical partition and checking the operation status of the logical domains.
To remove the system board, detach the system board from the physical partition.
To replace the system board, detach the system board from the physical partition, add the replacement system board to the physical partition, and then check the operation status of the logical domains.
To move the system board, perform the steps for adding and removing the system board.
For details on the procedures for adding, removing, replacing, or moving the system board, see "Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 Domain Configuration Guide."
Return the CF heartbeat timeout to its original value on all the nodes as follows:
Change the CLUSTER_TIMEOUT value in /etc/default/cluster.config to the timeout value that was checked in step 9.
Before change:
CLUSTER_TIMEOUT "timeout"
timeout (seconds) = the timeout value that was set in step 9
After change: (when the original timeout value is 10)
CLUSTER_TIMEOUT "10"
Execute the following command.
# cfset -r
Check whether the timeout value is changed correctly.
# cfset -g CLUSTER_TIMEOUT
From cfset configuration in CF module:
Value for key: CLUSTER_TIMEOUT --->10
#
When the building block configuration is expanded from 1BB to multiple BBs, see "5.1.2.1.3 Using the Shutdown Configuration Wizard" in "PRIMECLUSTER Installation and Administration Guide" to reconfigure the takeover IP address of the XSCF as the XSCF IP address registered to SF.
Start PRIMECLUSTER SF. Execute the sdtool command on all nodes as follows.
# sdtool -b
Check if PRIMECLUSTER SF is running.
Select the [Tools]-[Shutdown Facility]-[Show Status] menu from the CF main window of Cluster Admin, then check the "Test State" field on each node.
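The status of the shutdown facility can also be checked from the command line. As a supplementary check (not a substitute for the check above), execute the "sdtool" command on each node and confirm the "Test State" of each shutdown agent; the exact output depends on your configuration.
# sdtool -s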
Start PRIMECLUSTER RMS by executing the "hvcm" command as follows on all the nodes. Specify the RMS configuration file name, which was checked in step 6, with the "-c" option. For example, if the name is "config.us", specify "config".
# hvcm -c config
Starting Reliant Monitor Services now
PRIMECLUSTER RMS must be running on all the nodes. Check if each icon indicating the node state is green (Online) in the RMS main window of Cluster Admin.
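The RMS state can also be checked from the command line. As a supplementary check, the "hvdisp -a" command on each node lists the configured objects and their states; confirm that each SysNode is Online.
# hvdisp -a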
Remove the following line from "/opt/SMAW/SMAWRrms/bin/hvenv.local" on all the nodes.
export HV_RCSTART=0
When adding, replacing, or moving the system board, perform the remaining steps after resuming the use of I/O devices. For details on the procedures for adding, removing, replacing, or moving the system board, see "Fujitsu SPARC M12 and Fujitsu M10/SPARC M10 Domain Configuration Guide."
If the NIC was disconnected from the multiplex configuration in step 4, see "3.2 Replacement of the system board using the DR" to incorporate the NIC back into the multiplex configuration.
If the disk was detached in step 3, see "2.2.3 Re-attaching the disk" to re-attach the detached disk.
When adding or replacing a device for the cluster interconnect, add the device to the cluster interconnect.
# cfrecon -a <device name>
Note
If, while the monitoring of RMS by PRIMECLUSTER is suspended, a node terminates abnormally (panic or reset) or hangs up due to a hardware failure, or the node state becomes LEFTCLUSTER due to a CF timeout, start the cluster applications on a standby node as follows.
If a node terminates abnormally (panic or reset) or hangs up, forcibly shut down the node. After that, wait until the failed node becomes LEFTCLUSTER; this takes as long as the timeout value that was changed in step 9.
Restore the CF heartbeat timeout value as described in step 11.
Start PRIMECLUSTER SF as described in step 13.
If the state of the failed node does not become DOWN, execute the "sdtool -k <CF node name of the failed node>" command so that the state of the failed node becomes DOWN.
# cftool -n
Node   Number State       Os      Cpu
node0  1      UP          Solaris Sparc
node1  2      LEFTCLUSTER Solaris Sparc
# sdtool -k node1
LOG3.013944205091080028 20 6 30 4.5A00 SMAWsf : RCSD returned a successful exit code for this command(sdtool -k node1)
# cftool -n
Node   Number State Os      Cpu
node0  1      UP    Solaris Sparc
node1  2      DOWN  Solaris Sparc
#
If the failed node remains in the UP state, the "sdtool -k" command fails.
Wait until the failed node becomes LEFTCLUSTER.
Start PRIMECLUSTER RMS as described in step 15.
For the operational and standby cluster applications, execute the "hvswitch -f" command to start the cluster applications forcibly.
# hvswitch -f <userApplication>
The use of the -f (force) flag could cause your data to be corrupted and could cause your node to be killed. Do not continue if the result of this forced command is not clear.
The use of force flag of hvswitch overrides the RMS internal security mechanism. In particular RMS does no longer prevent resources, which have been marked as "ClusterExclusive", from coming Online on more than one host in the cluster.
It is recommended to double check the state of all affected resources before continuing.
IMPORTANT: This command may kill nodes on which RMS is not running in order to reduce the risk of data corruption!
Ensure that RMS is running on all other nodes. Or shut down OS of the node on which RMS is not running.
Do you wish to proceed ? (default: no) [yes, no]:yes
Remove the following line from the /opt/SMAW/SMAWRrms/bin/hvenv.local file.
export HV_RCSTART=0
If the node becomes LEFTCLUSTER, take the following steps to clear LEFTCLUSTER.
See "6.2.3 Caused by a cluster partition" in "PRIMECLUSTER CF Configuration and Administration Guide" to clear LEFTCLUSTER manually.
Restore the PRIMECLUSTER CF timeout value as described in step 11.
Start PRIMECLUSTER SF as described in step 13.
Start PRIMECLUSTER RMS as described in step 15.
For the operational and standby cluster applications, execute the "hvswitch -f" command to start the cluster applications forcibly.
# hvswitch -f <userApplication>
The use of the -f (force) flag could cause your data to be corrupted and could cause your node to be killed. Do not continue if the result of this forced command is not clear.
The use of force flag of hvswitch overrides the RMS internal security mechanism. In particular RMS does no longer prevent resources, which have been marked as "ClusterExclusive", from coming Online on more than one host in the cluster.
It is recommended to double check the state of all affected resources before continuing.
IMPORTANT: This command may kill nodes on which RMS is not running in order to reduce the risk of data corruption!
Ensure that RMS is running on all other nodes. Or shut down OS of the node on which RMS is not running.
Do you wish to proceed ? (default: no) [yes, no]:yes
Remove the following line from the /opt/SMAW/SMAWRrms/bin/hvenv.local file.
export HV_RCSTART=0