8.9.4 Tuning for Optimization of Degrading Operation Using Abnormality Monitoring

Mirroring Controller monitors resources by outputting an error when access to a monitored resource exceeds the timeout time. A short timeout time allows switching or disconnection of the standby server to be performed faster, but it also increases the risk of misdetection, so an appropriate design is required.

You can optimize degrading operation by editing the values of the following parameters in the server definition file to suit your system. Refer to "Appendix A Parameters" for information on how to edit these parameters.

Table 8.6 Parameters
| Parameter | Description |
|---|---|
| Abnormality monitoring interval (heartbeat_interval) | Mirroring Controller is configured so that abnormality monitoring does not place a load on the system. This parameter does not normally need to be set. (The default is 800 milliseconds.) |
| Abnormality monitoring timeout time (heartbeat_timeout) | Set this to the length of time for which a continuous load may be placed on the server or network. For example, this parameter is intended for situations such as running high-load batch jobs, or when a large number of online jobs occur continuously and concurrently. |
| Abnormality monitoring retry times (heartbeat_retry) | Set this parameter when you need a safety margin for situations in which the value specified for heartbeat_timeout is exceeded, for example on systems with fluctuating loads. This parameter does not normally need to be set. (The default is 2 times.) |
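
To reason about how the three parameters interact, the following is a minimal sketch that estimates how long it might take before a failure is reported. The model is an assumption made for illustration only and is not taken from this manual: it assumes one initial attempt plus heartbeat_retry retries, each waiting up to heartbeat_timeout, with heartbeat_interval elapsing between attempts. The unit (seconds) and the value used for heartbeat_timeout are also assumptions, and the class and function names are hypothetical. Check "Appendix A Parameters" for the actual units and behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for the three tuning parameters."""
    interval_ms: int = 800   # heartbeat_interval (default stated in Table 8.6)
    timeout_s: float = 1.0   # heartbeat_timeout (unit and value ASSUMED for this sketch)
    retry: int = 2           # heartbeat_retry (default stated in Table 8.6)


def estimated_detection_seconds(s: HeartbeatSettings) -> float:
    """Rough upper bound under the ASSUMED model described in the lead-in:
    (retry + 1) attempts that each wait timeout_s, separated by interval_ms."""
    attempts = s.retry + 1
    return attempts * s.timeout_s + s.retry * (s.interval_ms / 1000.0)


if __name__ == "__main__":
    baseline = HeartbeatSettings()
    batch_window = HeartbeatSettings(timeout_s=5.0)  # longer timeout for high-load periods
    for name, settings in (("baseline", baseline), ("batch_window", batch_window)):
        print(f"{name}: about {estimated_detection_seconds(settings):.1f} s")
```

Under this assumed model, raising heartbeat_timeout or heartbeat_retry lowers the risk of misdetection but lengthens the estimate, which is the trade-off this section describes.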

The following issue can occur if tuning related to abnormality monitoring is not performed appropriately.

Notes regarding monitoring when the operating system or server crashes or is unresponsive

A crash or unresponsive state of the operating system or server is detected by means of the timeout described above. Therefore, if tuning has not been performed correctly, a server that is in a sound state may be misdetected as having failed, and there is a risk of split-brain occurring.

Split-brain is a phenomenon in which both servers temporarily operate as primary servers, causing data updates to be performed on both servers.

Split-brain detection method

Split-brain has occurred if both of the following conditions are met:

1. When the mc_ctl command is executed in status mode on both servers, the "host_role" of both servers is output as "primary".
2. The following message is output to the event log of one of the servers:
```
promotion processing completed (MCA00062)
```
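
The two conditions above can be combined into a single check. The following is a minimal sketch, assuming that the "host_role" value of each server and the relevant event log lines have already been collected by some other means. The function name and its inputs are hypothetical, and the sketch does not parse the actual output of the mc_ctl command.

```python
PROMOTION_MESSAGE_ID = "MCA00062"  # message ID quoted in condition 2


def is_split_brain(host_roles, event_log_lines):
    """Return True only if both detection conditions hold.

    host_roles: the "host_role" value reported by each of the two servers
                (hypothetical input, extracted beforehand from the status output).
    event_log_lines: event log lines gathered from both servers.
    """
    # Condition 1: both servers report themselves as primary.
    both_primary = len(host_roles) == 2 and all(
        role.strip().lower() == "primary" for role in host_roles
    )
    # Condition 2: the promotion message appears in the log of one of the servers.
    promoted = any(PROMOTION_MESSAGE_ID in line for line in event_log_lines)
    return both_primary and promoted


if __name__ == "__main__":
    log = ["promotion processing completed (MCA00062)"]
    print(is_split_brain(["primary", "primary"], log))
    print(is_split_brain(["primary", "standby"], log))
```

The server whose event log contains the message is the one to treat as the new primary server during recovery.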

How to recover from a split-brain

Use the procedure described below. Note that the new primary server is the server that was confirmed in step 2 of the aforementioned detection method.

1. Stop all applications that are running on the old and new primary servers.
2. Investigate and recover the database.
   Investigate the updates in the database of the old primary server that have not been reflected on the new primary server, and apply them to the new primary server as necessary.
3. Stop the old primary server instance and the Mirroring Controller.
4. Resume the applications that were stopped in step 1.
5. Recover the old primary server.
   While referring to "8.4 Setting up the Standby Server", build (set up) the old primary server as the new standby server, from the new primary server.

Note

The tuning described above affects the time taken from detection of a timeout until the primary server is switched. Therefore, when modifying the values, take the switch or disconnection time into account and use a design in which misdetection does not occur.