1.8.1 Linux

1.8.1.1 Physical environment and virtual environment

This section describes the availability of cluster systems in the following physical and virtual environments on Linux.

Cluster system in the physical environment
Cluster system between guest OSes with the Host OS failover function (KVM)
Cluster system between guest OSes on multiple host OSes (KVM)
Cluster system between guest OSes on one host OS (KVM)
Cluster system between guest OSes on multiple compute nodes (RHOSP)
Cluster system between guest OSes on one compute node (RHOSP)
Cluster system between guest OSes on multiple ESXi hosts (VMware)
Cluster system between guest OSes on one ESXi host (VMware)

The table below summarizes, for each monitoring target and each cluster system configuration, whether the service can continue when an error occurs.

Table 1.1 Availability according to each cluster system configuration (in a physical environment and virtual environment)
Monitoring target	Physical server	KVM			RHOSP		VMware
Monitoring target	Physical server	Cluster system between guest OSes with the Host OS failover function	Cluster system between guest OSes on multiple host OSes	Cluster system between guest OSes on one host OS	Cluster system between guest OSes on multiple compute nodes	Cluster system between guest OSes on one compute node	Cluster system between guest OSes on multiple ESXi hosts	Cluster system between guest OSes on one ESXi host
1. Unit	Y	Y	N	N	Y*1	N	Y*2	N
2. Shared disk and path of disk access	Y	Y	Y	N	Y	N	Y	N
3. Public LAN	Y	Y	Y	N	Y	N	Y	N
4. OS (physical, host OS/ESXi host)	Y	Y	N	N	Y*1	N	Y*2	N
5. OS (guest OS)	-	Y	Y	Y	Y	Y	Y*3	Y*4
6. Service (cluster application)	Y	Y	Y	Y	Y	Y	Y	Y

Service continuity when an error occurs: Y = Available, N = Unavailable, - = Excluded

*1 The service can be continued by configuring high availability for compute instances.
For more information on configuring high availability for compute instances, refer to "High Availability for Compute Instances" in "Red Hat OpenStack Platform."

*2 Only when the I/O fencing function is used, or when VMware vCenter Server functional cooperation and VMware vSphere HA are used. If a hang-up is detected in a guest OS and the guest OS cannot be switched to the standby system automatically, the guest OS enters the LEFTCLUSTER state.

*3 When the guest OS cannot be switched to the standby system automatically, the guest OS enters the LEFTCLUSTER state.

*4 The guest OS can be switched automatically only when VMware vCenter Server functional cooperation is used.
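As a reading aid, Table 1.1 can be treated as a lookup structure. The sketch below is only an illustration: the short configuration keys and the function names are hypothetical, and the cell values are copied from the table with their footnote markers intact.

```python
# Column order follows Table 1.1. The short keys are illustrative names only.
CONFIGS = [
    "physical",
    "kvm_host_failover",
    "kvm_multi_host",
    "kvm_single_host",
    "rhosp_multi_node",
    "rhosp_single_node",
    "vmware_multi_host",
    "vmware_single_host",
]

# Each row is copied from Table 1.1; footnote markers (*1 to *4) are kept.
TABLE_1_1 = {
    "Unit": ["Y", "Y", "N", "N", "Y*1", "N", "Y*2", "N"],
    "Shared disk and path of disk access": ["Y", "Y", "Y", "N", "Y", "N", "Y", "N"],
    "Public LAN": ["Y", "Y", "Y", "N", "Y", "N", "Y", "N"],
    "OS (physical, host OS/ESXi host)": ["Y", "Y", "N", "N", "Y*1", "N", "Y*2", "N"],
    "OS (guest OS)": ["-", "Y", "Y", "Y", "Y", "Y", "Y*3", "Y*4"],
    "Service (cluster application)": ["Y"] * 8,
}


def availability(target, config):
    """Return (status, footnote) for one table cell, for example ("Y", "*1")."""
    cell = TABLE_1_1[target][CONFIGS.index(config)]
    status, _, note = cell.partition("*")
    return status, ("*" + note if note else None)


def unprotected_targets(config):
    """List the monitoring targets marked N (Unavailable) for a configuration."""
    return [t for t in TABLE_1_1 if availability(t, config)[0] == "N"]
```

For example, `unprotected_targets("kvm_multi_host")` lists the two rows marked N in the "multiple host OSes" column: Unit and OS (physical, host OS/ESXi host).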

Figure 1.17 Physical environment

Figure 1.18 Virtual environment

For the RHOSP environment, read "host OS" as "compute node". For the VMware environment, read "host OS" as "ESXi host."

How to detect an error in the following targets to be monitored

Unit
On PRIMEQUEST 2000, asynchronous monitoring linked with the Management Board (MMB) immediately detects a panic or a reset triggered by an error in the CPU, memory, or other components, and the service is switched to the standby system. On PRIMEQUEST 3000, asynchronous monitoring linked with iRMC/MMB does the same. On PRIMERGY and in virtual environments, the error is detected by heartbeat monitoring, and the service is switched to the standby system. *1
Shared disk and path of disk access
In combination with the volume management function (GDS), the system detects a failure of disk access or of a disk access path (monitored by the Gds resource). The service is switched to the standby system when the disk cannot be accessed or when all disk access paths have failed.
Public LAN
In combination with the network multiplexing function (Global Link Services, hereinafter referred to as GLS), the system detects a failure of a network adapter or a route in the public LAN (monitored by the Gls resource). The service is switched to the standby system when the entire multiplexed network has failed.
OS (physical and host OS/ESXi host)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system. *1
OS (guest OS)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is switched to the standby system.

*1 For a cluster system between guest OSes on different host OSes (RHOSP, VMware), the guest OS enters the LEFTCLUSTER state. After the guest OS is restarted by the high availability configuration for compute instances (RHOSP) or by the vSphere HA function (VMware), the LEFTCLUSTER state of the guest OS is cleared automatically and the service is switched to the standby system.
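The relationship between heartbeat loss, automatic switchover, and the LEFTCLUSTER state described above can be sketched as a small state transition function. This is a simplified model under stated assumptions, not the actual cluster implementation: the timeout value, the three-state enumeration, and all function and parameter names are hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    UP = "UP"
    LEFTCLUSTER = "LEFTCLUSTER"
    DOWN = "DOWN"  # simplified: the node is confirmed stopped


# Hypothetical value for illustration; the source does not state a timeout.
HEARTBEAT_TIMEOUT_SEC = 10.0


def heartbeat_lost(last_seen, now, timeout=HEARTBEAT_TIMEOUT_SEC):
    """True when no heartbeat has arrived within the timeout window."""
    return (now - last_seen) > timeout


def next_state(state, *, lost, can_switch_automatically, restarted_by_ha=False):
    """Return (new_state, failover) for one monitoring cycle.

    failover is True when the service is switched to the standby system.
    """
    if state is NodeState.UP and lost:
        if can_switch_automatically:
            return NodeState.DOWN, True
        # Automatic switchover is not possible: the node is left in LEFTCLUSTER.
        return NodeState.LEFTCLUSTER, False
    if state is NodeState.LEFTCLUSTER and restarted_by_ha:
        # A restart by the HA function clears LEFTCLUSTER and failover proceeds.
        return NodeState.DOWN, True
    return state, False
```

In this model, a node that loses its heartbeat without automatic switchover stays in LEFTCLUSTER until a restart (`restarted_by_ha=True`) clears the state, which mirrors note *1.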

1.8.1.2 Cloud environment

This section describes the availability of cluster systems in the following cloud environments on Linux.

Cluster system between guest OSes (FJcloud-O)
Cluster system in multiple zones (NIFCLOUD)
Cluster system in a single zone (NIFCLOUD)
Cluster system between Bare Metal servers (FJcloud-Baremetal)
Cluster system in multiple Availability Zones (Multi-AZ) (AWS)
Cluster system in a single Availability Zone (Single-AZ) (AWS)
Cluster system in multiple Availability Zones (Azure)
Cluster system in a single Availability Zone (Azure)

The table below summarizes, for each monitoring target and each cluster system configuration, whether the service can continue when an error occurs.

Table 1.2 Availability according to each cluster system configuration (in a cloud environment)
Monitoring target	FJcloud-O	NIFCLOUD		FJcloud-Baremetal	AWS		Azure
Monitoring target	Cluster system between guest OSes	Cluster system in multiple zones	Cluster system in a single zone	Cluster system between Bare Metal servers	Cluster system in multiple Availability Zones (Multi-AZ)	Cluster system in a single Availability Zone (Single-AZ)	Cluster system in multiple Availability Zones	Cluster system in a single Availability Zone
1. AZ/Zone	N	Y *1	N	- *2	Y	N	Y *1	N
2. Disk	Y	Y	Y	Y	Y	Y	Y	Y
3. Public LAN	Y	Y	Y	Y	Y	Y	Y	Y
4. OS (guest OS)	Y	Y	Y	Y	Y	Y	Y	Y
5. Service (cluster application)	Y	Y	Y	Y	Y	Y	Y	Y
6. Bare Metal server	-	-	-	Y	-	-	-	-

Service continuity when an error occurs: Y = Available, N = Unavailable, - = Excluded

*1 When an error is detected in an AZ (Azure) or a zone (NIFCLOUD), the node enters the LEFTCLUSTER state. To continue operation, recover the node from the LEFTCLUSTER state. For the recovery procedure, refer to "PRIMECLUSTER Cluster Foundation (CF) Configuration and Administration Guide."

*2 There is no AZ in East Japan region 3 and West Japan region 3, where an FJcloud-Baremetal environment is provided.
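The AZ/Zone row of Table 1.2 is the one where the cloud configurations differ, so it may help to spell out how its cells and footnotes combine. The following sketch is illustrative only: the configuration labels, the function name, and the outcome strings are hypothetical, and the cell values are copied from the table.

```python
# AZ/Zone row of Table 1.2, keyed by illustrative configuration labels.
AZ_ZONE_ROW = {
    "FJcloud-O": "N",
    "NIFCLOUD multi-zone": "Y*1",
    "NIFCLOUD single-zone": "N",
    "FJcloud-Baremetal": "-*2",
    "AWS Multi-AZ": "Y",
    "AWS Single-AZ": "N",
    "Azure multi-AZ": "Y*1",
    "Azure single-AZ": "N",
}


def az_failure_outcome(config):
    """Describe service continuity after an AZ or zone error."""
    cell = AZ_ZONE_ROW[config]
    if cell.startswith("-"):
        return "excluded"
    if cell.startswith("N"):
        return "unavailable"
    if cell.endswith("*1"):
        # Footnote *1: the node enters LEFTCLUSTER and must be recovered.
        return "continues after LEFTCLUSTER recovery"
    return "continues"
```

Read this way, only the AWS Multi-AZ configuration is marked Y without a footnote; the NIFCLOUD multi-zone and Azure multi-AZ configurations carry footnote *1 and require recovery from the LEFTCLUSTER state.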

Figure 1.19 FJcloud-O environment

How to detect an error in the following targets to be monitored

AZ
AZ is not a target to be monitored.
Disk
In combination with the volume management function (GDS), the system detects a disk access error (monitored by the Gds resource), and the service is switched to the standby system when the disk cannot be accessed.
Public LAN
In combination with the network multiplexing function (GLS), the system detects a failure of a network adapter or a route in the public LAN (monitored by the Gls resource). The service is switched to the standby system when the entire multiplexed network has failed.
OS (guest OS)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is switched to the standby system.

Figure 1.20 NIFCLOUD environment

How to detect an error in the following targets to be monitored

Zone
The cyclic monitoring of the cluster interconnect detects an error of a zone, and the node becomes LEFTCLUSTER.
Disk
GDS monitors I/O to a disk, and when an error of the disk access occurs, the disk is detached and the service continues.
If an I/O error occurs in all slices in a mirror, the service is automatically switched to the standby system.
Public LAN
The network monitoring using ICMP detects a route failure, and the service is automatically switched to the standby system.
OS (guest OS)
The cyclic monitoring of the cluster interconnect detects an error of the guest OS, and the service is automatically switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is automatically switched to the standby system.
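The disk behavior described for NIFCLOUD (a failing slice is detached and the service continues, and failover happens only when every slice in the mirror has an I/O error) can be sketched as follows. This is a minimal model of the decision rule only; the class, method names, and return strings are hypothetical and do not correspond to any GDS interface.

```python
class Mirror:
    """Tracks which slices of a mirror are still healthy (hypothetical model)."""

    def __init__(self, slices):
        self.healthy = set(slices)
        self.failed = set()

    def report_io_error(self, slice_name):
        """Record an I/O error on one slice and return the resulting action.

        Returns "continue" while at least one healthy slice remains,
        and "failover" once every slice in the mirror has failed.
        """
        if slice_name in self.healthy:
            self.healthy.remove(slice_name)
            self.failed.add(slice_name)
        return "continue" if self.healthy else "failover"
```

With a two-slice mirror, the first reported error yields `"continue"` and an error on the remaining slice yields `"failover"`.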

Figure 1.21 FJcloud-Baremetal environment

How to detect an error in the following targets to be monitored

Disk
In combination with the volume management function (GDS), the system detects a disk access error (monitored by the Gds resource), and the service is switched to the standby system when the disk cannot be accessed.
Public LAN
In combination with the network multiplexing function (GLS), the system detects a failure of a network adapter or a route in the public LAN (monitored by the Gls resource). The service is switched to the standby system when the entire multiplexed network has failed.
OS (guest OS)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is switched to the standby system.
Bare Metal server
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.

See

When using VMware, refer to "1.8.1.1 Physical environment and virtual environment."

Figure 1.22 AWS environment

How to detect an error in the following targets to be monitored

AZ
An error is detected by the heartbeat monitoring, and the service is automatically switched to the standby system.
Disk
In combination with the volume management function (GDS), the system detects a disk access error (monitored by the Gds resource), and the service is switched to the standby system when the disk cannot be accessed.
Public LAN
Control scripts registered in the Cmdline resource detect a route failure, and the service is switched to the standby system when a network failure occurs.
OS (guest OS)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is switched to the standby system.
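For the Public LAN item, a control script registered in the Cmdline resource is what detects the route failure. The sketch below shows one way such a check could look, under assumptions that are not stated in this section: that the check reports a healthy route with exit code 0 and a failure with a non-zero code, that reachability is probed with the Linux `ping` command, and that one reachable target is enough. The function names and the target list are hypothetical.

```python
import subprocess


def ping_once(host, timeout_sec=1):
    """Return True if one ICMP echo to host succeeds (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_sec), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_route(targets, probe=ping_once):
    """Return 0 if at least one target answers, 1 otherwise.

    Assumption: 0 means the route is healthy and non-zero means a failure.
    An empty target list is treated as a failure.
    """
    return 0 if any(probe(t) for t in targets) else 1


def main(argv):
    """Entry point for a wrapper script: argv holds the hosts to probe."""
    return check_route(argv)
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))` so that the exit code reaches the caller. The `probe` parameter exists so the decision rule can be exercised without sending real ICMP packets.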

Figure 1.23 Azure environment

How to detect an error in the following targets to be monitored

AZ
An error is detected by the heartbeat monitoring, and the node becomes LEFTCLUSTER.
Disk
In combination with the volume management function (GDS), the system detects a disk access error (monitored by the Gds resource), and the service is switched to the standby system when the disk cannot be accessed.
Public LAN
Control scripts registered in the Cmdline resource detect a route failure, and the service is switched to the standby system when a network failure occurs.
OS (guest OS)
An error is detected by the heartbeat monitoring, and the service is switched to the standby system.
Service (cluster application)
When a resource error of the cluster application occurs, the service is switched to the standby system.
