PRIMECLUSTER Installation and Administration Guide 4.5
FUJITSU Software

1.7 Notes When Building a System

This chapter describes notes you should be aware of when building a PRIMECLUSTER system. Be sure to read through them before you start operation.

Synchronize time on all nodes to configure a cluster system

Connect to the NTP server and synchronize time on all nodes.
If the time is not synchronized on all nodes, a cluster may not operate properly.

For example, the following messages may be output. Also, if the OnlinePriority attribute of a cluster application is set, the cluster application may not become Online on the desired node, because the node that was last operating at RMS startup cannot be determined correctly.

(WRP, 34) Cluster host host is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed. Further out-of-sync messages will appear in the syslog.

(WRP, 35) Cluster host host is no longer in time sync with local node. Sane operation of RMS can no longer be guaranteed.
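As a sketch of how to confirm synchronization before starting the cluster, the following commands can be run on each node. The node name node2 is an example; on Oracle Solaris, ntpq is provided with the bundled NTP service.

```shell
# Verify NTP synchronization on this node. The peer marked "*" is the
# current synchronization source; check that the offset column shows
# only a small value (a few milliseconds) on every node.
ntpq -p

# Compare clocks between nodes from one node (remote login such as
# ssh must already be set up; "node2" is an example host name):
ssh node2 date; date
```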

Synchronize time in the slew mode

To synchronize time on each node with NTP, use the slew mode, which always adjusts the time gradually. Do not use the step mode, which adjusts the time rapidly.
For details, see the OS manuals. Rapid time adjustment with NTP, or adjusting the time with the date command while the system is running, causes time inconsistency between nodes, which leads to incorrect operation of the cluster system.
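The following is a minimal sketch of slew-mode settings; the daemon path and option names are common defaults and should be checked against your OS and NTP implementation.

```shell
# ntpd/xntpd: start the daemon with -x so that corrections are always
# slewed rather than stepped (on Oracle Solaris the daemon path is
# typically /usr/lib/inet/ntpd; how the option is passed depends on
# the service configuration of your OS release).
/usr/lib/inet/ntpd -x

# chrony: in chrony.conf, do NOT use the "makestep" directive, which
# allows the clock to be stepped; leaving it out keeps slew mode.
# makestep 1.0 3    <- disable a line like this if present
```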

Do not set Spanning Tree Protocol on cluster interconnects

If Spanning Tree Protocol is enabled on cluster interconnects, communication over them may be temporarily suspended, and heartbeat communication may fail. Disable STP (Spanning Tree Protocol) in the parameter settings of the switching hub ports used for the cluster interconnects.

Do not set a filtering function in routes of cluster interconnects

The cluster interconnects in PRIMECLUSTER bundle multiple lines to perform communication with PRIMECLUSTER's own protocol (ICF protocol). Therefore, they cannot communicate with devices other than cluster nodes connected to the cluster interconnects. Thus, do not set the filtering function in routes of the cluster interconnects.

Set up kernel parameters necessary in a cluster

PRIMECLUSTER operates by using system resources. If these resources are insufficient, PRIMECLUSTER may not operate properly.

The volume of resources used in a system is set as kernel parameters.
It varies depending on the environment in which your system is running. Estimate the required resources based on the operation environment.

Change the kernel parameters before building PRIMECLUSTER.
Whenever you change kernel parameters, be sure to restart the OS.

See

For details on a parameter value, see "Setup (initial configuration)" of PRIMECLUSTER Designsheets.
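As a sketch, kernel parameters on Oracle Solaris have traditionally been set in /etc/system; the parameter names and values below are illustrative only and must be replaced by the values estimated from the Designsheets. Note that on recent Solaris releases some System V parameters are managed through resource controls (projects) instead.

```shell
# Example /etc/system entries (illustrative values only; estimate the
# actual values from "Setup (initial configuration)" of PRIMECLUSTER
# Designsheets for your environment):
#   set shmsys:shminfo_shmmax = 4294967296
#   set semsys:seminfo_semmni = 1024

# Kernel parameter changes take effect only after restarting the OS:
shutdown -y -g0 -i6
```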

Enable the system to collect a system dump or a crash dump

If neither a system dump nor a crash dump can be collected, investigating the cause of a problem may take time, and the root cause may not be identified.

Check that you can collect a system dump and a crash dump before building PRIMECLUSTER.
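On Oracle Solaris, the crash dump configuration can be checked with dumpadm, as sketched below; the device name is an example and must match your environment.

```shell
# Display the current crash dump configuration: dump content, dump
# device, and savecore directory. Make sure a dump device is
# configured and savecore is enabled.
dumpadm

# Example: enable savecore and set the dump device
# (the device path below is an example only):
dumpadm -y -d /dev/dsk/c0t0d0s1
```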

Configure the required Shutdown Facility depending on a server to be used

The required Shutdown Facility varies depending on a server to be used. See "5.1.2 Configuring the Shutdown Facility" to check the required Shutdown Facility according to a server that is to be used. After that, configure it.

Set the time to detect CF heartbeat timeout as necessary

For the time to detect CF heartbeat timeout, consider the operational load at peak hours, and then set the value based on your customer's environment. The value should be between about 10 seconds and 1 minute. The default value is 10 seconds.

See

For the method of setting the time to detect CF heartbeat timeout, see "11.3.1 Changing Time to Detect CF Heartbeat Timeout."
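As a hedged sketch (the exact procedure is in the section referenced above), the CF heartbeat timeout is typically set through the CLUSTER_TIMEOUT entry in /etc/default/cluster.config on every node and reloaded with cfset; the value 30 below is an example.

```shell
# Add or edit the entry in /etc/default/cluster.config on each node:
#   CLUSTER_TIMEOUT "30"

# Then reload the configuration and verify the current settings:
cfset -r          # reload /etc/default/cluster.config
cfset -a          # display the current CF settings
```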

Make sure to set the environment variable RELIANT_SHUT_MIN_WAIT, which specifies the RMS shutdown wait time

The time required to stop RMS and cluster applications varies depending on the environment. Be sure to estimate the value according to your configuration, and then set it.

See

For details on RELIANT_SHUT_MIN_WAIT, see "E.2 Global environment variables" in "PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools Configuration and Administration Guide."

For the method of referring to and changing RMS environment variables, see "E.1 Setting environment variables" in "PRIMECLUSTER Reliant Monitor Services (RMS) with Wizard Tools Configuration and Administration Guide."
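A minimal sketch of setting the variable follows, assuming RMS environment variables are overridden in hvenv.local under the RMS installation directory; the value 300 (seconds) is an example estimated from the time needed to stop all cluster applications.

```shell
# Set RELIANT_SHUT_MIN_WAIT in hvenv.local on every node
# (create the file if it does not exist; 300 is an example value):
cat >> /opt/SMAW/SMAWRrms/bin/hvenv.local <<'EOF'
export RELIANT_SHUT_MIN_WAIT=300
EOF

# The new value takes effect the next time RMS is started.
```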

Set an IP address that remains communicable regardless of the operation status of the cluster as the IP address of the administrative LAN for the Shutdown Facility

If an IP address whose availability changes dynamically with the operation status of the cluster is set, the Shutdown Facility does not operate properly.
For example, if an IP address configured for the NIC switching mode (physical IP address takeover) of Global Link Services (hereinafter GLS) is set for the administrative LAN of the Shutdown Facility, its availability changes when GLS starts or stops. If GLS is stopped and communication is disabled, the Shutdown Facility does not operate properly.

Check the information of XSCF, ILOM, or ALOM used by the Shutdown Facility

If the settings are incorrect, the Shutdown Facility does not operate properly. The method of checking the information varies depending on the server type. Check the server type, and then see the following:

- "5.1.2.1.1 Checking XSCF Information" for SPARC M10 and M12
- "5.1.2.2.1 Checking Console Configuration" for SPARC Enterprise M3000, M4000, M5000, M8000, or M9000
- "5.1.2.3.1 Checking Console Configuration" for SPARC Enterprise T5120, T5220, T5140, T5240, T5440, or SPARC T3, T4, T5, T7, S7 series
- "5.1.2.4.1 Checking Console Configuration" for SPARC Enterprise T1000 or T2000

Moreover, see "5.1.2.5.1 Checking XSCF Information" and "5.1.2.5.2 Checking ILOM Information" for Oracle Solaris Kernel Zones.

Set Locked to the mode switch for SPARC Enterprise Mx000 during operation

If the mode switch is not set properly, the forcible stop by SA_pprcir fails.

When specifying a shared disk unit as the hardware for the patrol diagnosis, set the physical disk name of the shared disk unit to be the same on all nodes.

If the physical disk name of a shared disk unit varies from node to node, you cannot set the shared disk unit as the hardware for the patrol diagnosis.