PRIMECLUSTER Global Disk Services Configuration and Administration Guide 4.7

D.1.4 Class Status Abnormality

If the class status is one of the following statuses, take the actions as indicated for the relevant situation.

(1) Class becomes closed status during operation.

Explanation

A class becomes closed when the number of configuration databases, which store the configuration and status of the objects within the class, is insufficient, or when a communication error occurs between nodes in a cluster environment. All objects within a closed class are inaccessible.

The number of configuration databases is insufficient under any of the following conditions:

  1. Two or fewer disks are in ENABLE status, and none of them can be accessed normally.

  2. Three to five disks are in ENABLE status, and one or fewer can be accessed normally.

  3. Six or more disks are in ENABLE status, and two or fewer can be accessed normally.

However, a root class is closed only when no disks are accessible.
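
The closure conditions above can be expressed as a small predicate. The following is a minimal sketch and is not part of GDS: the function names and parameters are hypothetical, and treating six or more ENABLE disks as the third tier is an assumption about where that boundary lies.

```python
def min_accessible_required(enabled_disks: int) -> int:
    """Return the fewest normally accessible disks that keep a class open.

    The thresholds follow the three conditions listed above. Treating
    six or more ENABLE disks as the third tier is an assumption.
    """
    if enabled_disks <= 2:
        return 1
    if enabled_disks <= 5:
        return 2
    return 3


def class_would_close(enabled_disks: int, accessible_disks: int,
                      root_class: bool = False) -> bool:
    """Hypothetical helper: True if the listed conditions predict closure."""
    if root_class:
        # A root class is closed only when no disk is accessible.
        return accessible_disks == 0
    return accessible_disks < min_accessible_required(enabled_disks)
```

With ten ENABLE disks, class_would_close(10, 2) returns True and class_would_close(10, 3) returns False. The accessible_disks argument should exclude any device that GDS does not count as normally accessible.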

GDS configuration databases cannot be stored in Dell EMC storage unit BCV devices and target (R2) devices since the devices are overwritten by data in copy source disks. Therefore, GDS does not regard BCV devices and target (R2) devices as "disks that can be accessed normally" described in the above conditions.

Resolution

1) Check whether a class was closed during operation as follows. Do not reboot the system or restart the sdxservd daemon before checking, because doing so makes this check impossible.

# /etc/opt/FJSVsdx/bin/sdxdcdown
CLASS   DOWN REASON NDK NEN NDB NLDB DEVNAM
------- ---- ------ --- --- --- ---- -------------------------------
Class1  no   -       10  10   8    0 sda:sdb:sdc:sdd:sde:sdf:sdg:sdh
Class2  yes  Comm    10  10   8    0 sdi:sdj:sdk:sdl:sdm:sdn:sdo:sdp
Class3  yes  FewDB   10  10   1    7 sdq
Class4  yes  NoDB    10  10   0    8 -

In this example, Class2, Class3, and Class4, which show "yes" in the DOWN field, are closed. The causes shown in the REASON field are as follows.

(Cause 1)

Comm Communication failure between nodes.

(Cause 2)

FewDB Insufficient number of valid configuration databases.

(Cause 3)

NoDB No valid configuration database.
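
When many classes exist, the DOWN and REASON fields can be extracted with a short script. The sketch below assumes the whitespace-separated column layout of the example above; the function name, the REASONS table, and the embedded SAMPLE text are illustrative, and the layout of real output should be verified before relying on it.

```python
# Sample text copied from the example above (illustrative only).
SAMPLE = """\
CLASS   DOWN REASON NDK NEN NDB NLDB DEVNAM
------- ---- ------ --- --- --- ---- -------------------------------
Class1  no   -       10  10   8    0 sda:sdb:sdc:sdd:sde:sdf:sdg:sdh
Class2  yes  Comm    10  10   8    0 sdi:sdj:sdk:sdl:sdm:sdn:sdo:sdp
Class3  yes  FewDB   10  10   1    7 sdq
Class4  yes  NoDB    10  10   0    8 -
"""

# REASON codes and their meanings as listed in this section.
REASONS = {
    "Comm": "Communication failure between nodes",
    "FewDB": "Insufficient number of valid configuration databases",
    "NoDB": "No valid configuration database",
}


def closed_classes(text: str) -> dict:
    """Map each closed class name to a description of its REASON code."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 8 or fields[0] == "CLASS" or set(fields[0]) == {"-"}:
            continue
        name, down, reason = fields[:3]
        if down == "yes":
            # Unknown codes are passed through unchanged.
            result[name] = REASONS.get(reason, reason)
    return result
```

Calling closed_classes(SAMPLE) reports Class2, Class3, and Class4 with the meaning of their respective REASON codes.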


2) Depending on specific causes, recovery may be difficult. First, collect the investigation material.

For information on how to collect the investigation material, see "D.2 Collecting Investigation Material."

Resolutions are described for the following two cases:

  1. Closed due to a communication error

  2. Closed due to an insufficient number of configuration databases


3a) In the event of (Cause 1), contact field engineers.


3b) In the event of (Cause 2) or (Cause 3), all (or the majority) of the disks registered with the class have abnormalities.

You can check the disks registered with the class as follows.

# sdxinfo -D -c Class3
OBJ    NAME    TYPE   CLASS   GROUP   DEVNAM  DEVBLKS  DEVCONNECT       STATUS
------ ------- ------ ------- ------- ------- -------- ---------------- -------
disk   Disk31  mirror Class3  Group1  sda      8847360 *                ENABLE
disk   Disk32  mirror Class3  Group1  sdb      8847360 *                ENABLE
disk   Disk33  mirror Class3  Group2  sde      8847360 *                ENABLE
disk   Disk34  mirror Class3  Group2  sdf      8847360 *                ENABLE
disk   Disk35  mirror Class3  Group3  sdc     17793024 *                ENABLE
disk   Disk36  mirror Class3  Group3  sdg     17793024 *                ENABLE
disk   Disk37  mirror Class3  Group4  sdd     17793024 *                ENABLE
disk   Disk38  mirror Class3  Group4  sdh     17793024 *                ENABLE
disk   Disk39  spare  Class3  Group1  sdr     17793024 *                ENABLE
disk   Disk40  spare  Class3  *       sds     17727488 *                ENABLE

In this example, ten disks, Disk31 through Disk40, are registered with Class3. Physical disk names are shown in the DEVNAM field. Identify the cause of the abnormality on these physical disks by referring to the disk driver log messages. The cause is one of the following:

(Failure 1)

Failed or defective non-disk component.

(Failure 2)

Failed disk component.
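
To search the disk driver log for the right devices, it helps to have the physical disk names of the class as a list. The following sketch pulls the DEVNAM field out of text laid out like the sdxinfo -D example above. The function name and the shortened SAMPLE are illustrative, and the nine-column layout is an assumption taken from that example.

```python
# Shortened sample in the layout of the sdxinfo -D example (illustrative only).
SAMPLE = """\
OBJ    NAME    TYPE   CLASS   GROUP   DEVNAM  DEVBLKS  DEVCONNECT       STATUS
------ ------- ------ ------- ------- ------- -------- ---------------- -------
disk   Disk31  mirror Class3  Group1  sda      8847360 *                ENABLE
disk   Disk32  mirror Class3  Group1  sdb      8847360 *                ENABLE
disk   Disk40  spare  Class3  *       sds     17727488 *                ENABLE
"""


def physical_disks(text: str, class_name: str) -> dict:
    """Map each SDX disk name in class_name to its physical device name."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine fields and start with the object type "disk".
        if len(fields) != 9 or fields[0] != "disk":
            continue
        name, cls, devnam = fields[1], fields[3], fields[5]
        if cls == class_name:
            devices[name] = devnam
    return devices
```

Calling physical_disks(SAMPLE, "Class3") yields the Disk31, Disk32, and Disk40 entries with their device names, which can then be matched against the driver messages.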


4b) In the event of (Failure 1), recover the failed or defective non-disk component (such as I/O adapter, I/O cable, I/O controller, power supply, and fan).


5b) For a local class or a shared class, execute the sdxfix command to restore the class status.

# sdxfix -C -c Class3
SDX:sdxfix: INFO: Class3: class recovery completed successfully

6b) Open the GDS configuration parameter file with an editor.

# vim /etc/opt/FJSVsdx/sdx.cf

Add the following line at the end of the file.

SDX_DB_FAIL_NUM=0


7b) Reboot the system.


8b) Confirm that objects within the class are accessible.

# sdxinfo -c Class3

If nothing is displayed, recovery was unsuccessful. You will have to contact field engineers. If information is displayed normally, proceed with the following procedures.


9b) In the event of (Failure 2), where a disk component has failed, follow the procedures in "7.3.1.2 Operation Procedure," or "B.1.8 sdxswap - Swap Disk," and swap the disks.


10b) After completing the recovery for both (Failure 1) and (Failure 2), check the number of valid configuration databases as described below.

# /etc/opt/FJSVsdx/bin/sdxdcdown
CLASS   DOWN REASON NDK NEN NDB NLDB DEVNAM
------- ---- ------ --- --- --- ---- ---------------------------------------
Class1  no   -       10  10   8    0 sda:sdb:sdc:sdd:sde:sdf:sdg:sdh
Class2  no   -       10  10   8    0 sdi:sdj:sdk:sdl:sdm:sdn:sdo:sdp
Class3  no   -       10  10   8    0 sdq:sdr:sds:sdt:sdu:sdv:sdw:sdx
Class4  no   -       10  10   8    0 sdaa:sdab:sdac:sdad:sdae:sdaf:sdag:sdah

The NLDB field shows how many configuration databases are lacking. If this value is "0," the problem is resolved. If this value is "1" or more, some disks have not yet been recovered. In the above example, every NLDB field displays "0," indicating that recovery was successful.
If step 6b) was not performed, the following steps are not required.
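
The NLDB check in step 10b) can be automated when many classes are involved. The sketch below assumes NLDB is the seventh whitespace-separated field, as in the examples in this section. The function name and the two sample strings are illustrative only.

```python
# Samples in the layout of the sdxdcdown examples (illustrative only).
BEFORE = """\
CLASS   DOWN REASON NDK NEN NDB NLDB DEVNAM
------- ---- ------ --- --- --- ---- -------------------------------
Class1  no   -       10  10   8    0 sda:sdb:sdc:sdd:sde:sdf:sdg:sdh
Class3  yes  FewDB   10  10   1    7 sdq
Class4  yes  NoDB    10  10   0    8 -
"""

AFTER = """\
CLASS   DOWN REASON NDK NEN NDB NLDB DEVNAM
------- ---- ------ --- --- --- ---- -------------------------------
Class1  no   -       10  10   8    0 sda:sdb:sdc:sdd:sde:sdf:sdg:sdh
Class3  no   -       10  10   8    0 sdq:sdr:sds:sdt:sdu:sdv:sdw:sdx
"""


def classes_lacking_databases(text: str) -> dict:
    """Map each class with NLDB of 1 or more to its NLDB value."""
    lacking = {}
    for line in text.splitlines():
        fields = line.split()
        # The header and separator rows have no numeric NLDB column.
        if len(fields) < 8 or not fields[6].isdigit():
            continue
        nldb = int(fields[6])
        if nldb > 0:
            lacking[fields[0]] = nldb
    return lacking
```

An empty result, as for AFTER, corresponds to the case where every NLDB field displays "0."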


11b) Open the GDS configuration parameter file with an editor.

# vim /etc/opt/FJSVsdx/sdx.cf

Remove the following line, which was added in step 6b).

SDX_DB_FAIL_NUM=0
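
Steps 6b) and 11b) add and then remove the same line, so a pair of text transformations can make the edit repeatable. This is a hedged sketch rather than a supported tool: the function names are hypothetical, the functions work on the file content as a string, and reading or writing /etc/opt/FJSVsdx/sdx.cf itself is left to the caller.

```python
SETTING = "SDX_DB_FAIL_NUM=0"


def with_setting(text: str) -> str:
    """Return text with SETTING appended as the last line, if not present."""
    lines = text.splitlines()
    if not any(line.strip() == SETTING for line in lines):
        lines.append(SETTING)
    return "\n".join(lines) + "\n"


def without_setting(text: str) -> str:
    """Return text with every line equal to SETTING removed."""
    kept = [line for line in text.splitlines() if line.strip() != SETTING]
    return ("\n".join(kept) + "\n") if kept else ""
```

Applying with_setting twice gives the same result as applying it once, so re-running step 6b) does not duplicate the line. Back up the parameter file before writing any transformed content to it.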


12b) Reboot the system.


If you cannot perform the recovery with the described procedures, contact field engineers.

(2) Class cannot be started when booting the system.

Explanation

When booting the system, if the configuration database that contains the configurations and status of objects in the class is not accessible due to an I/O error of the disk or similar causes, the class remains unstarted.

All objects in a class that remains unstarted are inaccessible. In addition, the sdxinfo command displays neither the objects of an unstarted class nor any other information about that class.

Resolution

1) Identify the cause by referring to the sfdsk and disk driver message output on the console, and recover the class.

2) See (Cause a) and (Cause b) in section "(1) Disk is in DISABLE status." in "D.1.2 Disk Status Abnormality", and take the corresponding resolution measures.

Note

Shutting down the node

If the class is registered in the cluster application, perform the following procedure before shutting down the node.

If this procedure is not performed, the Offline processing of userApplication may not complete and the shutdown process may be left waiting.

  1. Check the status of userApplication on each node.

    # hvdisp -a

    If no userApplication has the status "Unknown", step 2 and the subsequent steps are not required.

    Shut down the node as usual.

  2. Stop RMS and then shut down the node.

    The procedure differs depending on the class status.

    a) If unstarted classes exist on all the nodes

    a-1) Stop every userApplication whose status is not "Unknown" on all the nodes.

    # hvutil -f userApplication_name

    a-2) Stop RMS on all the nodes.

    Execute the following command on all the nodes.

    # hvshut -L

    Enter "yes" for the warning message displayed during execution.

    a-3) Shut down all the nodes.

    b) If there is a node in which all the classes are started

    b-1) If a userApplication whose status is not "Unknown" exists on the node where no class is started, switch that userApplication to the node where the class is started.

    # hvswitch userApplication_name SysNode_name_where_the_class_is_started

    b-2) Stop RMS.

    Execute the following command on the node where no class is started.

    # hvshut -L

    Enter "yes" for the warning message displayed during execution.

    b-3) Shut down the node where no class is started.

For details of each command, see the manual of each command.
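
The branching in the shutdown procedure above can be summarized as a planning function. The following is a minimal sketch under stated assumptions: the function name and its two input dictionaries are hypothetical, node names are assumed to be usable as SysNode names, and the function only returns command strings taken from the steps above without executing anything.

```python
def shutdown_plan(app_status: dict, classes_started: dict) -> list:
    """Return the ordered steps for shutting down nodes with unstarted classes.

    app_status:      {node: {userApplication: status}}, e.g. read from hvdisp -a
    classes_started: {node: True if all classes are started on that node}
    """
    any_unknown = any(status == "Unknown"
                      for apps in app_status.values()
                      for status in apps.values())
    if not any_unknown:
        # Step 1: nothing special is needed.
        return ["shut down the node as usual"]

    healthy = [n for n in classes_started if classes_started[n]]
    affected = [n for n in classes_started if not classes_started[n]]

    steps = []
    for node in affected:
        for app, status in app_status.get(node, {}).items():
            if status == "Unknown":
                continue
            if healthy:
                # Case b-1: switch to a node where the classes are started.
                steps.append(f"{node}: hvswitch {app} {healthy[0]}")
            else:
                # Case a-1: no such node exists, so stop the application.
                steps.append(f"{node}: hvutil -f {app}")
    # Cases a-2 and b-2: stop RMS, answering "yes" to the warning message.
    steps += [f"{node}: hvshut -L" for node in affected]
    # Cases a-3 and b-3: shut down the affected nodes.
    steps += [f"{node}: shut down" for node in affected]
    return steps
```

The plan is meant for review by an operator before any command is run; it does not replace checking the actual hvdisp -a output on each node.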

In any of the following situations, contact field engineers, as the corresponding class must be forcibly removed and then recreated.