
D.1.6 Errors in the Mirroring Among Servers

When one of the following errors occurs, take the actions described below for each error.

(1) The netmirror slice that configures the netmirror volume is in INVALID state.

Explanation

The slice of the netmirror volume becomes INVALID due to the following reasons:

(Cause a)
  • Node stop

  • Error in the network that is used for the mirroring among servers

  • Disk error

(Cause b)

In an Azure environment or a NIFCLOUD environment, the path for the by-id file was specified, and then the iSCSI target was created.

Resolution

The resolution procedure for each of the two causes, (Cause a) and (Cause b), is described below.

a) For (Cause a)
  1. If a node is stopped, start it.

    After the node is started, synchronization copying is performed automatically. When the synchronization copying is completed without an error, the slice is restored to ACTIVE and no further action is necessary.
    If the slice is not restored, proceed to the next step.

  2. Check the status of the network that is used for the mirroring among servers.

    If the network has an error, restore the network.
    After the network is restored, synchronization copying is started automatically within 30 seconds. Wait until all the synchronization copying is completed, that is, until no netmirror slice remains in COPY state. When the synchronization copying is completed without an error, the slice is restored to ACTIVE and no further action is necessary.
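    Whether a COPY slice remains can be checked with the sdxinfo command, for example as follows. The class name cl1 is only an example:

    # sdxinfo -S -c cl1

    While the synchronization copying is in progress, COPY is displayed in the STATUS field of the copy destination slice.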
    If the slice is not restored, proceed to the next step.

  3. Check the class status. If the class is not started, restore it.

    Execute the following command on both nodes to check the class status.

    # /etc/opt/FJSVsdx/bin/sdxdcdown

    If the class information is not displayed, the class is not started.
    In this case, see the note "Shutting down the node" of the section "(2) Class cannot be started when booting the system." in "D.1.4 Class Status Abnormality", stop the node in question, and then restart it.
    If the information is still not displayed after the node is restarted, see the section "(2) Class cannot be started when booting the system." in "D.1.4 Class Status Abnormality" to restore the class.

  4. Check the disk status.
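    The disk status can be checked with the sdxinfo command, for example as follows. The class name cl1 is only an example:

    # sdxinfo -D -c cl1

    If a value other than ENABLE (for example, DISABLE or SWAP) is displayed in the STATUS field of a disk that is connected with the netmirror group, the disk may have a failure.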

  5. If the disk has a failure and the class is closed at the same time, restore the closed class.

    5-1) Check if the class is closed or not on both nodes.

    Execute the following command on both nodes. If "yes" is displayed in the DOWN field, the class is closed.

    # /etc/opt/FJSVsdx/bin/sdxdcdown

    5-2) Restart the node where the class is closed.

    5-3) If the cluster application in which the closed class is registered has been started on a node other than this node, cancel the start lock of the netmirror volume on this node.

    # sdxattr -V -c class_name -v volume_name -a lock=off
  6. If the disk has a failure, swap the disk.

    For how to swap the disk, see "7.3.3 Swapping Disks of Netmirror Group."

    However, if the error status matches the conditions described in the following troubleshooting items, see "Resolution" of each item to restore the error status:

    Perform the procedure described in "7.3.3 Swapping Disks of Netmirror Group" first, before executing "restore physical disk" (the sdxswap -I command) according to "Resolution" in the above sections.

  7. If the disk does not have an error, execute the sdxinfo -D command on both nodes, and then check if the physical disk name is properly displayed in the DEVNAM field.

    When the asterisk (*) is displayed in the DEVNAM field on one of the nodes, follow the procedure described in "(4) The slice becomes INVALID or the operation stops after restarting the other node when the network error occurs." in "D.1.6 Errors in the Mirroring Among Servers" to restore the error status.

    If the asterisk (*) is displayed in the DEVNAM field on both nodes, follow the procedure described in "(6) The class is closed on the last started node at the startup of both nodes." in "D.1.6 Errors in the Mirroring Among Servers" to restore the error status.

  8. If the error status is not restored after taking steps 1 to 7, check whether either of the following errors has occurred, and if so, perform the corresponding resolution.

    • When errors and recoveries occur repeatedly in the network or the node, or when I/O errors and network recoveries occur consecutively due to intermittent failures.
      Check output such as the system log to check the status of the network and the node. If an error is detected, restore it. Then, to perform synchronization copying of the netmirror volume, execute the following command on any one node.

      # sdxcopy -B -c class_name -v volume_name
    • When it cannot be determined which of the disks on the two nodes contains the newer data.
      For details on conditions for this state and restoration method, see "7.16.8 Restoration when the Latest Disk Cannot Be Selected Automatically."

  9. If the slice is not restored after taking steps 1 to 8, collect the investigation material and contact field engineers.

b) For (Cause b)

Take the following steps to remove the configuration, specify the path for the by-partuuid file, and then create the iSCSI target again.

  1. Restore the disk.

    See "7.3.3 Swapping Disks of Netmirror Group."
    You can choose which of the procedures described in "7.3.3.1 Hot Swap" and "7.3.3.2 Cold Swap" to perform.

    Note that the following steps are not required because you do not need to swap the disk.

  2. Back up the volume data if necessary.
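    For example, the volume data can be backed up to a file with the dd command while the netmirror volume is started; any backup method that suits the environment can be used. The device path and the backup file name below are only examples:

    # dd if=/dev/sfdsk/class_name/dsk/volume_name of=/backup/volume_name.bak bs=1M

    The data backed up here is restored in step 6.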

  3. Remove the configuration of GDS and the iSCSI settings used for the mirroring among servers.

    For details, see "Chapter 9 Removing Configuration."

  4. Create a partition in the disk where the gpt disk label is set, specify the path for the by-partuuid file for that partition, and then create the iSCSI target.

    For details, see "4.8.3 Creating iSCSI Target."

  5. Set up GDS.

    For details, see "Chapter 6 Settings."

  6. Restore the volume data if necessary.

(2) The cluster application becomes Faulted or Inconsistent after the node is restarted.

Explanation

Among the two slices that belong to the netmirror volume, if one slice is in ACTIVE or STOP state and the other slice is in any other state, this error occurs when the node where the former slice exists is restarted.


At this time, the slice that contains the valid data does not exist in the netmirror volume. In this case, the class resource on the node, which has not been restarted, becomes OFF-FAIL, and the cluster application becomes Faulted or Inconsistent.

When the node that is not restarted meets the following conditions, the cluster application becomes Inconsistent.

When the node does not meet the above conditions, the cluster application becomes Faulted.

Resolution

  1. If a node is stopped, start it.

  2. Check that the class is not closed on the node which has not been restarted.

    Execute the following command on the node which has not been restarted. If "yes" is displayed in the DOWN field, the class is closed.

    # /etc/opt/FJSVsdx/bin/sdxdcdown
  3. Restore the class if it is closed in step 2.

    If the class is not closed, this procedure is not required.

    3-1. If the disk has failed, swap the disk to restore it.

    3-2. Check that both nodes are started, and then restore the closed class on the node which has not been restarted.

    Execute the following command.

    # sdxfix -C -c class_name

    3-3. Cancel the "Lock volume" of the netmirror volume.
    Execute the following command on both nodes to check the "Lock volume" attribute.

    # sdxinfo -V -c class_name -e long

    The "Lock volume" attribute is displayed in the LOCK field.

    If the "Lock volume" attribute is "on", execute the following command on the node to cancel the "Lock volume."

    # sdxattr -V -c class_name -v volume_name -a lock=off

    3-4. Perform the synchronization copying if an INVALID slice exists in the netmirror volume.

    Execute the following command on any one node.

    # sdxcopy -B -c class_name -v volume_name

    Note

    If the system time differs between the nodes in the cluster system, the synchronization copying may not be performed.
    To recover from this error, synchronize the time on the nodes and then restart both nodes.

  4. If the cluster application is in Faulted or Inconsistent state, clear the Faulted or Inconsistent state.

    For how to clear the Faulted or Inconsistent state, see "PRIMECLUSTER Installation and Administration Guide."
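    In PRIMECLUSTER RMS, the Faulted state of a cluster application is typically cleared with the hvutil command; the procedure in the above guide takes precedence. The cluster application name app1 below is only an example:

    # hvutil -c app1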

  5. Check the class resource state. If it is OFF-FAIL, restore it.

    • If the started cluster application does not exist

      Restart the cluster application on the node which has not been restarted.

      If you want to start the cluster application on the restarted node, restart the node which has not been restarted.

    • If only one started cluster application exists

      Restart the standby node.

    • If two or more cluster applications have been started

      1) If the operating node and the standby node exist on one node, switch the cluster application so that only the node which has not been restarted becomes the operating node.

      2) Switch back the cluster application switched in step 1) if necessary.

(3) The cluster application becomes Faulted or Inconsistent when the network error occurs.

Explanation

Resolution

  1. Restore the network.

    After the network is restored, the resynchronization copying is automatically performed.

    When the class is closed, the copying process may fail.

  2. Check the closed class.

    Execute the following command on both nodes. The class has been closed if "yes" is displayed in its DOWN field.

    # /etc/opt/FJSVsdx/bin/sdxdcdown
  3. Restore the class if it is closed in step 2.

    If the class is not closed, this procedure is not required.

    3-1. If the disk has failed, swap the disk to restore it.

    3-2. If the resynchronization copying is performed, cancel the copying process.

    Execute the following command on the node where the class is not closed to check the slice status of the netmirror volume.

    If the class is closed on both nodes, this procedure is not required.

    # sdxinfo -S -c closed_class_name

    When the COPY state slice exists in the netmirror volume, the resynchronization copying is ongoing.

    In this case, execute the following command on the node where the class is not closed to cancel the copying process.

    # sdxcopy -C -c closed_class_name -v volume_name

    3-3. Restore the closed class.

    Execute the following command on the node where the class is closed.

    # sdxfix -C -c closed_class_name

    While restoring the closed class, the synchronization copying is performed.

    The synchronization copying of the whole volume is performed even when both slices of the netmirror volume are already synchronized.

    3-4. If the synchronization copying is not performed in step 3-3, perform the synchronization copying.

    Execute the following command on any one node.

    # sdxcopy -B -c closed_class_name -v volume_name

    Note

    If the system time differs between the nodes in the cluster system, the synchronization copying may not be performed in steps 3-3 and 3-4. To recover from this error, synchronize the time on the nodes and then restart both nodes.

  4. Restore the state of the cluster application on the standby node.

    For how to clear the Faulted state, see "PRIMECLUSTER Installation and Administration Guide."

    To restore the Inconsistent state, take the same procedure as to clear the Faulted state.

  5. Check the class resource state. If it is OFF-FAIL, restore it.

    • If only one cluster application exists

      Restart the standby node.

    • If multiple cluster applications exist.

      1) If the operating node and the standby node exist on one node, switch the cluster application so that the operating node and the standby node do not exist on the same node.

      2) If the resynchronization copying is performed, and the copy source disk exists on the standby node at the same time, wait until the resynchronization copying is completed.

      3) Restart the standby node.

(4) The slice becomes INVALID or the operation stops after restarting the other node when the network error occurs.

Explanation

If the node is restarted while the network error persists, one of the errors described in the title of this item occurs: the slice becomes INVALID, or the operation stops.

Resolution

  1. If the network used for the mirroring among servers has an error, restore this network.

  2. Check the device status.

    Execute the sdxinfo -D command on both nodes.

    For disks that are connected with the netmirror group, check and record the following information.

    The information can be obtained from the output of the sdxinfo -D command. The recorded information is used in step 7 and step 8 later.

    • Check if an asterisk (*) is displayed in the DEVNAM field (physical disk name) of a connected disk. If an asterisk is displayed, record the value in the NAME field (SDX disk name) of the disk. In the example below, the value is d2.

    • Record the node name if an asterisk (*) is displayed in the DEVNAM field (physical disk name).

    Execution example

    # sdxinfo -D
    OBJ    NAME    TYPE      CLASS   GROUP   DEVNAM  DEVBLKS  DEVCONNECT       STATUS
    ------ ------- ------    ------- ------- ------- -------- ---------------- -------
    disk   d1      netmirror cl1     mdg1    sda     1015808  node1:node2      ENABLE
    disk   d2      netmirror cl1     mdg1    *       1015808  *                ENABLE 
  3. Restore the iSCSI device information.

    Execute the following command on the node where an asterisk (*) has been displayed in the DEVNAM field of the disk, which is connected with the netmirror group, in step 2 above.

    # /etc/opt/FJSVsdx/bin/sdxiscsi_ctl -F -e init
  4. Check if the class is closed on both nodes. Restore the class if it is closed.

    For details, see "D.1.4 Class Status Abnormality."

  5. Check the status of "Lock volume" of the netmirror volume.

    Execute the following command on both nodes.

    # sdxinfo -V -c class_name -e long

    The "Lock volume" attribute is displayed in the LOCK field.

  6. If the "Lock volume" attribute is "on" on any one of the nodes, take the following procedures to cancel the "Lock volume."

    • When the "Lock volume" attribute is "on" on all nodes

      Execute the following command on any one of the nodes in the class scope.

      # /etc/opt/FJSVsdx/bin/sdxnetdisk -S -c class_name
    • When the "Lock volume" attribute is "on" on one node

      Execute the following command on the node where the "Lock volume" attribute is "on."

      # sdxattr -V -c class_name -v volume_name -a lock=off

    Make sure that the "Lock volume" attribute is "off" on both nodes after executing the command.

    If the "Lock volume" attribute is not "off", perform the "Restoration Procedure" 1. to 3. of "c." in "7.16.8 Restoration when the Latest Disk Cannot Be Selected Automatically", and then take the above procedure again.

  7. Delete the iSCSI device information on both nodes.

    # rm /var/opt/FJSVsdx/log/.sdxnetmirror_disable.db
    # rm /var/opt/FJSVsdx/log/.sdxnetmirror_timestamp
  8. Check the slice status.

    Execute the following command on any one node. For the slice whose value in the DISK field (SDX disk name) is equal to the SDX disk name recorded in step 2, check the value in the STATUS field.

    The slice status is INVALID in the execution example below.

    Execution example

    # sdxinfo -S
    OBJ    CLASS   GROUP   DISK    VOLUME  STATUS
    ------ ------- ------- ------- ------- --------
    slice  cl1     mdg1    d1      m1      STOP
    slice  cl1     mdg1    d2      m1      INVALID
  9. If the slice status is ACTIVE or STOP when it is checked in step 8, take the following procedure.

    9-1) If an INVALID slice was found in the check in step 8 and this slice belongs to the disk of the node that is not recorded in step 2, stop RMS.
    Execute the following command on any node.

    # hvshut -a

    9-2) Restart the node recorded in step 2.

    9-3) If RMS is stopped in step 9-1), start RMS. Execute the following command on any node.

    # hvcm -a
  10. If the INVALID slice exists in the netmirror volume, restore the slice.

    Execute the following commands on any one node.

    Use the -d option to specify the disk that contains the INVALID slice. If the volume is not started on any node, the synchronization copying is performed when the volume is started.

    # sdxswap -O -c class_name -d disk_name
    # sdxswap -I -c class_name -d disk_name
  11. Restore the state of the cluster application.

    For how to clear the Faulted state, see "PRIMECLUSTER Installation and Administration Guide."

    To restore the Inconsistent state, take the same procedure as to clear the Faulted state.

  12. Check the cluster resource state. If it is OFF-FAIL, restore it.

    • If the started cluster application does not exist

      Restart the cluster application on the node which has not been restarted.

      If you want to start the cluster application on the restarted node, restart the node which has not been restarted.

    • If only one started cluster application exists

      Restart the standby node.

    • If two or more cluster applications have been started

      1) If the operating node and the standby node exist on one node, switch the cluster application so that only the node which has not been restarted becomes the operating node.

      2) Switch back the cluster application switched in step 1) if necessary.

  13. When the service is stopped, restart the service if necessary.

(5) The resource of the class where the netmirror volume exists becomes OFF-FAIL.

Explanation

When the node cannot access the netmirror volume due to an error, the class resource in the node becomes OFF-FAIL.

Resolution

  1. Under the following circumstances, restart the node where the OFF-FAIL class resource exists:

    • Both nodes were restarted just before this error occurred.

    • All the disks in at least one netmirror group are in either of the following statuses:

      • The disk of the node where the OFF-FAIL class resource exists is in SWAP state or DISABLE state.

      • On the node where the OFF-FAIL class resource exists, an asterisk (*) is displayed in the DEVNAM field of the sdxinfo -D command output.

  2. Restore the error state of the GDS object according to the troubleshooting described in the following sections:

    (1) to (4), and (6) in "D.1.6 Errors in the Mirroring Among Servers"

    "D.1.4 Class Status Abnormality"

    "D.1.2 Disk Status Abnormality"

    The class resource is restored successfully once its state returns to the normal status (ON or OFF-STOP). If an OFF-FAIL class resource still exists, proceed to step 3 and the following steps.

  3. If the OFF-FAIL class resources exist on both nodes, take the following steps to restore their status.

    3-1) Select the node that is to be restarted.

    3-2) Start the cluster application that matches the following condition on the node that is not to be restarted.

    Condition: The cluster resource is in the OFF-FAIL state on the node that is not to be restarted.

    3-3) If only one ACTIVE slice or only one STOP slice exists in the netmirror volume, and this slice is the slice existing on the disk of the node that is selected in step 3-1), restore the INVALID slice or the NOUSE slice in the netmirror volume.
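    One way to restore an INVALID slice or a NOUSE slice is to detach and then reattach the disk that contains the slice, in the same way as step 10 of "(4) The slice becomes INVALID or the operation stops after restarting the other node when the network error occurs." This is only an outline; class_name and disk_name are placeholders, and if the disk is already detached (the slice is NOUSE), only the reattachment is typically needed.

    # sdxswap -O -c class_name -d disk_name
    # sdxswap -I -c class_name -d disk_name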

    3-4) Restart the node selected in step 3-1).

  4. If the OFF-FAIL class resource exists on only one of the nodes, take the following steps to restore its state.

    Restore the class resource state according to whichever of the conditions described in steps 4a) to 4c) applies.

    4a) If all the cluster applications stop

    Take any one of the following steps to restore the state.

    • On the node, start the cluster application that contains the OFF-FAIL class resource.

    • Restart the node on which the class resource is in OFF-FAIL state.

    4b) If the node where the OFF-FAIL class resource exists is the operating node

    On the node, start the cluster application that contains the OFF-FAIL class resource.

    4c) If the node where the OFF-FAIL class resource exists is not the operating node

    Take the following steps to restore the state.

    4c-1) If only one ACTIVE slice or only one STOP slice exists in the netmirror volume, and this slice is the slice existing on the disk of the node where the OFF-FAIL class resource exists, restore the INVALID slice or the NOUSE slice in the netmirror volume.

    4c-2) Restart the node where the OFF-FAIL class resource exists.

(6) The class is closed on the last started node at the startup of both nodes.

Explanation

At the startup of both nodes, if an error has occurred in the network that is used for the mirroring among servers, the class may be closed on the last started node.

At this time, the class resource may become OFF-FAIL also on the first started node. In this case, the cluster application cannot be started.

Resolution

Take different procedures depending on whether the cluster application is started and whether the network can be restored soon.

(a) If the cluster application is started

  1. Stop the node where the class is closed.

  2. Restore the network that is used for the mirroring among servers. After that, take the "Restoration Procedure" described in "7.16.7 Restoration to 2-node Operation after Operation on Only One Node."

(b) If the cluster application is stopped and the network can be restored soon

  1. Restore the network that is used for the mirroring among servers.

  2. Restart both nodes.

(c) If the cluster application is stopped but the network cannot be restored soon

  1. Select one of the following procedures:

    • Restart the operation after restoring the network.
      The operation status can be restored after restoring the network and restarting both nodes.
      The data is not restored at this time. Instead, the operation is restarted by using the latest data stored in the disk of the node where the class is closed.

    • Restore the data by the backup data to start the operation.
      In this case, go on to step 2. or later.

  2. Stop the node where the class is closed.

  3. Restore the INVALID state of the slice.
    Execute the following command. For the -d option, specify, among all the disks belonging to the netmirror group, the name of the disk that is connected to this node.

    # sdxfix -V -c class_name -v netmirror_volume_name -d disk_name -e force -x NoRdchk
  4. Start the netmirror volume.

    # sdxvolume -N -c class_name -v netmirror_volume_name -e unlock
  5. Restore the data of the netmirror volume.
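    For example, when the data has been backed up to a file with the dd command, it can be restored by copying it back to the started netmirror volume. The device path and the backup file name below are only examples:

    # dd if=/backup/netmirror_volume_name.bak of=/dev/sfdsk/class_name/dsk/netmirror_volume_name bs=1M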

  6. Stop the netmirror volume.

    # sdxvolume -F -c class_name -v netmirror_volume_name
  7. If the cluster application is in Faulted state, clear the Faulted state.
    For how to clear the Faulted state, see "PRIMECLUSTER Installation and Administration Guide."

  8. Start the cluster application forcibly.
    For how to forcibly start the cluster application, see "PRIMECLUSTER Installation and Administration Guide."

  9. Restore the network that is used for the mirroring among servers. After that, take the "Restoration Procedure" described in "7.16.7 Restoration to 2-node Operation after Operation on Only One Node."

(7) The node is restarted while hot swap of disks is proceeding.

Explanation

When the node is restarted while performing the operation in "7.3.3.1 Hot Swap" for the disks used for the mirroring among servers, take the actions described below.

Resolution

The resolution depends on whether the restarted node is a node where disks are swapped or not.

Resolution when a node where disks are swapped is restarted

When the node where disks are swapped is restarted, take the actions described below.

The step numbers described below refer to the steps in "7.3.3.1 Hot Swap."

Timing of restarting and the corresponding resolution:

  • Step 1 proceeding

    After the node is restarted, check the slice status of a disk to be swapped.
    When the status is NOUSE, execute step 2 and the following steps. (*1)
    When the status is other than NOUSE, execute step 1 and the following steps. (*1)
    After step 12 is completed, restart the node where disks were swapped, and then execute step 13.

  • Step 1 completed, or steps 2 to 6 proceeding

    Execute step 2 and the following steps. (*1)
    After step 12 is completed, restart the node where disks were swapped, and then execute step 13.

  • Step 6 completed, or step 7 proceeding

    After the node is restarted, when the disk swapping is not yet completed, execute step 2 and the following steps. (*1)
    When the disk swapping is completed, execute step 8 and the following steps.
    After step 12 is completed, restart the node where disks were swapped, and then execute step 13.

  • Step 7 completed, or step 8 or 9 proceeding

    Resume the process that you were executing before the node was restarted.
    After step 12 is completed, restart the node where disks were swapped, and then execute step 13.

  • Step 9 completed, or steps 10 to 12 proceeding

    Execute step 10 and the following steps.
    After step 12 is completed, restart the node where disks were swapped, and then execute step 13.

  • Step 12 completed, or step 13 or 14 proceeding

    Execute step 13 and the following steps.

  • Step 14 completed, or step 15 proceeding

    After the node is restarted, check the slice status of a disk to be swapped.
    When the status is NOUSE, execute step 15.
    When the status is other than NOUSE, all steps are completed.

(*1) When the device on which "Swap Physical Disk" (or the sdxswap -O command) was executed has not been created, steps 2, 4, 5, and 6 are not necessary.

Resolution when a node where disks are not swapped is restarted

When the node where disks are not swapped is restarted, no disk that can be accessed normally remains, and the volume that belongs to the netmirror group in which the disk is being swapped can no longer be accessed.

Take the actions for resolution described below.

  1. Execute the steps in "Resolution" of "(2) The cluster application becomes Faulted or Inconsistent after the node is restarted." in "D.1.6 Errors in the Mirroring Among Servers", except for restoring the class resource (step 5). However, do not swap the disk (step 3-1).

  2. Resume the procedure for accessing the volume as necessary.

  3. Swap disks.

    The steps for swapping disks depend on the timing when the node where disks are not swapped is restarted.

    The step numbers described below refer to the steps in "7.3.3.1 Hot Swap."

    Timing of restarting and the corresponding steps for disk swapping:

    • Step 1 proceeding

      After the node is restarted, check the slice status of a disk to be swapped.
      When the status is NOUSE, execute step 2 and the following steps.
      When the status is other than NOUSE, execute step 1 and the following steps.

    • Step 1 completed, or steps 2 to 4 proceeding

      Execute step 2, then resume the process that you were executing before the node was restarted.

    • Step 4 completed, or steps 5 to 10 proceeding

      Resume the process that you were executing before the node was restarted.

    • Step 10 completed, or steps 11 to 14 proceeding

      Resume the process that you were executing before the node was restarted.

    • Step 14 completed, or step 15 proceeding

      After the node is restarted, check the slice status of a disk to be swapped.
      When the status is NOUSE, execute step 15.
      When the status is other than NOUSE, all steps are completed.

  4. Execute the class resource recovery (step 5) in "Resolution" in "(2) The cluster application becomes Faulted or Inconsistent after the node is restarted." of "D.1.6 Errors in the Mirroring Among Servers."

(8) The message regarding an unmount is output when both nodes stop.

Explanation

When both nodes stop, the following RMS wizard (RMSWT) messages regarding the Fsystem resource of the file system on the netmirror volume may be output to the /var/log/messages file.

NOTICE: umount mount_point failed with error code 1

WARNING: The file system mount_point was not unmounted.

Resolution

Even if these messages are output, the system is not affected. No corrective action is required.