11.4.2 Troubleshooting

The following figure shows the troubleshooting flow when a hardware or similar fault occurs.

Figure 11.3 Troubleshooting flow (when a fault occurs during replication)

Note

Refer to "11.4.1 Overview" for details of the Status column and "fault location".
If the Status column is "?????", check if the copy processing is in the error suspend status ("failed") or the hardware suspend status ("halt") using ETERNUS Web GUI.
If the copy processing is in either of these states, take the action indicated in the above troubleshooting flow.
In other cases, check the following points and take the corresponding action.
- If device information is unusual:
  Restore the device information.
- If a device is not accessible:
  Check if the device exists.
- If no dependency is configured between volumes and AdvancedCopy Manager service:
  Configure the dependency. For details, refer to "13.1.5 Notes on cluster operation".
- If there is anything unusual with Managed Server, switches, etc.:
  Contact a Fujitsu system engineer.

Check the error codes using either of the following two methods.

- Checking with swsrpstat (Operation status display command)
  Execute the command with the -O option.
- Checking with ETERNUS Web GUI
  1. On the [Display status] menu, click [Advanced Copy status display] in the status display.
  2. At "Session status", click the "Number of active sessions" link for the relevant copy type.
  3. Refer to the value in the "Error code" column of the relevant copy process.

The following table shows the meanings of the error codes.

Table 11.9 Meanings of error codes
Error code	Meaning
0xBA	If a) or b) below applies, a bad sector was created in the transaction volume. a) QuickOPC has not yet performed physical copying and tracking is in progress. b) EC/REC is in the suspend status (replication established status). Note: If a bad sector is created in a transaction volume when a) or b) applies, the ETERNUS Disk storage system automatically changes the copy processing to the error suspend state. This prevents a restart of QuickOPC or EC/REC resume and prevents the copy destination volume from being overwritten with invalid copy source volume data.
0xBB	A lack of free space has occurred in the Snap Data Volume or Snap Data Pool
Other than 0xBA and 0xBB	An error other than the above occurred.
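
The lookup in Table 11.9 can be sketched as a small helper. This is a minimal illustration, not part of the product: the function names are hypothetical, and the section numbers returned are an assumed reading of the subsection titles in this chapter (the troubleshooting flow figure is the authoritative source).

```python
from typing import NamedTuple


class ErrorCodeInfo(NamedTuple):
    meaning: str
    section: str  # assumed cross-reference, inferred from the subsection titles


def parse_error_code(text: str) -> int:
    """Convert a displayed code such as "0xBA" to an integer."""
    return int(text, 16)


def describe_error_code(code: int) -> ErrorCodeInfo:
    """Return the Table 11.9 meaning for a copy-session error code."""
    if code == 0xBA:
        return ErrorCodeInfo(
            "Bad sector created in the transaction volume", "11.4.2.2")
    if code == 0xBB:
        return ErrorCodeInfo(
            "Lack of free space in the Snap Data Volume or Snap Data Pool",
            "11.4.2.3")
    # Every other code falls into the catch-all row of the table.
    return ErrorCodeInfo("An error other than 0xBA and 0xBB", "11.4.2.1")


info = describe_error_code(parse_error_code("0xBB"))
print(info.meaning, "->", info.section)
```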

Error codes 0xBA and 0xBB are returned only for the following ETERNUS Disk storage system:

ETERNUS4000 and ETERNUS8000 (firmware version V11L30-0000 or later)

For ETERNUS Disk storage systems other than the above, the events indicated by error codes 0xBA and 0xBB are identified by the following methods:

Table 11.10 Error code events
Event	Identification method
Events indicated by 0xBA	These events do not occur. In cases a) and b) above, the copy status does not change even if a bad sector occurs at the copy source volume.
Events indicated by 0xBB	Use ETERNUS Web GUI to check the capacity already used on the Snap Data Volume in order to determine whether or not a lack of free space has occurred. 1. On the [Display status] menu, click [Volume list] in the status display. 2. Click the link to Snap Data Volume in the "Volume type" column of the relevant volume. 3. Refer to the value shown in the "Capacity already used" column. If this event applies, refer to "11.4.2.3 Troubleshooting when a lack of free space has occurred in the Snap Data Volume or Snap Data Pool".

11.4.2.1 Hardware error on a replication volume

When a hardware error occurs in a replication volume, repair the error using the following procedure.

1. Use swsrpcancel (Replication cancellation command) to cancel the processing in which the error occurred. If the processing cannot be cancelled from the operation server when inter-server replication is performed, cancel it from a non-operational server.
   If the processing cannot be cancelled by using the command, use ETERNUS Web GUI to cancel it.
2. Execute swsrprecoverres (Resource adjustment command).
3. Execute swsrpstat (Operation status display command) to verify that no other errors have occurred.
4. Use swsrpdelvol (Replication volume information deletion command) to delete the replication volume in which the error occurred.
5. Use swsrpsetvol (Replication volume information setting command) to register a new replication volume. If the replication volume in which the error occurred is repaired and reused, execute the option [Collect or reflect the information for a specific device] from the Web Console and store the information again in the replication volume.
6. Re-execute the processing in which the error occurred.
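
The cancellation order described at the start of this procedure (operation server first, then the non-operational server, then ETERNUS Web GUI) is a simple fallback chain. The sketch below only models that ordering; the function name is hypothetical and the lambdas are stand-ins for the real cancellation attempts, which are not invoked here.

```python
from typing import Callable, Sequence, Tuple

# A route is a label plus a callable that reports whether cancellation worked.
Route = Tuple[str, Callable[[], bool]]


def cancel_with_fallback(routes: Sequence[Route]) -> str:
    """Try each cancellation route in order and return the one that succeeded."""
    for label, attempt in routes:
        if attempt():
            return label
    raise RuntimeError("copy processing could not be cancelled by any route")


# Stub outcomes for illustration only (assumed, not real command results).
routes = [
    ("swsrpcancel on the operation server", lambda: False),
    ("swsrpcancel on the non-operational server", lambda: True),
    ("ETERNUS Web GUI", lambda: True),
]
print(cancel_with_fallback(routes))
```

Later routes are tried only when every earlier one fails, so the Web GUI remains the last resort.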

11.4.2.2 Troubleshooting if a bad sector occurred in the copy source volume

If a bad sector occurred in the copy source volume, use the following procedure to restore the copy source volume:

1. Use swsrpcancel (Replication cancellation command) to cancel the processing in which the error occurred.
   If inter-server replication was being performed and cancellation is not possible from the active server, cancel processing from the inactive server.
   If processing cannot be cancelled using commands, use ETERNUS Web GUI to cancel it.
2. Execute swsrpstat (Operation status display command) to check for other errors.
3. Restoration is performed by overwriting the area containing the bad sector. Select the appropriate method, in accordance with the usage or use status of the copy source volume, from the methods below.
- Restoration method 1
  If the area can be reconstructed from high-level software (file system, DBMS, or similar), reconstruct the area.
- Restoration method 2
  If the area containing the bad sector is an area that is not being used, such as an unused area or a temporary area, use a system command (for example, the UNIX dd command or the Windows format command) to write to the area.
- Restoration method 3
  Use swsrpmake (Replication creation command) to restore the data from the copy destination volume. (Restoration is also possible from the copy destination volume of the copy process for which the bad sector occurred.)
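
As a rough aid for choosing among the three restoration methods, the conditions above can be written as a filter. This sketch assumes three yes/no inputs that an operator would determine by inspection; the function and parameter names are hypothetical, and no preference order among the methods is implied.

```python
from typing import List


def applicable_restoration_methods(reconstructable_by_software: bool,
                                   area_unused: bool,
                                   destination_copy_available: bool) -> List[int]:
    """Return the restoration method numbers whose conditions are met."""
    methods = []
    if reconstructable_by_software:
        # Method 1: file system, DBMS, or similar can rebuild the area.
        methods.append(1)
    if area_unused:
        # Method 2: overwrite the unused or temporary area with a system command.
        methods.append(2)
    if destination_copy_available:
        # Method 3: restore from the copy destination volume with swsrpmake.
        methods.append(3)
    return methods


print(applicable_restoration_methods(False, True, True))
```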

11.4.2.3 Troubleshooting when a lack of free space has occurred in the Snap Data Volume or Snap Data Pool

A Snap Data Volume lack of free space occurs when the Snap Data Pool is not being used, whereas a Snap Data Pool lack of free space occurs when the Snap Data Pool is being used.
If a lack of free space occurs in the Snap Data Volume or Snap Data Pool, refer to the following sections and recover according to whether the Snap Data Pool is in use:

- When not using the Snap Data Pool: "Recovery of insufficient free space in Snap Data Volume"
- When using the Snap Data Pool: "Recovery of insufficient free space in Snap Data Pool"

Point

The use status of the Snap Data Pool can be checked by specifying the "poolstat" subcommand of swstsdv (Snap Data Volume operation/reference command).

Recovery of insufficient free space in Snap Data Volume

When a lack of free space has occurred in the Snap Data Volume, follow these steps to undertake recovery:

Cancel the processing in which the error occurred with swsrpcancel (Replication cancellation command).
If inter-server replication was being performed and cancellation is not possible from the active server, cancel processing from the inactive server.
If processing cannot be cancelled using commands, use ETERNUS Web GUI to cancel it.

The likely causes of a lack of free space in the Snap Data Volume are as follows:

a. The estimate of the physical size of the Snap Data Volume is not accurate.
b. The estimate of the physical size of the Snap Data Volume is accurate, but the physical capacity of the Snap Data Volume is being used up because a large amount of data was updated in the Snap Data Volume while no SnapOPC/SnapOPC+ session existed.

The usage status of the Snap Data Volume can be checked by specifying the "stat" subcommand of swstsdv (Snap Data Volume operation/reference command).

If "a." applies, re-estimate the physical size of the Snap Data Volume, and recreate the Snap Data Volume.
If "b." applies, initialize the Snap Data Volume either from ETERNUS Web GUI or by specifying the "init" subcommand of swstsdv (Snap Data Volume operation/reference command).

Recreation of the partition (slice) is required after recreation/initialization of the Snap Data Volume.

Recovery of insufficient free space in Snap Data Pool

When a lack of free space has occurred in the Snap Data Pool, follow these steps to undertake recovery:

Cancel the processing in which the error occurred with swsrpcancel (Replication cancellation command).
If inter-server replication was being performed and cancellation is not possible from the active server, cancel processing from the inactive server.
If processing cannot be cancelled using commands, use ETERNUS Web GUI to cancel it.

The likely causes of a lack of free space in the Snap Data Pool are as follows:

a. The estimate of the size of the Snap Data Pool is not accurate.
b. The estimate of the size of the Snap Data Pool is accurate, but the capacity of the Snap Data Pool is being used up because a large amount of data was updated in the Snap Data Volume while no SnapOPC/SnapOPC+ session existed.

The use status of the Snap Data Pool can be checked by specifying the "poolstat" subcommand of swstsdv (Snap Data Volume operation/reference command).

If "a." applies, re-estimate the size of the Snap Data Pool, and after increasing the size of the Snap Data Pool, recreate the Snap Data Volume.

If "b." applies, initialize the Snap Data Volume either from ETERNUS Web GUI or by specifying the "init" subcommand of swstsdv (Snap Data Volume operation/reference command).

Recreation of the partition (slice) is required after recreation/initialization of the Snap Data Pool.
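
The two recovery procedures differ only in which resource is re-estimated, so they can be summarised together. The following is a sketch of that decision, assuming the operator already knows whether the Snap Data Pool is in use and which cause ("a." or "b.") applies; the names are hypothetical and the steps are returned as text rather than executed.

```python
from enum import Enum
from typing import List


class Cause(Enum):
    INACCURATE_ESTIMATE = "a"      # cause "a." in the lists above
    UPDATES_WITHOUT_SESSION = "b"  # cause "b." in the lists above


def recovery_plan(uses_pool: bool, cause: Cause) -> List[str]:
    """List the recovery steps for a lack of free space, in order."""
    steps = ["Cancel the failed processing with swsrpcancel"]
    if cause is Cause.INACCURATE_ESTIMATE:
        if uses_pool:
            steps.append("Re-estimate and increase the size of the Snap Data Pool")
        else:
            steps.append("Re-estimate the physical size of the Snap Data Volume")
        steps.append("Recreate the Snap Data Volume")
    else:
        steps.append("Initialize the Snap Data Volume "
                     "(ETERNUS Web GUI or the swstsdv init subcommand)")
    steps.append("Recreate the partition (slice)")
    return steps


for step in recovery_plan(uses_pool=True, cause=Cause.INACCURATE_ESTIMATE):
    print(step)
```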

11.4.2.4 Error (halt) on a remote copy processing

The REC restart (Resume) method varies, depending on the halt status.

Execute swsrpstat (Operation status display command) with the -H option specified to check the halt status, and then implement the relevant countermeasure.

- For "halt(use-disk-buffer)" or "halt(use-buffer)"
  This status means that data is saved to the REC Disk buffer or REC buffer because data cannot be transferred due to a path closure (halt).
  In order to restart REC, perform path recovery before a space shortage occurs for the REC Disk buffer or REC buffer.
  After recovery, the ETERNUS Disk storage system restarts REC automatically.
  If a space shortage has already occurred for the REC Disk buffer or REC buffer, the "halt(sync)" or "halt(equivalent)" status shown below occurs. Implement the countermeasures for that status.
- For "halt(sync)" or "halt(equivalent)"
  This status means that data transfer processing was discontinued due to a path closure (halt).

The REC restart method differs for different REC Recovery modes.

For the Automatic Recovery mode

1. Remove the cause that made all paths close (halt).
2. The ETERNUS Disk storage system automatically restarts (Resume) REC.

For the Manual Recovery mode

1. Remove the cause that made all paths close (halt).

2. Use swsrpmake (Replication creation command) to forcibly suspend the REC that is in the halt status.

   [For volume units]
   swsrpmake -j <replication source volume name> <replication destination volume name>

   [For group units]
   swsrpmake -j -Xgroup <group name>

3. Use swsrpstartsync (Synchronous processing start command) to restart (Resume) the REC. The -t option must be specified if REC is being restarted after a forcible suspend.

   [For volume units]
   swsrpstartsync -t <replication source volume name> <replication destination volume name>

   [For group units]
   swsrpstartsync -t -Xgroup <group name>
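
The halt handling in this subsection can be condensed into one decision function. This is a minimal sketch under stated assumptions: the function names and the boolean recovery-mode flag are hypothetical, the volume names in the example are placeholders, and the commands are only assembled as text from the options shown above (-j, -t, -Xgroup), never executed.

```python
from typing import List, Optional

BUFFERING = {"halt(use-disk-buffer)", "halt(use-buffer)"}
STOPPED = {"halt(sync)", "halt(equivalent)"}


def target_args(source: Optional[str] = None,
                destination: Optional[str] = None,
                group: Optional[str] = None) -> List[str]:
    """Build the volume-unit or group-unit arguments shared by both commands."""
    if group is not None:
        return ["-Xgroup", group]
    if source is not None and destination is not None:
        return [source, destination]
    raise ValueError("give either a group or a source/destination pair")


def rec_halt_steps(halt_status: str, manual_recovery: bool,
                   target: List[str]) -> List[str]:
    """Return the ordered steps for the given halt status and recovery mode."""
    if halt_status in BUFFERING:
        return ["Recover the path before the REC Disk buffer or REC buffer "
                "runs out of space; REC then restarts automatically."]
    if halt_status not in STOPPED:
        raise ValueError("unknown halt status: " + halt_status)
    steps = ["Remove the cause that made all paths close (halt)."]
    if manual_recovery:
        # Forcible suspend, then resume with -t as required after a forcible suspend.
        steps.append(" ".join(["swsrpmake", "-j", *target]))
        steps.append(" ".join(["swsrpstartsync", "-t", *target]))
    return steps


# Placeholder volume names, for illustration only.
for step in rec_halt_steps("halt(sync)", True, target_args("SRCVOL", "DSTVOL")):
    print(step)
```

In the Automatic Recovery mode the function returns only the first step, because the storage system resumes REC by itself once the paths are restored.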