Interstage Big Data Parallel Processing Server V1.0.1 User's Guide
FUJITSU Software

16.1.1 Collecting Troubleshooting Data

Collect the following information required for investigating a fault in the HA cluster system from the primary master server and the secondary master server.

  1. HA cluster troubleshooting data

  2. Crash dump

    If it is possible to collect a crash dump from the server where the fault occurred, collect it manually before restarting the server.
    The crash dump is particularly useful when the problem lies in the OS.

    Example: A switchover occurs because of an unexpected resource failure

    When switching of the cluster application has finished, collect the crash dump on the node where the resource failure occurred.

    Refer to "Crash dump" for information on the crash dump.

  3. If the fault can be reproduced, collect data (in any format) summarizing the procedure for reproducing it

Information

When reporting fault information to Fujitsu technical support, the information required to investigate the fault must be collected accurately. The collected information is used to analyze the problem and to reproduce the fault, so inaccurate information can make diagnosis take longer than necessary, or make reproduction impossible.

Collect the investigation data from the primary master server and the secondary master server promptly. In particular, the information collected by the fjsnap command is lost as time passes after the fault occurs, so special care is needed.


Executing the fjsnap command
  1. Log in to the primary master server and the secondary master server with root privileges.

  2. Execute the fjsnap command on each server.

    # /usr/sbin/fjsnap -a output <Enter>

    For output, specify the name of the output file for the error information collected by the fjsnap command.
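As a sketch, the two steps above can be combined into a small script run on each master server. The timestamped output name under /var/tmp is an assumption; fjsnap accepts any writable output path.

```shell
# Run on both the primary and secondary master servers as root.
# The timestamped file name is an assumption; any writable path works.
OUTFILE="/var/tmp/fjsnap_$(hostname)_$(date +%Y%m%d%H%M%S)"
if [ -x /usr/sbin/fjsnap ]; then
    /usr/sbin/fjsnap -a "$OUTFILE"
else
    echo "fjsnap not installed (FJSVsnap package missing?)" >&2
fi
```

A per-host, per-time file name keeps the archives from the two master servers distinct when they are gathered in one place.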

See

Refer to the README file included with the FJSVsnap package for information on the fjsnap command.

Information

Execution timing for the fjsnap command

For problems occurring during normal operation, such as when an error message is output, execute the fjsnap command immediately after the problem occurs.

Collect the crash dump if the fjsnap command cannot be executed because the system has stopped responding. After that, start in single-user mode and execute the fjsnap command. Refer to "Crash dump" for information on collecting the crash dump.

If the node restarts automatically after the problem occurs (so it cannot be started in single-user mode), or if it was mistakenly restarted in multiuser mode, execute the fjsnap command.

Collect the crash dump when the troubleshooting data cannot be collected because the fjsnap command ends in an error or does not return.


Executing the pclsnap command
  1. Log in to the primary master server and the secondary master server with root privileges.

  2. Execute the pclsnap command on each server.

    # /opt/FJSVpclsnap/bin/pclsnap {-a output | -h output} <Enter>

    When -a is specified, all detailed information is collected, so the data size is large. When -h is specified, only the cluster control information is collected.

    For output, specify the destination for the error information collected by the pclsnap command: either the special file name of the output medium (such as /dev/st0) or an output file name.

    If the output file name includes a directory and the path is relative to the current directory, begin the path with "./".
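For example, a relative output path with a directory component must start with "./"; the ./snap subdirectory here is a hypothetical layout, not a product requirement.

```shell
# Hypothetical example of a relative output path: because the file name
# contains a directory component, the path must begin with "./".
mkdir -p ./snap
OUT=./snap/output
if [ -x /opt/FJSVpclsnap/bin/pclsnap ]; then
    /opt/FJSVpclsnap/bin/pclsnap -a "$OUT"
fi
```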

See

Refer to the README file included with the FJSVpclsnap package for information on the pclsnap command.

Information

Execution timing for the pclsnap command

For problems occurring during normal operation, such as when an error message is output, execute the pclsnap command immediately after the problem occurs.

Collect the crash dump if the pclsnap command cannot be executed because the system has stopped responding. After that, start in single-user mode and execute the pclsnap command. Refer to "Crash dump" for information on collecting the crash dump.

If the node restarts automatically after the problem occurs (so it cannot be started in single-user mode), or if it was mistakenly restarted in multiuser mode, execute the pclsnap command.

Collect the crash dump when the troubleshooting data cannot be collected because the pclsnap command ends in an error or does not return.

Information

Available directory space required to execute the pclsnap command

The following table is a guide to the available directory space required to execute the pclsnap command.

Directory type        Default directory                        Free space (estimate)
-------------------   --------------------------------------   ---------------------
Output directory      Current directory at time of execution   300 MB
Temporary directory   /tmp                                     500 MB

Note

The estimates given above (300 MB and 500 MB) may be insufficient in some systems.

If information collection could not be completed successfully due to insufficient directory capacity, the pclsnap command outputs an error or warning message when it ends. In this case, take the action shown below and then re-execute the command.
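Before executing the command, the guideline values from the table above can be checked in advance. This is a hypothetical pre-check, not part of the product; `df -Pk` reports available space in KB, which is converted to MB here.

```shell
# Hypothetical pre-check of the guideline values from the table above:
# 300 MB in the output directory and 500 MB in /tmp.
check_space() {
    dir=$1; need_mb=$2
    # df -Pk column 4 is available KB; convert to MB
    avail_mb=$(df -Pk "$dir" | awk 'NR==2 {print int($4/1024)}')
    if [ "$avail_mb" -lt "$need_mb" ]; then
        echo "WARNING: $dir has ${avail_mb} MB free; ${need_mb} MB recommended"
    fi
}
check_space . 300      # output directory (current directory by default)
check_space /tmp 500   # temporary directory
```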


Action to take when there is insufficient capacity in the output directory

The following error message is output when the pclsnap command is executed but fails to generate an output file.

ERROR:  failed to generate the output file "xxx".
DIAG:  ...
Action:

Change the output directory to a location with a large amount of space available, then re-execute the command.

Example:

When /var/crash is the output directory:

# /opt/FJSVpclsnap/bin/pclsnap -a /var/crash/output <Enter>

Action to take when there is insufficient capacity in the temporary directory

The following warning message may be output when the pclsnap command is executed.

WARNING:  The output file "xxx" may not contain some data files.
DIAG:  ...

If this warning message is output, the pclsnap output file is generated, but it may not contain all the information that should have been collected.

Action:

Change the temporary directory to a location with a large amount of space available, then re-execute the command.

Example:

When the temporary directory is changed to /var/crash:

# /opt/FJSVpclsnap/bin/pclsnap -a -T /var/crash output <Enter>

If the same warning is output even after changing the temporary directory, investigate the following possible causes:

(1) The state of the system is causing the information collection command to timeout

(2) The files being collected are larger than the free space in the temporary directory


If the problem is (1), the timeout is logged in pclsnap.elog, one of the files included in the pclsnap output. Along with the pclsnap output file, collect the crash dump if possible.

If the problem is (2), check whether the size of (a) or (b) exceeds the capacity of the temporary directory.


(a) Log file size

- /var/log/messages

- Log files in /var/opt/SMAW*/log/ (such as SMAWsf/log/rcsd.log)

(b) Total size of the core file

- GFS core file: /var/opt/FJSVsfcfs/cores/*

- GDS core file: /var/opt/FJSVsdx/*core/*

If these are larger than the capacity of the temporary directory, move the relevant files to a partition separate from both the output directory and the temporary directory, then re-execute the pclsnap command. Keep the moved files; do not delete them.
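The evacuation described above might be scripted as follows. The destination path and the 100 MB size threshold are assumptions; the GDS core files actually live under the *core/ subdirectories of /var/opt/FJSVsdx.

```shell
# Move oversized GFS/GDS core files to a spare partition so pclsnap can
# be re-executed. The files are moved, not deleted, as required above.
# The 100 MB threshold and the destination path are assumptions.
evacuate_cores() {
    spare=$1                       # destination on a separate partition
    mkdir -p "$spare" || return 1
    for dir in /var/opt/FJSVsfcfs/cores /var/opt/FJSVsdx; do
        [ -d "$dir" ] || continue
        find "$dir" -type f -size +100M -exec mv {} "$spare"/ \;
    done
}
# e.g.: evacuate_cores /mnt/spare/evacuated
```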


Crash dump

In environments where Linux Kernel Crash Dump (LKCD), Netdump, or diskdump is installed, it is possible to collect the crash dump as the troubleshooting data.

Timing for collecting the crash dump
  • If an Oops occurs in the kernel

  • If a panic occurs in the kernel

  • If the <Alt> + <SysRq> + <C> keys were pressed at the system administrator console

  • When the NMI button is pressed on the main unit

The following describes how to collect the crash dump.

  1. How to collect the crash dump after a system panic
    First, check whether any crash dumps taken after the switchover exist in the directory where crash dumps are stored. If they do, collect them. If not, collect a crash dump manually if at all possible.

  2. How to collect crash dumps manually
    Use one of the following methods to collect the crash dumps in the directories where crash dumps are stored:

    - Press the NMI button on the main unit

    - Press the <Alt> + <SysRq> + <C> keys at the console
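On servers reached over a serial or SSH console where the key combination cannot be sent, the standard Linux SysRq interface offers an equivalent. This is generic Linux behavior, not product-specific, and the trigger line is deliberately left commented because it crashes the kernel immediately.

```shell
# Check whether the SysRq interface is enabled
# (a non-zero value means it is at least partially enabled):
if [ -r /proc/sys/kernel/sysrq ]; then
    cat /proc/sys/kernel/sysrq
fi
# Enable it for the current boot if it reports 0:
#   echo 1 > /proc/sys/kernel/sysrq
# Equivalent of pressing <Alt>+<SysRq>+<C> -- crashes the kernel at once
# to produce a dump, so it is commented out here:
#   echo c > /proc/sysrq-trigger
```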

Directory where the crash dump is saved

The crash dump is saved as a file either on the node where the fault occurred (LKCD or diskdump) or on the Netdump server (Netdump).

The directory where it is saved is /var/crash.
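The check in step 1 above (whether dumps newer than the switchover exist in /var/crash) can be sketched with find's -newer test. The timestamp file and the example time are assumptions; substitute the recorded switchover time.

```shell
# Create a reference file stamped with the recorded switchover time
# (the time shown is a placeholder) and list any dumps newer than it.
STAMP=/tmp/switchover_time
touch -d '2024-01-01 00:00' "$STAMP"   # GNU touch; set the real time here
if [ -d /var/crash ]; then
    find /var/crash -type f -newer "$STAMP"
fi
```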