Interstage Big Data Parallel Processing Server V1.0.0 User's Guide

6.1.2 Collecting Troubleshooting Data

This section describes how to collect the troubleshooting data that Fujitsu technical support needs to determine the causes of problems occurring on systems using this product.


This product provides the following troubleshooting data:

For problems related to "B.3.2 Other Messages" among the messages output to the Interstage Big Data Parallel Processing Server logs and the messages output to the system log ("/var/log/messages"), the troubleshooting data for the specific functions incorporated in this product must also be collected.

Refer to "6.1.2.3 Function-specific troubleshooting data" for information on how to collect function-specific data.


6.1.2.1 Interstage Big Data Parallel Processing Server logs

This section describes the Interstage Big Data Parallel Processing Server logs.


Interstage Big Data Parallel Processing Server logs are output for each process.

The process-specific logs are shown in the table below.

Refer to the information provided by Apache Hadoop for information on the logs specific to Apache Hadoop.

Table 6.1 Log types and storage locations

  Log type                                           | Log file and storage location
  ---------------------------------------------------+-------------------------------------------
  Install/uninstall logs                             | /var/tmp/bdpp_install.log
  Setup/unsetup logs                                 | /var/opt/FJSVbdpp/log/bdpp_setup.log
  Interstage Big Data Parallel Processing Server     | /var/opt/FJSVbdpp/log/bdpp_hadoop.log
  start and stop processing logs                     | /var/opt/FJSVbdpp/log/bdpp_pcl_hadoop.log
  Slave server addition and deletion processing logs | /var/opt/FJSVbdpp/log/bdpp_ROR.log
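When sending these logs to Fujitsu technical support, it can help to bundle them into a single archive. The following sketch is illustrative only (the helper name and archive path are our assumptions, not part of the product); it archives whichever of the Table 6.1 files exist:

```shell
#!/bin/sh
# collect_logs: archive whichever of the given log files exist.
# The helper name and archive path are illustrative, not part of the product.
collect_logs() {
  archive="$1"; shift
  existing=""
  for f in "$@"; do
    [ -f "$f" ] && existing="$existing $f"   # skip logs that were never written
  done
  [ -n "$existing" ] || return 1             # nothing to archive
  tar czf "$archive" $existing
}

# Example with the paths from Table 6.1:
# collect_logs /var/tmp/bdpp_logs.tar.gz \
#   /var/tmp/bdpp_install.log \
#   /var/opt/FJSVbdpp/log/bdpp_setup.log \
#   /var/opt/FJSVbdpp/log/bdpp_hadoop.log \
#   /var/opt/FJSVbdpp/log/bdpp_pcl_hadoop.log \
#   /var/opt/FJSVbdpp/log/bdpp_ROR.log
```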


6.1.2.2 Troubleshooting Data Collection Tools

This section describes the Interstage Big Data Parallel Processing Server troubleshooting data collection tools.


Execute the tools shown in the following table and collect the output results as troubleshooting data.

Table 6.2 Troubleshooting data collection tool types and storage locations

  Tool type                               | Tool and storage location
  ----------------------------------------+-------------------------------------------
  Installation configuration display tool | /opt/FJSVbdpp/bin/sys/bdpp_show_config.sh
  Installation status display tool        | /opt/FJSVbdpp/bin/sys/bdpp_show_status.sh
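The output of these tools must be captured and kept. One way to do this is a small wrapper like the sketch below (the function name and output directory are assumptions, not part of the product):

```shell
#!/bin/sh
# collect_tool_output: run a collection tool and save its output (stdout and
# stderr) to a timestamped file, printing the file name. Illustrative only.
collect_tool_output() {
  tool="$1"; outdir="$2"
  mkdir -p "$outdir" || return 1
  out="$outdir/$(basename "$tool").$(date +%Y%m%d%H%M%S).log"
  "$tool" > "$out" 2>&1
  echo "$out"
}

# Example with the tools from Table 6.2:
# collect_tool_output /opt/FJSVbdpp/bin/sys/bdpp_show_config.sh /var/tmp/bdpp_diag
# collect_tool_output /opt/FJSVbdpp/bin/sys/bdpp_show_status.sh /var/tmp/bdpp_diag
```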


6.1.2.3 Function-specific troubleshooting data

Troubleshooting data must be collected for each of the functions incorporated in this product.


6.1.2.3.1 If HA cluster setup or switching problems occur

This section explains how to collect troubleshooting data when there is a problem setting up an HA cluster or during switchover.


How to collect troubleshooting data

Collect the following information, required for investigating a fault in the HA cluster system, from the master server (primary) and the master server (secondary).

  1. HA cluster troubleshooting data

    • Use fjsnap to collect the information necessary to investigate the fault.
      Refer to "Executing the fjsnap command".

    • Collect the troubleshooting data for the system.

  2. Crash dump

    If it is possible to collect the crash dump from the server where the fault occurred, collect the crash dump manually before restarting the server.

The crash dump is useful when the problem lies in the OS.

    Example: A switchover occurs because of an unexpected resource failure

    Once switching of the cluster application has finished, collect the crash dump on the node where the resource failure occurred.

    Refer to "Crash dump" for information on the crash dump.

  3. Procedure for reproducing the fault, if reproduction is possible

Information

When reporting fault information to Fujitsu technical support, the information needed to investigate the fault must be collected correctly. The collected information is used to check the problem and to reproduce the fault, so if it is not accurate, reproducing and diagnosing the problem can take longer than necessary, or even become impossible.

Collect the information for the investigation promptly from the master server (primary) and the master server (secondary). In particular, the information collected by fjsnap disappears as time passes from the point when the fault occurred, so special care is needed.


Executing the fjsnap command
  1. Log in to the master server (primary) and the master server (secondary) with root permissions.

  2. Execute the fjsnap command on each server.

    # /usr/sbin/fjsnap -a output <Enter>

    In "output", specify the output file name for the error information collected by the fjsnap command.

See

Refer to the README file included with the FJSVsnap package for information on the fjsnap command.

Information

Execution timing for the fjsnap command

For problems occurring during normal operation, such as when an error message is output, execute the fjsnap command as soon as possible after the problem occurs.

Collect the crash dump if the fjsnap command cannot be executed due to a system hang-up. After that, start in single-user mode and execute the fjsnap command. Refer to "Crash dump" for details about collecting the crash dump.

If the node restarts automatically after a problem occurs (making it impossible to start in single-user mode), or if it is mistakenly restarted in multi-user mode, execute the fjsnap command anyway.

Collect the crash dump when troubleshooting data cannot be collected because the fjsnap command ends in an error or does not return.


Executing the pclsnap command
  1. Log in to the master server (primary) and the master server (secondary) with root permissions.

  2. Execute the pclsnap command on each server.

    # /opt/FJSVpclsnap/bin/pclsnap -a output <Enter>
    or
    # /opt/FJSVpclsnap/bin/pclsnap -h output <Enter>

    When -a is specified, all detailed information is collected, so the data size is large. When -h is specified, only the cluster control information is collected.

    In "output", specify either the special file name of the output medium (/dev/st0, etc.) or the output file name for the error information collected by the pclsnap command.

    If the output file name includes a directory path relative to the current directory, specify the path beginning with "./".

See

Refer to the README file included with the FJSVpclsnap package for information on the pclsnap command.

Information

Execution timing for the pclsnap command

For problems occurring during normal operation, such as when an error message is output, execute the pclsnap command as soon as possible after the problem occurs.

Collect the crash dump if the pclsnap command cannot be executed due to a system hang-up. After that, start in single-user mode and execute the pclsnap command. Refer to "Crash dump" for details about collecting the crash dump.

If the node restarts automatically after a problem occurs (making it impossible to start in single-user mode), or if it is mistakenly restarted in multi-user mode, execute the pclsnap command anyway.

Collect the crash dump when troubleshooting data cannot be collected because the pclsnap command ends in an error or does not return.

Information

Available directory space required to execute the pclsnap command

The following table is a guide to the available directory space required to execute the pclsnap command.

  Directory type      | Default directory                          | Free space (guide) (MB)
  --------------------+--------------------------------------------+------------------------
  Output directory    | Current directory at the time of execution | 300
  Temporary directory | /tmp                                       | 500

Note

The guide values given above (300MB and 500MB) may be insufficient in some systems.

If information collection could not be completed successfully due to insufficient space in the directory, the pclsnap command outputs an error or warning message when it ends. In this case, take the action shown below and then rerun the command.
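Free space can be checked before running the command. The sketch below (the function name is ours) reads the available space with df and compares it against the guide values:

```shell
#!/bin/sh
# check_free_mb: succeed only if the directory has at least the given number
# of megabytes free. The function name is illustrative, not part of the product.
check_free_mb() {
  dir="$1"; need_mb="$2"
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')   # POSIX df, KB units
  [ "$((avail_kb / 1024))" -ge "$need_mb" ]
}

# Guide values from the table above:
# check_free_mb .    300 || echo "output directory may be too small for pclsnap"
# check_free_mb /tmp 500 || echo "temporary directory may be too small for pclsnap"
```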

Action to take when there is insufficient capacity in the output directory

The following error message is output when the pclsnap command is executed but fails to generate an output file.

ERROR: failed to generate the output file "xxx".
DIAG: ...
Action:

Change the output directory to a location with a large amount of available space, then re-execute the command.

Example:

When /var/crash is made the output directory

# /opt/FJSVpclsnap/bin/pclsnap -a /var/crash/output <Enter>
Action to take when there is insufficient capacity in the temporary directory

The following warning message may be output when the pclsnap command is executed.

WARNING: The output file "xxx" may not contain some data files.
DIAG: ...

If this warning message is output, the output file for the pclsnap command is generated, but the output file may not contain all the information that was meant to be collected.

Action:

Change the temporary directory to a location with a large amount of available space, then re-execute the command.

Example:

When the temporary directory is changed to /var/crash

# /opt/FJSVpclsnap/bin/pclsnap -a -T/var/crash output <Enter>

If the same warning is output even after changing the temporary directory, investigate the following possible causes.

(1) The state of the system is causing the information collection command to time out

(2) The files being collected are larger than the free space in the temporary directory


If the problem is (1), the timeout is logged in pclsnap.elog, one of the files output by the pclsnap command. If possible, collect the crash dump along with the output file of the pclsnap command.

If the problem is (2), check whether the sizes of (a) or (b) below exceed the free space of the temporary directory.


(a) Log file size

- /var/log/messages

- Log files in /var/opt/SMAW*/log/ (SMAWsf/log/rcsd.log, etc.)

(b) Total size of the core file

- GFS core file: /var/opt/FJSVsfcfs/cores/*

- GDS core file: /var/opt/FJSVsdx/*core/*

If these are larger than the free space of the temporary directory, move the relevant files to a partition separate from both the output directory and the temporary directory, and re-execute the pclsnap command. Save the moved files; do not delete them.
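To check for case (2), the sizes of the files in (a) and (b) can be totalled and compared with the free space of the temporary directory, for example (the helper name is ours):

```shell
#!/bin/sh
# Rough size check for case (2): total the candidate files and compare with
# the free space of the temporary directory. Paths are those listed in (a)/(b).
total_kb() {
  # Sum the sizes (KB) of whichever of the arguments exist.
  du -sk "$@" 2>/dev/null | awk '{sum += $1} END {print sum + 0}'
}

used=$(total_kb /var/log/messages /var/opt/SMAW*/log/* \
               /var/opt/FJSVsfcfs/cores/* /var/opt/FJSVsdx/*core/*)
free=$(df -Pk /tmp | awk 'NR==2 {print $4}')
echo "candidate files: ${used} KB, free in /tmp: ${free} KB"
```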


Crash dump

In environments where Linux Kernel Crash Dump (LKCD), Netdump, or diskdump is installed, it is possible to collect the crash dump as troubleshooting data.

Timing for collecting the crash dump
  • If an Oops occurs in the kernel

  • If a panic occurs in the kernel

  • If the <Alt>+<SysRq>+<C> keys are pressed at the system administrator console

  • If the NMI button is pressed on the main unit

The following describes how to collect the crash dump.

  1. How to collect the crash dump after a system panic
    First check whether crash dumps from after the time of the switchover exist in the directory where crash dumps are stored. If such crash dumps exist, collect them. If not, collect the crash dumps manually as far as possible.

  2. How to collect crash dumps manually
    Use one of the following methods to collect the crash dumps into the directory where crash dumps are stored.

    - Press the NMI button on the main unit

    - Press the <Alt>+<SysRq>+<C> key at the console

Directory where the crash dump is saved

The crash dump is saved as a file either on the node where the fault occurred (LKCD or diskdump) or on the Netdump server (Netdump).

The directory where it is saved is /var/crash.
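Checking for crash dumps newer than the switchover, as described in step 1 above, can be scripted. The sketch below (the function name is ours) relies on the -newermt test of GNU find:

```shell
#!/bin/sh
# list_new_dumps: list entries in the crash dump directory modified after a
# given time (any date string GNU find's -newermt accepts). Illustrative only.
list_new_dumps() {
  dumpdir="$1"; since="$2"
  find "$dumpdir" -mindepth 1 -maxdepth 1 -newermt "$since" 2>/dev/null
}

# Example: crash dumps saved after a switchover at 16:30 on 2012-06-12
# list_new_dumps /var/crash "2012-06-12 16:30"
```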


6.1.2.3.2 If cloning image creation or cloning problems occur when adding or deleting slave servers

The following describes how to collect troubleshooting data if cloning image creation or cloning problems occur when adding or deleting slave servers.

Refer to "Chapter 15 Troubleshooting" in the "ServerView Resource Orchestrator Virtual Edition V3.0.0 Operation Guide" for details.

Types of troubleshooting data

Collect the troubleshooting data when a problem occurs on a system where this product is being used, so that Fujitsu technical support can investigate the problem.

There are two types of troubleshooting data. Collect the data required for the purposes described below.

  1. Collecting initial troubleshooting data

    Collect the data required for initial triage of the cause of the problem that occurred and contact Fujitsu technical support.
    The amount of information collected is small, so it can easily be sent by means such as email.
    Refer to "Collecting initial troubleshooting data" for details.

  2. Collecting detailed troubleshooting data

    It is sometimes possible to determine the cause using just the initial troubleshooting data, but some problems require more troubleshooting data.

    In such cases it is necessary to collect more detailed troubleshooting data. Detailed investigation involves collecting a large number of resources needed to determine the cause of the problem that occurred.

    Consequently, the information collected will be larger than the initial troubleshooting data collected to triage the problem.

    If requested by Fujitsu technical support, send the detailed troubleshooting data that has been collected.

    Refer to "Collecting detailed troubleshooting data" for details.

Note

Collect the troubleshooting data promptly when a problem occurs. The information required to investigate a problem disappears as time passes.


Collecting initial troubleshooting data

This section describes how to collect the troubleshooting data needed to triage the cause of a problem.

How to collect troubleshooting data

Collect the initial troubleshooting data using the following procedure.

Collect the resources using the appropriate method, depending on the characteristics of the collection methods and the environment and system where the problem occurred.

  • Collecting resources from the master server

    Collect troubleshooting data from the master server (rcxadm mgrctl snap -all).

    Troubleshooting data from the managed servers can be collected in a batch over the network, so this method is much simpler than executing the command on each individual managed server.

    Refer to "Collecting diagnostics data from the admin server (rcxadm mgrctl snap -all)" and collect the information.

    Along with the 65MB of available space required to execute the rcxadm mgrctl snap -all command, approximately 30MB of space is required for each server.

  • Collecting resources from the servers

    Collect troubleshooting data from the servers (rcxadm mgrctl snap, rcxadm agtctl snap).

    Refer to "Collecting resources from the servers (rcxadm mgrctl snap, rcxadm agtctl snap)" and collect the information.

    65MB of available space is required to execute the rcxadm mgrctl snap command.

    30MB of available space is required to execute the rcxadm agtctl snap command.

Collecting diagnostics data from the admin server (rcxadm mgrctl snap -all)

By executing the troubleshooting data collection command (rcxadm mgrctl snap -all) on the admin server, the troubleshooting data for the managed servers is collected in a batch.

The following describes collecting troubleshooting data with this command (rcxadm mgrctl snap -all).

How to collect

Use the following procedure to collect the resources on the admin server.

  1. Log in to the admin server as a user with OS administrator privileges.

  2. Execute the rcxadm mgrctl snap -all command.

    # /opt/FJSVrcvmr/bin/rcxadm mgrctl snap [-dir directory] -all  <Enter>
  3. Send the information collected to Fujitsu technical support.

Note

  • When collecting resources from the admin server, the manager must be operating on the admin server. If the manager is not operating, collect the resources on the individual servers.

  • Troubleshooting data cannot be collected from the managed servers in the following cases.

    - When a communications route has not been established

    - If there is a managed server that is stopped

In either case, collection of troubleshooting data on the other managed servers continues uninterrupted.

Check the command execution results in the execution log.

Refer to "rcxadm mgrctl" in the "ServerView Resource Orchestrator Virtual Edition V3.0.0 Command Reference" for details.

If collection fails on a managed server, either execute the rcxadm mgrctl snap -all command on the admin server again, or execute the rcxadm agtctl snap command on the managed server in question.

Collecting resources from the servers (rcxadm mgrctl snap, rcxadm agtctl snap)

Apart from the rcxadm mgrctl snap -all command, which is executed on the admin server and collects troubleshooting data in a batch from the managed servers, the rcxadm mgrctl snap and rcxadm agtctl snap commands collect information only on the server where they are executed.

The following describes collecting troubleshooting data with these commands (rcxadm mgrctl snap or rcxadm agtctl snap).

How to collect

Use the following procedure to collect the resources on the servers.

  1. Log in to the server with OS administrator privileges.

  2. Execute the rcxadm mgrctl snap or rcxadm agtctl snap command.
    Note that the command executed depends on the server where the resources are collected.

    When collecting on a master server

    # /opt/FJSVrcvmr/bin/rcxadm mgrctl snap [-dir directory] <Enter>

    When collecting on a slave server

    # /opt/FJSVrcxat/bin/rcxadm agtctl snap [-dir directory] <Enter>
  3. Send the information collected to Fujitsu technical support.

Refer to "rcxadm agtctl" or "rcxadm mgrctl" in the "ServerView Resource Orchestrator Virtual Edition V3.0.0 Command Reference" for details.

Collecting detailed troubleshooting data

This section describes how to collect the detailed troubleshooting data needed to determine the cause of a problem.

When the cause of the problem cannot be determined from the initial troubleshooting data alone, more detailed troubleshooting data is required.

How to collect troubleshooting data

The troubleshooting data required to determine the cause of a problem is collected by executing the troubleshooting data collection commands (rcxadm mgrctl snap -full and rcxadm agtctl snap -full) on the servers.

80MB of available space is required to execute this feature.

How to collect

On the server where resources are to be collected, use the following procedure to collect the resources.

  1. Log in to the server with OS administrator privileges.

  2. Execute the rcxadm mgrctl snap -full or rcxadm agtctl snap -full command.
    Note that the command executed depends on the server where the resources are collected.

    When collecting on a master server

    # /opt/FJSVrcvmr/bin/rcxadm mgrctl snap -full [-dir directory] <Enter>

    When collecting on a slave server

    # /opt/FJSVrcxat/bin/rcxadm agtctl snap -full [-dir directory] <Enter>

  3. Send the information collected to Fujitsu technical support.

Refer to "rcxadm agtctl" or "rcxadm mgrctl" in the "ServerView Resource Orchestrator Virtual Edition V3.0.0 Command Reference" for details.


6.1.2.3.3 If DFS setup or shared disk problems occur

This section explains how to collect troubleshooting data when there is a problem setting up a DFS or with the shared disk.

Refer to "4.6.2 Collecting DFS Troubleshooting Information" in the "Primesoft Distributed File System for Hadoop V1.0 User's Guide" for details.

Collecting DFS troubleshooting data

When requesting an investigation by Fujitsu technical support as part of the action taken in response to an output message, log in with root permissions and collect the following resources.

Collect the resources in a state as close as possible to the state when the phenomenon occurred.

If the information is collected after the phenomenon has ended or after the system has been restarted, the state of the system will have changed, which may make investigation impossible.

  1. Output results of the resource collection tools (pdfssnap and fjsnap)

  2. Crash dump

  3. Execution results of the pdfsck command

  4. Collecting the core image for the daemon


When it is necessary to send the troubleshooting data quickly, collect the following as initial troubleshooting data.

  1. Output results of the resource collection tool (pdfssnap)

  2. /var/log/messages*


After collecting the resources for initial investigation, make sure the other resources are also collected.


Output results of the resource collection tools (pdfssnap and fjsnap)

Use pdfssnap.sh and the fjsnap command to collect the troubleshooting data.

As far as possible, collect from all servers that share the DFS.

Executing pdfssnap.sh
# /etc/opt/FJSVpdfs/bin/pdfssnap.sh <Enter>

Note

With pdfssnap.sh, the troubleshooting data is output to the directory where the command is executed. For this reason, at least 100MB of free space must be available in the file system containing that directory.

Executing fjsnap
# /opt/FJSVsnap/bin/fjsnap -a output <Enter>

Specify any output file name in "output".

Collecting the crash dump

If there was a panic on the server, for example, also collect the crash dump file as part of the troubleshooting data.

It is normally saved in a directory named "/var/crash/time of panic" when the server is started after the panic. Collect it on all servers where a system panic occurred.


Execution results of the pdfsck command

Collect this if there is an inconsistency in the DFS and it needs to be restored.

# pdfsck -N -o nolog block special file of the representative partition <Enter>

Collecting the core image for the daemon

As part of the actions in response to DFS error messages, it may be necessary to collect core images as they relate to the various daemons.

Collect core images on all of the DFS admin servers.

The procedure is explained below using the example of collecting the core image of the pdfsfrmd daemon.

  1. Determining process ID

    Identify the process ID using the ps command. If the target is a daemon other than pdfsfrmd, change the argument of the grep command accordingly.

    # /bin/ps -e | /bin/grep pdfsfrmd <Enter>
    5639 ? 00:00:25 pdfsfrmd

    The first field of the output is the process ID of the pdfsfrmd daemon. Nothing is output if the pdfsfrmd daemon is not running. In that case, collect the core image on another server where it is running.
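The process ID lookup above can also be wrapped in a small function that fails cleanly when the daemon is not running (the function name is ours, not part of the product):

```shell
#!/bin/sh
# daemon_pid: print the process ID of the named daemon, failing (non-zero
# exit) if it is not running. The function name is illustrative.
daemon_pid() {
  name="$1"
  pid=$(ps -eo pid,comm | awk -v n="$name" '$2 == n {print $1; exit}')
  [ -n "$pid" ] && echo "$pid"
}

# Example:
# pid=$(daemon_pid pdfsfrmd) || echo "pdfsfrmd is not running on this server"
# (specify pdfsmg instead to find the MDS)
```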

    Information

    When collecting the MDS core image, specify pdfsmg in the argument of the grep command.

    See

    Refer to the online help for information on the ps and grep commands.

  2. Getting the core image

    Use the gcore command to collect the core image of pdfsfrmd to the /var/tmp/pdfsfrmd_node1.5639 file, then compress the file with the tar command.

    # /usr/bin/gcore -o /var/tmp/pdfsfrmd_node1 5639 <Enter>
    gcore: /var/tmp/pdfsfrmd_node1.5639 dumped
    # /bin/tar czvf /var/tmp/pdfsfrmd_node1.5639.tar.gz
    /var/tmp/pdfsfrmd_node1.5639 <Enter>
    # /bin/ls -l /var/tmp/pdfsfrmd_node1.5639.tar.gz <Enter>
    -rw-rw-r-- 1 root other 1075577 Jun 12 16:30 /var/tmp/pdfsfrmd_node1.5639.tar.gz

    See

    Refer to the online help for information on the tar command.


6.1.2.3.4 When a problem occurs in Hadoop

Execute "/opt/FJSVbdpp/products/HADOOP/bin/HADOOP-collect.sh" and collect the "collectinfo.tar.gz" file output to the current directory.

Format

HADOOP-collect.sh --servers=servername[,servername]

Options

--servers=servername[,servername]

Specify the host names of the master server (primary), master server (secondary), slave servers, and development server, separated by commas. Do not insert spaces.

Privilege Required/Execution Environment

Privileges

Operating system administrator

Execution environment

Master server

Example

# /opt/FJSVbdpp/products/HADOOP/bin/HADOOP-collect.sh --servers=master1,master2,slave1,slave2,slave3,slave4,slave5,develop <Enter>

Note

You will be prompted for a password during execution if SSH connections by the root user have not been set up between the server collecting the information and the destination servers.

To avoid password input and reduce the workload, distribute the public SSH key of the master server's root user to the slave servers and the development server so that SSH can be performed without a password. Refer to the help for the ssh-keygen command for details.
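Sketched out, the key setup described above might look as follows (the function name and host names are placeholders; ssh-copy-id is one common way to install the key on each server):

```shell
#!/bin/sh
# ensure_ssh_key: generate an RSA key pair for passwordless SSH if one does
# not already exist at the given path. Path and host names are illustrative.
ensure_ssh_key() {
  key="$1"
  [ -f "$key" ] || ssh-keygen -q -t rsa -N "" -f "$key"
}

# On the master server, as root (host names are placeholders):
# ensure_ssh_key /root/.ssh/id_rsa
# for host in slave1 slave2 slave3 slave4 slave5 develop; do
#   ssh-copy-id -i /root/.ssh/id_rsa.pub root@"$host"
# done
```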