If any of the errors listed below occur on the master server during execution of jobs, those jobs will be interrupted.
Tasks may remain stopped for a long time, until the master server has recovered.
System error (physical machine)
System error (virtual machine)
Public LAN network error
Cluster interconnect (CIP) error
iSCSI network error
JobTracker error
The following section explains the corrective action to take after an error occurs on the master server.
Figure 15.2 Procedure to resume tasks after an error occurs in a non-replicated configuration
Refer to the system log of the master server, and remove the cause of the error.
If a serious error has occurred on the master server and the server must be rebuilt, recover the master server.
The restore feature of this product can be used to return the system configuration and the master server definition information to their normal running state.
Refer to "14.2.1.1 Restoring a Master Server, Development Server, or Collaboration Server" for information on the procedure to restore the master server.
Point
A restore requires a backup of the master server that was created beforehand, while the server was running normally.
Refer to "14.1.2.1 Backing Up a Master Server, Development Server, or Collaboration Server" for information on the procedure to back up the master server.
Restart the master server that was recovered. When restarting the master server, the DFS must be unmounted and then remounted on all DFS clients (slave servers, development servers, and collaboration servers).
Use the following procedure to restart the master server:
Unmount the DFS on all slave servers, development servers, and collaboration servers.
Example
If the logical file system name of the DFS is pdfs1:
# umount pdfs1 <Enter>
Restart the master server.
If the master server is configured so that the DFS is not mounted automatically when the master server is started, restart it and then mount the DFS manually.
Mount the DFS on all slave servers, development servers, and collaboration servers.
Example
If the logical file system name of the DFS is pdfs1:
# mount pdfs1 <Enter>
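Because the unmount and mount steps must be repeated on every DFS client, they can be scripted. The following is a minimal sketch, not part of this product: it assumes that the DFS clients are reachable over ssh as a user permitted to run umount and mount, and the function name run_on_clients, the DRY_RUN variable, and the host names in the usage note are hypothetical. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: the use of ssh and all host names are assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

# run_on_clients ACTION FSNAME HOST...
# ACTION is "umount" or "mount"; FSNAME is the logical file system
# name of the DFS (pdfs1 in the examples above).
run_on_clients() {
    action="$1"; fsname="$2"; shift 2
    case "$action" in
        umount|mount) ;;
        *) echo "unsupported action: $action" >&2; return 2 ;;
    esac
    for host in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh $host $action $fsname"
        else
            # Stop at the first client on which the command fails.
            ssh "$host" "$action $fsname" || return 1
        fi
    done
}
```

For example, "run_on_clients umount pdfs1 slave1 slave2 develop1" would be run before restarting the master server, and "run_on_clients mount pdfs1 slave1 slave2 develop1" after it is back up. Set DRY_RUN=0 only after checking the printed commands.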
Start Hadoop on the master server that was recovered.
Always use the bdpp_start command to start Hadoop.
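If the DFS has to be mounted manually after the restart, it may help to confirm that it is mounted before starting Hadoop. The following is a minimal sketch under stated assumptions: it runs on a Linux master server where /proc/mounts lists the DFS mount point, the mount point passed in is site-specific, bdpp_start is assumed to be on the PATH, and the names is_mounted, start_hadoop, MOUNTS_FILE, and DRY_RUN are hypothetical. The check itself is an added precaution and is not described by this product.

```shell
#!/bin/sh
# Sketch only: the mount point and the pre-start check are assumptions.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"  # mount table; field 2 is the mount point
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print the command instead of running it

# is_mounted MOUNTPOINT: succeeds if MOUNTPOINT appears in the mount table.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

# start_hadoop MOUNTPOINT: runs bdpp_start only when the DFS is mounted.
start_hadoop() {
    if ! is_mounted "$1"; then
        echo "DFS is not mounted on $1; mount it before starting Hadoop" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "bdpp_start"
    else
        bdpp_start   # assumed to be on PATH; adjust to the installed location
    fi
}
```

For example, "start_hadoop /mnt/pdfs" (with a hypothetical mount point) prints the command in dry-run mode, or returns a non-zero status with a message if the DFS is not mounted.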
After the master server has fully recovered, execute jobs as required and resume tasks.