Systemwalker Operation Manager Troubleshooting Guide
FUJITSU Software

5.3.7 The Job is not Executed, and a Delay Occurs

Applicable versions and levels

Check all applicable actions below to resolve the issue.

Action 1

Cause

The limit for the number of jobs that can be executed simultaneously has been reached. The job waits until the number falls below the limit (this can be specified for the entire system, or per individual queue) before being executed.

Action method

Review the limit set for the number of jobs that can be executed simultaneously.

Check the limit set for the entire system in the following window:

Check the limit set per individual queue in the following window:

Additionally, consider the following operations:

Action 2

Cause

The queue has stopped.

Action method

The job can be executed by starting the job execution queue.

Check the queue status by displaying the View Queue Status/Operate window or by submitting the qstat command. Then, to start the queue that has stopped, use the start operation on the window, or use the qstart command.

Action 3

Cause

If the version level is V5.0L20/5.1 or earlier, or if Disable simultaneous execution of jobs with an identical name is selected in a version level of V5.0L30/5.2 or later, multiple jobs with the same name are not executed at the same time. A job requested later does not start until the job requested first completes, so a delay occurs. (However, in a subsystem environment, this control is applied per subsystem, so jobs that share a job name but run in different subsystems are executed at the same time.)

Action method

Change the job name so that multiple jobs do not have the same job name.

To execute jobs with the same job name at the same time, take one of the following actions (V5.0L20/5.1 or earlier version levels cannot be used for this):

To enable the definition, restart the Job Execution Control service/daemon.

Information

If the "Job Name (J)" field is not entered, the job name is taken from the path name specified in "Command Name (C)", with the directory path name and the extension part removed.

Action 4

Cause

This job is trying to allocate the resource already allocated to the job that is currently being executed. The job in the queue will not be executed until the job that is currently being executed and is using the same resource completes.

Action method

Display the View Job Status/Operate window or submit the qjstat command to check jobs with an executing status, and jobs that have not been executed. Then, for each job, compare the resource definitions in Resource name in the Standard information tab of the Monitor - Job window.

Action 5 (UNIX version V17.0.0 or later)

Cause

In the distributed execution function, a host that fails to connect is returned to the configured hosts after a fixed period of five minutes (five minutes after the host that failed to connect is determined to be down).

With this 5-minute return time, the longer the connection timeout period is (for example, while a host is down or rebooting), the more likely it is that a reconnection is attempted to a host that is still down and fails. If the reconnection fails and the attempts to connect to the following hosts also fail, the attempts are repeated only among the hosts that are down, and the job is not submitted to another host that could run it.

In the following example, the timeout period for a connection is set to 5 minutes (300 seconds), and the timeout period is set to 4 minutes (the default value for Solaris) in the OS settings, as described in Action 7 of "5.3.2 Failed to Execute the Network Job (Error Message:MJS881S is Output)".

In this example, configuration hosts 1 - 3 fail to connect because they are down, but configuration host 1 returns after 5 minutes, before the connection attempts reach configuration host 4. Therefore, a reconnection is attempted to configuration host 1, which has returned.

As a result, reconnections are repeated among the down configuration hosts 1 - 3, connectivity to configuration host 4 is never checked, and the jobs do not run.

Note)
If the OS setting value is smaller than the connection timeout value set by Action 7 of "5.3.2 Failed to Execute the Network Job (Error Message:MJS881S is Output)", the operation times out with the OS setting value. Therefore, in this example, the timeout occurs in 4 minutes.

Action method

Change the return time, which is normally 5 minutes, so that connections to more hosts can be attempted to avoid repeated reconnections between down hosts.

You can specify a return time between 5 and 60 minutes.

Here is an example solution.

In this example, the return time is 15 minutes (900 seconds).
Even if the connection attempts to configuration hosts 1 - 3 all fail, the total timeout is 12 minutes (3 hosts x 4 minutes), so the connection to configuration host 4 is checked and the job is executed before the 15-minute return time of configuration host 1 elapses.

Note)
If the OS setting value is smaller than the connection timeout value set by Action 7 of "5.3.2 Failed to Execute the Network Job (Error Message:MJS881S is Output)", the operation times out with the OS setting value. Therefore, in this example, the timeout occurs in 4 minutes.
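The two examples can be compared with a small simulation. This is only a simplified model, not the product's actual destination determination logic: it assumes that the lowest-numbered host that is not currently excluded is always tried first, and that a down host is excluded for the return time counted from the moment its connection attempt times out. The function name simulate_dispatch and its arguments are hypothetical.

```shell
#!/bin/sh
# simulate_dispatch HOSTS DOWN TIMEOUT REVIVAL MAX_ATTEMPTS
#   HOSTS        number of configured hosts (hosts 1..DOWN are assumed down)
#   TIMEOUT      effective connection timeout in seconds
#   REVIVAL      return time in seconds
#   MAX_ATTEMPTS stop the simulation after this many connection attempts
simulate_dispatch() {
    hosts=$1; down=$2; timeout=$3; revival=$4; max=$5
    now=0; attempt=0
    # ret_N holds the time at which host N becomes a candidate again.
    h=1
    while [ "$h" -le "$hosts" ]; do eval "ret_$h=0"; h=$((h + 1)); done
    while [ "$attempt" -lt "$max" ]; do
        # Pick the lowest-numbered host that is not currently excluded.
        pick=0; h=1
        while [ "$h" -le "$hosts" ]; do
            eval "ret=\$ret_$h"
            if [ "$ret" -le "$now" ]; then pick=$h; break; fi
            h=$((h + 1))
        done
        if [ "$pick" -eq 0 ]; then
            echo "no candidate host at ${now}s"
            return 1
        fi
        attempt=$((attempt + 1))
        if [ "$pick" -gt "$down" ]; then
            echo "job submitted to host $pick at ${now}s"
            return 0
        fi
        # The attempt times out, and the host is excluded for the return time.
        now=$((now + timeout))
        eval "ret_$pick=$((now + revival))"
    done
    echo "gave up after $max attempts; only down hosts were tried"
    return 1
}

# 4 hosts, hosts 1 - 3 down, 4-minute timeout
simulate_dispatch 4 3 240 300 12 || true   # 5-minute return time
simulate_dispatch 4 3 240 900 12           # 15-minute return time
```

Under these assumptions, the 5-minute return time never reaches host 4 within the attempt limit, while the 15-minute return time reaches it after the three 4-minute timeouts (720 seconds).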

How to set

Set on the input server.

This definition takes effect in the distributed execution destination determination processing performed on the specified input server.

The setting procedure is as follows.

For subsystem operation, define it for each subsystem.

Use an editor such as vi to create the definition file and set the values.

Make sure all users have read rights to the created file.

On a clustered system, the following definition files are located on the shared disk. You do not need to create a definition file for each node. Create a definition file on a primary node that can access the shared disk.

Definition file name

When subsystems are not used, or for subsystem 0

/etc/mjes/mjconf.ini

For subsystem 1 - 9

/etc/mjes/mjesN/mjconf.ini

N: 1 - 9

File format

[Disthost]

Revival=nnnn

[Disthost]

To use this function, specify this section.
If omitted, the return time is 300 seconds (the default value).

[Revival=nnnn]

Specify nnnn as the number of seconds to wait before a distribution-destination configuration host is returned after a connection failure occurs on it. Specify this value in the range of 300 - 3600 (seconds).
This key is optional. If this key is omitted, the definition file is not created, or the specified file format is invalid, the default value is 300 (seconds).

Notes

Do not include extra characters such as tabs or spaces before, after, or within section names and key names.

Do not create a file in any other format, such as omitting the value for the key name.

All configured host groups are affected when this function is set.

If each group has a different number of configured hosts, use the information from the configured host group that has the most registered hosts for the number of hosts in the formula shown in "How to estimate settings".

If a definition file with the same name already exists and contains definitions for other functions, the definition for this function can be added to the same file alongside them.
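Because an invalid format silently falls back to the default of 300 seconds, it may help to check the file after editing it. The following is a minimal sketch of such a check, not a tool supplied with the product: check_revival_conf is a hypothetical helper name, and it only verifies that a [Disthost] line and a Revival=nnnn line exist with no extra whitespace and that the value is in range. It does not verify that the key is located inside the section.

```shell
#!/bin/sh
# check_revival_conf FILE
# Hypothetical helper: checks the [Disthost]/Revival lines of a definition file.
check_revival_conf() {
    conf=$1
    # The section header must match the whole line, with no tabs or spaces.
    grep -qx '\[Disthost\]' "$conf" || { echo "missing [Disthost] section"; return 1; }
    # The key line must be exactly Revival=<digits>.
    value=$(sed -n 's/^Revival=\([0-9][0-9]*\)$/\1/p' "$conf" | head -n 1)
    [ -n "$value" ] || { echo "missing or malformed Revival key"; return 1; }
    # The value must be within the documented range of 300 - 3600 seconds.
    if [ "$value" -lt 300 ] || [ "$value" -gt 3600 ]; then
        echo "Revival=$value is outside 300 - 3600"
        return 1
    fi
    echo "ok: Revival=$value"
}

# Try the check on a scratch copy rather than the live definition file.
sample=$(mktemp)
printf '[Disthost]\nRevival=1800\n' > "$sample"
check_revival_conf "$sample"
rm -f "$sample"
```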

Setting examples

Example of setting the definition file

[Disthost]
Revival=1800

Example of setting read permissions

# chmod 444 /etc/mjes/mjconf.ini
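The two steps above can also be combined in a short script. This is a sketch under stated assumptions, not a procedure from the product: write_revival_conf is a hypothetical helper name, and it overwrites the target file, so it is not suitable as-is when the file already contains definitions for other functions. The usage lines write to a scratch file instead of the real definition file.

```shell
#!/bin/sh
# write_revival_conf FILE SECONDS
# Hypothetical helper: writes a definition file that contains only the
# [Disthost] section and makes it readable by all users.
write_revival_conf() {
    conf=$1       # for example /etc/mjes/mjconf.ini or /etc/mjes/mjesN/mjconf.ini
    revival=$2    # return time in seconds, 300 - 3600
    case $revival in
        ''|*[!0-9]*) echo "Revival must be a number" >&2; return 1 ;;
    esac
    if [ "$revival" -lt 300 ] || [ "$revival" -gt 3600 ]; then
        echo "Revival must be in the range 300 - 3600" >&2
        return 1
    fi
    # No tabs or spaces around the section name, key name, or value.
    printf '[Disthost]\nRevival=%s\n' "$revival" > "$conf" || return 1
    # All users need read permission on the file.
    chmod 444 "$conf"
}

# Try it against a scratch file rather than the live path.
tmp=$(mktemp)
write_revival_conf "$tmp" 1800 && cat "$tmp"
rm -f "$tmp"
```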

When the setting takes effect

The settings take effect for jobs that are executed after the mjconf.ini definition file is created.

You do not need to stop the Systemwalker Operation Manager daemon.

Unset Settings

Delete the "Revival" key that was set.

How to estimate settings

The return time to the configured host is calculated using the following formula.

If the calculated result is less than or equal to 300 seconds, no change is required.

(Number of hosts registered as configuration hosts -1) * n seconds

n: Connection timeout value (seconds)

If the OS setting value is smaller than the connection timeout value set by Action 7 of "5.3.2 Failed to Execute the Network Job (Error Message:MJS881S is Output)", the operation times out with the OS setting value, so use the OS setting value as n.

Example) Maximum number of configuration hosts that can be registered (10 hosts)

If n is 34 seconds or more, a value larger than the calculated value is set as the return time.

With 10 configured hosts, the formula gives the following result, so the smallest value of n for which the result exceeds 300 seconds is 34 seconds. Therefore, if n is 33 seconds or less, no change is required.

(10-1) * 34 = 306

If n is 100 seconds, the formula gives the following result, so set "Revival=900".

(10-1) * 100 = 900
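The estimate can be reproduced with shell arithmetic. The sketch below is illustrative only: estimate_revival is a hypothetical helper name, and capping the result at 3600 is an assumption based on the documented upper limit of the Revival key. It returns the calculated value as the smallest candidate for Revival, or 300 when no change is required.

```shell
#!/bin/sh
# estimate_revival HOSTS N
#   HOSTS  number of hosts in the configured host group with the most hosts
#   N      effective connection timeout in seconds
# Hypothetical helper implementing (HOSTS - 1) * N.
estimate_revival() {
    hosts=$1
    n=$2
    result=$(( (hosts - 1) * n ))
    if [ "$result" -le 300 ]; then
        # The default of 300 seconds is sufficient; no change is required.
        echo 300
    elif [ "$result" -gt 3600 ]; then
        # Assumption: cap at the documented maximum for Revival.
        echo 3600
    else
        echo "$result"
    fi
}

estimate_revival 10 33    # (10-1) * 33 = 297, prints 300
estimate_revival 10 34    # prints 306
estimate_revival 10 100   # prints 900
```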