ETERNUS SF Storage Cruiser User's Guide 13.2 - Solaris (TM) Operating System / Linux / Microsoft(R) Windows(R) -

Chapter 7 Performance Management

7.1 Overview

This product supports performance management functionality for Fibre Channel switches and ETERNUS disk array devices. This functionality enables users to get details about the operation and load statuses of devices. However, this product does not support performance management functionality for mainframe volumes, MVV, and SDV of ETERNUS disk array devices.

The performance information can be referenced using Systemwalker Service Quality Coordinator. However, some performance information is not supported. For details, refer to the Systemwalker Service Quality Coordinator manual.

For details about supported devices, refer to "1.3.5 Support levels".

7.1.1 Performance Information Types

The following information can be managed. Performance monitoring can be set at different intervals; refer to "7.2.3 Setting Monitoring Intervals" for the settings available for each device.

  1. Fibre Channel switch

    Performance information (Unit)                         Fibre Channel switch
    Port
      Transfer rates of send/receive data (MB/S)           supported
      Number of CRC errors                                 supported

  2. ETERNUS disk array device

    Device columns:
      A: ETERNUS8000, ETERNUS4000 (Except ETERNUS4000 model 80/100)
      B: ETERNUS6000
      C: ETERNUS4000 model 80/100, ETERNUS3000 (Except model 50), ETERNUS GR series (ETERNUS GR720 or higher)
      D: ETERNUS2000

    Performance information (Unit)                          A              B              C              D
    LUN, LogicalVolume, RAIDGroup
      Read count and write count (IOPS / IO per second)     supported      supported      supported      supported
      Read and write data transfer rate (MB/S)              supported      supported      supported      supported
      Average response time for read and write (msec)       supported      supported      supported      supported
      Read, pre-fetch, and write cache hit rate (%)         supported      supported      supported      supported
    Disk drive
      Disk busy rate (%)                                    supported      supported      supported      supported
    CM
      Load (CPU usage) rate (%)                             supported      supported      supported      supported
      Copy remaining amount (GB)                            supported      supported      not supported  supported
    CA
      Load factor (CPU usage rate) (%)                      not supported  supported      not supported  not supported
      Read count and write count (IOPS / IO per second)     supported      supported      not supported  not supported
      Read and write data transfer rate (MB/S)              supported      supported      not supported  not supported
    CM Port
      Read count and write count (IOPS / IO per second)     not supported  not supported  not supported  supported
      Read and write data transfer rate (MB/S)              not supported  not supported  not supported  supported
    DA
      Load factor (CPU usage rate) (%)                      not supported  supported      not supported  not supported
      Read count and write count (IOPS / IO per second)     not supported  supported      not supported  not supported
      Read and write data transfer rate (MB/S)              not supported  supported      not supported  not supported

7.1.2 Performance Graph Window Types

This software product also provides graph windows with the following time units:

  1. One-hour Graph window

    Based on the time selected as the performance monitoring interval, a line graph for a one-hour period is displayed.

    Examples are as follows:

    If the performance monitoring interval is 30 seconds, a line graph for a one-hour period with values obtained at an interval of 30 seconds is displayed.

    If the performance monitoring interval is 60 seconds, a line graph for a one-hour period with values obtained at an interval of one minute is displayed.

    If the performance monitoring interval is 300 seconds, a line graph for a one-hour period with values obtained at an interval of five minutes is displayed.

    If the performance monitoring interval is 600 seconds, a line graph for a one-hour period with values obtained at an interval of ten minutes is displayed.

    Values displayed on the graph are mean values for the performance monitoring interval. However, the CM copy remaining volume graph displays information obtained at the time the graph was produced.

  2. One-day Graph window

    Based on the mean values for 10-minute periods, a line graph for a one-day period is displayed.

  3. One-week Graph window

    Based on the mean values per hour, a line graph for a one-week period is displayed.
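The sampling arithmetic behind the three graph windows can be sketched as follows. This is only an illustration of the intervals described above; the function names are hypothetical and are not part of this software product.

```python
# Illustrative sketch: how many values each graph window type plots.
# Function names are hypothetical; only the intervals come from this section.

SECONDS_PER_HOUR = 3600


def points_in_one_hour_graph(monitoring_interval_s: int) -> int:
    """One value per performance monitoring interval over one hour."""
    return SECONDS_PER_HOUR // monitoring_interval_s


def points_in_one_day_graph() -> int:
    """One mean value per 10-minute period over 24 hours."""
    return 24 * 60 // 10


def points_in_one_week_graph() -> int:
    """One mean value per hour over 7 days."""
    return 7 * 24


if __name__ == "__main__":
    for interval in (30, 60, 300, 600):
        print(interval, "seconds:", points_in_one_hour_graph(interval), "points")
    print("one day:", points_in_one_day_graph(), "points")
    print("one week:", points_in_one_week_graph(), "points")
```

For example, a 30-second monitoring interval yields 120 values in a One-hour Graph window, while a 600-second interval yields 6.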

7.1.3 Threshold Monitoring Types

Threshold monitoring is supported for fibre channel switch and ETERNUS disk array devices.

The threshold monitoring functionality sends an alarm or report when a storage or switch performance value reaches a certain level (threshold value) under certain conditions in daily transaction operations.

The advantage of threshold monitoring is that symptoms of a storage or switch performance drop caused by changes in data processing rates and transaction processing rates are detected automatically and reliably in daily transaction operations.

Threshold monitoring helps keep operations running in the most suitable environment. It does so by preventing the adverse effects of performance drops through early detection of bottleneck locations, identification of their causes, and improvement of the device configuration.

The threshold monitoring functionality can manage the following information:

  1. Fibre Channel switch

    Port throughput (%)

    The port throughput (MB/s) is monitored as a percentage (%) of the maximum transfer capability (MB/s) of the port, and the threshold is specified as an allowable percentage.

  2. ETERNUS disk array device

    Response time (msec) of LUN (OLU)

    Average use (busy) rate (%) of RAIDGroup (RLU, LUN_R)

    CM load (CPU usage) rate (%)
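As a minimal sketch of the port throughput item above, the percentage can be computed from a measured throughput and the maximum transfer capability of the port. The function names, the 200 MB/s capability, and the "reached" comparison (greater than or equal) are assumptions for illustration; the product applies additional conditions that are configured in the threshold settings.

```python
# Illustrative sketch: port throughput expressed as a percentage of the
# maximum transfer capability. Names and sample values are hypothetical.


def port_throughput_percent(throughput_mb_s: float, max_transfer_mb_s: float) -> float:
    """Return throughput as a percentage of the maximum transfer capability."""
    if max_transfer_mb_s <= 0:
        raise ValueError("max_transfer_mb_s must be positive")
    return 100.0 * throughput_mb_s / max_transfer_mb_s


def reaches_threshold(throughput_mb_s: float, max_transfer_mb_s: float,
                      threshold_percent: float) -> bool:
    """True when the throughput percentage has reached the threshold value."""
    return port_throughput_percent(throughput_mb_s, max_transfer_mb_s) >= threshold_percent


if __name__ == "__main__":
    # Assumed example: 150 MB/s measured on a port capable of 200 MB/s.
    print(port_throughput_percent(150.0, 200.0))
    print(reaches_threshold(150.0, 200.0, threshold_percent=70.0))
```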

The threshold monitoring functionality provides the Threshold Monitoring Alarm Log and Condition Report windows.

From all devices monitored by the functionality, the Threshold Monitoring Alarm Log window displays a list of threshold monitoring alarm items detected on individual devices.

The Condition Report window provides the following four windows:

  1. Logical Volume response time error

    This is displayed if the Logical Volume response time of a monitored device is found to have reached the state specified in threshold settings. Guidelines for actions to be taken are also displayed.

  2. RAID Group load error

    This is displayed if the RAID Group utilization of a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.

  3. CM load error

    This is displayed if the CM load ratio of a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.

  4. Port throughput load error

    This is displayed if the send/receive usage rate of a port of a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.

7.2 Flow of Performance Management

When a user gives an instruction for performance management of a target device from a GUI window, the performance management unit issues SNMP Traps periodically through a LAN to devices to obtain performance information, and it saves the information as performance data on the administrative server. This software product displays the performance data in the Performance Management window and manages the device.

7.2.1 Checking disk space on the administrative server

To conduct performance monitoring, sufficient disk space is required on the administrative server for performance data storage. Refer to the Installation Guide and make sure that sufficient disk space is available. This software product can delete performance data that is older than the specified retention period. The default retention period is seven days, and data older than this period is automatically deleted. This period can be modified. To change the number of days for which performance data is stored, refer to "7.7 Definition File".

7.2.2 Instruction for performance management

To display the dialog for setting the monitoring state, click the target device in the SAN view of the resource view, and then select [Device(D)]-[Performance management(S)] from the menu, or right-click and select [Performance management] from the popup menu.

ETERNUS disk array performance management settings window

Fibre Channel switch performance management settings window

In ETERNUS disk arrays, enter the minimum and maximum numbers of the Logical Volumes (LUN_V) for which performance information is to be obtained. Specifying this range reduces the disk space used to save performance data and reduces the load of obtaining performance information, because no more space than necessary is allocated for Logical Volumes. Consequently, it is recommended to specify the minimum range of Logical Volumes needed for obtaining performance data.

If the device configuration has changed, update the device configuration information maintained by the performance management functionality. For details about the update procedure, refer to "7.2.11 Updating configuration information".

7.2.3 Setting monitoring intervals

Enter the interval at which performance information is obtained in the ETERNUS disk array and Fibre Channel switch common settings. You can specify 5, 10, 30, 60, 300, or 600 seconds as the interval. However, the intervals that can be specified vary depending on the device model and the number of Logical Volumes whose performance information is obtained.

Monitoring condition: device model name and number of LogicalVolumes whose performance is maintained

Device model name
    Number of LogicalVolumes whose performance is maintained    Specifiable interval

ETERNUS4000(M80,100), ETERNUS3000, ETERNUS2000, GR740,820,840, GR720,730
    128 or less                                                 5/10/30/60/300 seconds
    129 to 2,047                                                30/60/300 seconds
    2,048 or more                                               60/300 seconds

ETERNUS6000
    64 or less                                                  10/30/60/300 seconds
    65 to 2,047                                                 30/60/300 seconds
    2,048 or more                                               60/300 seconds

ETERNUS8000, ETERNUS4000(Except for M80,100)
    256 or less                                                 30/60/300/600 seconds
    257 to 1,024                                                60/300/600 seconds
    1,025 to 8,192                                              300/600 seconds
    8,193 or more                                               600 seconds

ETERNUS SN200, ETERNUS SN200 MDS
    -                                                           5/10/30/60 seconds
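The table above can be read as a lookup from device model and LogicalVolume count to the specifiable intervals. The following is a minimal sketch of that lookup. The dictionary layout and the function name are hypothetical; only the model names, volume ranges, and intervals are taken from the table.

```python
# Illustrative sketch: look up the specifiable monitoring intervals (seconds).
# Each tier is (upper bound on LogicalVolume count, intervals); None means
# "no upper bound". Data layout and names are hypothetical.

_TIERS_A = [(128, (5, 10, 30, 60, 300)), (2047, (30, 60, 300)), (None, (60, 300))]
_TIERS_B = [(64, (10, 30, 60, 300)), (2047, (30, 60, 300)), (None, (60, 300))]
_TIERS_C = [(256, (30, 60, 300, 600)), (1024, (60, 300, 600)),
            (8192, (300, 600)), (None, (600,))]
_TIERS_SWITCH = [(None, (5, 10, 30, 60))]  # volume count does not apply

TIERS_BY_MODEL = {
    "ETERNUS4000(M80,100)": _TIERS_A,
    "ETERNUS3000": _TIERS_A,
    "ETERNUS2000": _TIERS_A,
    "GR740,820,840": _TIERS_A,
    "GR720,730": _TIERS_A,
    "ETERNUS6000": _TIERS_B,
    "ETERNUS8000": _TIERS_C,
    "ETERNUS4000(Except for M80,100)": _TIERS_C,
    "ETERNUS SN200": _TIERS_SWITCH,
    "ETERNUS SN200 MDS": _TIERS_SWITCH,
}


def specifiable_intervals(model: str, logical_volume_count: int = 0) -> tuple:
    """Return the intervals allowed for a model and LogicalVolume count."""
    for upper_bound, intervals in TIERS_BY_MODEL[model]:
        if upper_bound is None or logical_volume_count <= upper_bound:
            return intervals
    raise ValueError("no tier matched")  # not reached while the last tier is open-ended


if __name__ == "__main__":
    print(specifiable_intervals("ETERNUS6000", 65))
    print(specifiable_intervals("ETERNUS SN200"))
```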


When you click [Start] in this dialog, an instruction to obtain performance information is issued to the performance management unit (see the figure in "Flow of Performance Management"), and the performance management unit obtains performance information of the device through the LAN and saves it as performance data. Since the performance management unit is started as a daemon of the administrative server, the unit continues obtaining performance information while the administrative server is active, even if no GUI window is displayed.

The logical configuration of the storage device is recognized, and the obtaining of performance information starts. At the start of obtaining performance information of the selected storage system, an amount of time (tens of seconds to several minutes) is spent to obtain the logical configuration before any performance information is actually obtained.

When performance monitoring starts, the "P" mark appears in green at the upper left of the device icon on the map display. If the Performance Management window is open and the corresponding Fibre Channel switch and storage system are displayed in the tree, the device name is displayed in the same color as that of the "P" mark.

The table below lists "P" mark colors and their corresponding statuses and actions to be taken. The displayed color may not reflect the current status. Click [Refresh] on the GUI window or press the F5 key to check the latest status.

"P" mark color   Status                                      Appropriate action
Green            Performance is being monitored. (Normal)    Performance is being monitored.
Yellow           Performance monitoring is being recovered   The administrative server cannot communicate with a device.
                 (e.g. device time-out).                     Check the network status and device status. Log off if
                                                             ETERNUSmgr/GRmgr is in a login state.
Red              Writing to the performance information      Check the write permission to the file and the capacity of
                 file failed.                                the file system.
                 Wrong registered password of GR             Reregister devices with this software product, and restart
                                                             performance monitoring.
                 Internal error                              Contact a FUJITSU maintenance engineer.


 

7.2.4 Starting the performance management window

Start the Performance Management window to display performance information. To open the Performance Management window, select [File]> [Performance Management Window] from the GUI menu, or right-click and select [Performance Management Window] from the popup menu.

7.2.5 Displaying performance information of the selected device

You can display the performance information of a device in the Performance Management window by dragging a device icon displayed in the resource view and dropping it in the Performance Management window. You can create multiple Performance Management windows. Also, multiple device icons can be dropped in a single Performance Management window to display information about those devices.

7.2.6 Displaying Fibre Channel switch performance information

From the device tree view in the Performance Management window, select and right-click the port number of the Fibre Channel switch whose performance information you want to display, and then select [Show Performance Graph] from the resulting popup menu.

The dialog shown below appears. In the dialog, select the items to be displayed.

Time Axis

Select the time interval for a graph that you want to display. Select one hour, one day, or one week.
Refer to "7.1.2 Performance Graph Window Types" for details.

Date Specification

Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.

Throughput

Displays a data transfer rate (MB/S). Here, select an information type. Select one of the following three: (1) displaying the graphic window for the transmitting-side performance of a port, (2) displaying the graphic window for the receiving-side performance, and (3) displaying one graphic window where both transmission and reception performances are displayed at the same time. The transmission and reception performances can be selected at the same time but, if the combined transmission-reception performances are selected, neither the individual transmission performance nor the reception performance can be selected.

Error

Displays the graph of a CRC error count that occurs in the port.

The window shown below displays a result of selections in the above dialog. One-hour Graph windows of send throughput and receive throughput are displayed. Using the graphs, you can determine the operating status of the port. See "B.10.5 Graph window functions".

To display performance information of multiple ports at the same time, click multiple ports in the tree while holding down the Ctrl or Shift key, and then right-click a selected port to display a graph.

In this case, "Open Window for Every Port" is added to the dialog displayed immediately before a graph is displayed. If you check the check box, one window for each port opens.

If you do not check the check box, you can select "Total of Throughput" in the dialog. If you select it, the total values of the ports are displayed in a graph. Otherwise, values of each port are displayed in the same graph window. Incidentally, if "Send/Receive" is selected, "Total of Throughput" must be selected.

The window shown below is an example where "Open Window for Every Port" and "Total of Throughput" are not selected. To check the correspondence between ports and lines in the graph in the window, select a button for a line in the graph. In this example, port 14 is frequently used.

7.2.7 Displaying storage performance information

When the ETERNUS disk array device icon is dragged and dropped to the performance management window, a storage logic configuration tree will be displayed as below.

"AffinityGroup" indicates a number of the zone functionality of the selected storage system.

"LUN" indicates a logical unit number from the point of view of the server node. Since this is allocated with Logical Volume (OLU and LUN_V) that specifies a number unique to a device managed in the device, this is expressed as "LUN X(Logical Volume X)" in the tree.

"RAID Group" located under "LUN" indicates that LUN is included in "RAID Group" (rank). [Disk](=physical drive) under [RAIDGroup] or [RAIDGroup [X- X]] indicates the drive used to configure the rank. [LogicalVolume] under [RAIDGroup] or [RAIDGroup [X- X]] indicates the numbers of other LogicalVolumes that belong to the same RAIDGroup. [RAIDGroup X- X] also has devices that are not shown.

The properties are displayed as tool tips. For details about items that can be checked in these tool tips, refer to "B.10.3 The tree view".

Figures beginning with "0x" are values expressed in hexadecimal notation. Other numbers are decimal numbers.

7.2.7.1 Displaying LUN and RAIDGroup performance information

From the device tree in the Performance Management window, select the number of the LUN or RAID Group whose performance information you want to display, right-click to display a popup menu, and select [Show Performance Graph].

You can select multiple numbers. To do so, click LUN or RAID Group while holding down the Ctrl or Shift key, right-click, and select [Show Performance Graph].

The dialog shown below appears. In the dialog, select the graph window to be displayed.

Time Axis

Select the time interval for a graph that you want to display. Select one hour, one day, or one week.
Refer to "7.1.2 Performance Graph Window Types" for details.

Date Specification

Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.

IOPS

Indicates how many times I/O is issued per second.

Throughput

Displays a data transfer rate (MB/S).

Response time

Displays an average I/O processing time (ms).

Cache hit rate

Displays a ratio (%) at which cache is hit.

* For the IOPS, throughput, and response time, one of the following three can be selected: (1) displaying a READ graphic window, (2) displaying a Write graphic window, and (3) displaying one graphic window where R/W (Read and Write information) items are displayed at the same time. Read and Write can be selected at the same time but, if R/W is selected, the individual Read and Write graphic windows cannot be selected.

* For the cache hit ratio, one of the following four can be selected: (1) displaying a Read hit-ratio graphic window, (2) displaying a Write hit-ratio graphic window, (3) displaying a pre-fetch hit-ratio graphic window, and (4) displaying one graphic window where all R/W/P information (Read, Write, and Pre-fetch hit ratios) is displayed at the same time. Read, Write and pre-fetch can be selected at the same time but, if R/W/P is selected, the individual Read, Write, and pre-fetch graphic windows cannot be selected.

If multiple logical units are selected to be displayed on a graph, "Open Window for Every LUN" is displayed. Select it to open one graph window for each LUN.

If it has not been selected, "Total" is displayed in the dialog. If you select "Total," the "Total" graph appears. Otherwise, the information about multiple units is displayed in the same graph window. If "R/W/P" or "R/W" is selected, "Total" must be selected.

7.2.7.2 Displaying disk (physical drive) performance information

From the device tree view in the Performance Management window, select the number of the disk whose performance you want to display, right-click to display a popup menu, and select [Show Performance Graph].

You can select multiple disks. To select multiple disks, click multiple disks while holding down the Ctrl or Shift key, right-click and select [Show Performance Graph].

The dialog shown below appears. In the dialog, select the graph window to be displayed.

Time Axis

Select the time interval for a graph that you want to display. Select one hour, one day, or one week.

Refer to "7.1.2 Performance Graph Window Types" for details.

Date Specification

Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.


If multiple disks are specified for displaying a graph, "Open Window for Every Disk" is displayed in the dialog. If you select it, one graph window opens for each disk. Otherwise, the information about multiple disks is displayed in the same graph window.

7.2.7.3 Module performance view

To display a performance graph, select the module (CM, CA, CMPort, or DA) in the Performance Management window, right-click to display the popup menu, and then click [performance graph display]. Multiple modules can be selected by holding down the [Ctrl] key or the [Shift] key while clicking the modules. When DA is selected, the DA Performance Graph dialog shown below is displayed. When CA or CM Port is selected, the respective performance graph dialog is displayed.

From the performance graph dialog, select the options for the particular graph you wish to be displayed in the graph window.

Time Axis

Select the time interval for a graph that you want to display. Select one hour, one day, or one week.
Refer to "7.1.2 Performance Graph Window Types" for details.

Date Specification

Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.

CPU

Displays the CPU usage (%) of DA or CA.

IOPS

Displays the number of I/O issued per second of DA, CA port, or CM Port.

Throughput

Displays the data transfer volume (MB/S) of DA, CA port, or CM Port.

When CM is selected, the following CM Performance Graph dialog is displayed.

A chart window can be selected on this dialog.

Time Axis

Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.

Date Specification

Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.

Graph

Load

Displays the CPU usage (%) of CM module.

Copy Residual Quantity

Displays the remaining copy volume (GB) of advanced copy (EC/OPC). When both EC and OPC are operating, a total of the remaining copy volumes of EC and OPC is displayed.


Selecting "Open Window for Every Port" and "Open Window for Every CM" on the dialog when multiple items are selected displays chart windows for respective modules.

7.2.8 Operating graph windows

If the amount of performance data is large (in particular, when an ETERNUS disk array device RAIDGroup or multiple items are selected), or if the load on the LAN is heavy, displaying a graph after the [Previous Hour] or [Next Hour] button is clicked may take a long time. In such cases, right-click the graph window to open a popup menu. The popup menu has a command for opening a graph window in which the time range of the graph is changed. Use this command to open the One-day Graph window and view the hour-by-hour trend. Then move the cursor in the One-day Graph window to the time that you want to check, right-click to display a popup menu, and select [One-hour Graph Window] for a smooth transition to a graph centered on this time.

To display the maximum value graph, click the [Peak] button in the One-day Graph window, or One-week Graph window. You can then move the cursor to the time of the maximum value and right-click to display a popup menu, enabling a smooth transition to a graph centered on this time of the maximum value in the same way as described above.

Refer to "B.10.5 Graph window functions".

7.2.9 Examples of use of performance management

If an I/O delay from the server node to a storage system occurs, the user can check for the cause in the storage system by using the methods described below. These are only examples, so not all causes of I/O delays can be determined by using these methods.

  1. Identify the time when the I/O processing delay occurred and the access path where the delay occurred.

  2. Use this product to check the AffinityGroup number and LUN number of the ETERNUS disk array defined in the target access path.

  3. Using performance management, display and check the target LUN performance values.

  4. If a response of the LUN unit takes a long time, check RAID Group performance. If a response of RAID Group also takes a long time, find another Logical Volume belonging to RAID Group, and find the LUNs to which the Logical Volume is allocated. Check the I/O statuses of these LUNs, and check for a heavy load on RAID Group. If there is a heavy load, move the appropriate LUN to another RAID Group, or take other appropriate action.
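Step 4 above amounts to finding the other LUNs whose Logical Volumes share a RAID Group with the slow LUN. A minimal sketch of that lookup follows. The mapping dictionaries and the (AffinityGroup, LUN) tuple keys are hypothetical stand-ins for the configuration shown in the device tree; this product does not expose such an API in this chapter.

```python
# Illustrative sketch: find LUNs that share a RAID Group with a target LUN.
# All data structures are hypothetical; in practice the mappings are read
# from the storage logical configuration tree.


def luns_sharing_raid_group(target_lun, lun_to_volume, volume_to_raid_group):
    """Return the other LUNs whose Logical Volume is in the same RAID Group."""
    target_group = volume_to_raid_group[lun_to_volume[target_lun]]
    return sorted(
        lun
        for lun, volume in lun_to_volume.items()
        if lun != target_lun and volume_to_raid_group[volume] == target_group
    )


if __name__ == "__main__":
    # Keys are (AffinityGroup number, LUN number); values are Logical Volume numbers.
    lun_to_volume = {(0, 0): 16, (0, 1): 17, (1, 0): 32}
    # Logical Volume number -> RAID Group number.
    volume_to_raid_group = {16: 0, 17: 0, 32: 1}
    print(luns_sharing_raid_group((0, 0), lun_to_volume, volume_to_raid_group))
```

The LUNs returned are the candidates whose I/O statuses should be checked, and which might be moved to another RAID Group if the load is heavy.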

7.2.10 Instruction for stopping performance management

Click the target device on the GUI window. Select [Device]-[Performance management(S)] from the menu, or right-click and select [Performance management] from the resulting popup menu. Then, select [Stop] in the window for setting the monitoring status.

7.2.11 Updating configuration information

Device configuration information is independently maintained in the performance management functionality.

When the device configuration is changed, update the device configuration information that is maintained by the performance management functionality according to the procedure shown below. Also perform the update if the configuration of a device that is subject to performance monitoring or threshold monitoring has been changed.

If the configuration of a device that is subject to performance monitoring or threshold monitoring has been changed, performance monitoring and threshold monitoring continue to use the configuration information from before the change. Performance information and threshold monitoring results cannot be guaranteed until the configuration information is updated by the procedure below.

<Configuration information update procedure>

  1. Record the performance monitoring settings contents (if performance monitoring is used)

    <Recorded settings contents>

  2. Record the threshold monitoring settings contents (if threshold monitoring is used)

    <Recorded settings contents>

  3. Stop threshold monitoring (if threshold monitoring is used)

    Refer to "7.3.7 Instruction for stopping threshold monitoring".

  4. Stop performance monitoring (if performance monitoring is used)

    Refer to "7.2.10 Instruction for stopping performance management".

  5. Change the device configuration.

  6. In the menu bar of the Performance Management window, click [Device] > [Create Device Configuration].

  7. Start performance monitoring based on the settings contents recorded in 1. (if performance monitoring is used).

    Refer to "7.2.2 Instruction for performance management" and "7.2.3 Setting monitoring intervals".

  8. Start threshold monitoring based on the settings contents recorded in 2. (if threshold monitoring is used).

    Refer to "7.3.3 Setting the threshold monitoring hours" and "7.3.4 Setting the threshold monitoring information".

7.2.12 Performance data 

Performance data is saved in CSV files in the following directory of the administrative server:

[Solaris OS version of Manager] /var/opt/FJSVssmgr/current/perf/

[Linux version of Manager] /var/opt/FJSVssmgr/current/perf/

[Windows version of Manager] administrative-server-work-directory\Manager\var\opt\FJSVssmgr\current\perf

You can back up these files by saving the entire directory as necessary, and you can display old information by restoring the directory in the same format.

However, the automatic deletion functionality also operates on restored performance data. Before restoring, check the data retention duration. For the data retention duration, refer to "D.4 perf.conf Parameter".
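As a hedged illustration of saving and restoring the entire directory, the sketch below archives a performance data directory and unpacks it again with the Python standard library. It is not a procedure from this manual: the function names are hypothetical, the zip format is an arbitrary choice, and any archiving tool that preserves the directory structure would serve the same purpose.

```python
# Illustrative sketch: archive and restore a performance data directory.
# Function names are hypothetical; paths are supplied by the caller.
import shutil
from pathlib import Path


def archive_perf_dir(perf_dir, archive_base) -> Path:
    """Archive perf_dir (including the directory itself) as <archive_base>.zip."""
    perf_dir = Path(perf_dir)
    archive = shutil.make_archive(
        str(archive_base), "zip",
        root_dir=str(perf_dir.parent), base_dir=perf_dir.name,
    )
    return Path(archive)


def restore_perf_dir(archive_path, parent_dir) -> None:
    """Unpack the archive so that the saved directory reappears under parent_dir."""
    shutil.unpack_archive(str(archive_path), str(parent_dir))
```

On a Solaris OS or Linux Manager, the directory passed as perf_dir would be /var/opt/FJSVssmgr/current/perf/ as listed above.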

Example: (Solaris OS)

7.3 Flow of Threshold Monitoring

When a user uses the Performance Management window to issue an instruction for threshold monitoring of the devices subject to performance management, the performance management unit of this software product periodically issues SNMP Traps through the LAN to the devices to obtain device performance information. The threshold-monitoring unit then sequentially analyzes the performance information.

If a problem is detected from the performance information, it is displayed as an alarm in the event log in the SAN Management window and displayed in the Threshold Monitoring Alarm Notification Log window of the Performance Management window.

7.3.1 Checking disk space on the administrative server

For threshold monitoring, disk space must be available on the administrative server to save the condition reports of the threshold monitoring alarm notification log. About 4 MB is required. Make sure that the server has sufficient disk space. This software product can delete condition report data whose retention period has passed. Data stored for 366 days or more is automatically deleted. Change this retention duration setting as required. For details, see Alarm Delete Setting in "B.10.6 Threshold monitoring dialog functionality".

7.3.2 Instruction for threshold management

In the tree of the Performance Management window, select the device name node, and then select [Threshold Monitoring] from the menu. A variety of threshold monitoring setup menus appear. To use the menus, however, performance information of the device must already have been obtained by performance management.

Select [Monitoring Enable/Disable(E)]. This enables threshold monitoring, and you can set up different kinds of threshold monitoring.

7.3.3 Setting the threshold monitoring hours

To set a time period for threshold monitoring, select [Monitoring Time Setting]. If no time period is set, thresholds are monitored and alarms are reported for all time periods. Large volumes of threshold monitoring alarm logs may be reported depending on threshold settings. Setting a time period is recommended if performance is a concern in a system environment where load varies considerably depending on how the target device is used.

7.3.4 Setting the threshold monitoring information

Next, select [Threshold Setting/Start Monitoring/Stop Monitoring(S)], define threshold setting information of the target device, and issue an instruction to start monitoring. Threshold monitoring then starts. Since the threshold monitoring unit is started as a daemon of the administrative server along with the performance management unit, it continues threshold monitoring while the administrative server is active, even if no GUI window is displayed. For details about threshold setting information, refer to "B.10.6 Threshold monitoring dialog functionality".

7.3.5 Displaying threshold monitoring alarm logs

To display threshold monitoring alarm logs, open the Performance Management window, select [Threshold monitoring] from the menu bar, and select [Thresholds Alarm Log] from the displayed list. The displayed list is a list of logs of alarms detected by threshold monitoring. To open the Performance Management window, select [File(F)]-[Performance Management Window(S)] from the GUI menu.

7.3.6 Displaying condition reports

A condition report shows the details of an entry in the list displayed by [Thresholds Alarm Log]. From the displayed report, users can determine the appropriate actions and guidelines for each threshold monitoring alarm. To display the details, move the cursor to the line of the threshold monitoring alarm log to be referred to, and double-click the line.

7.3.7 Instruction for stopping threshold monitoring

From the Performance Management window menu in the GUI window, select [Threshold Monitoring(T)]-[Threshold Setting/Start Monitoring/Stop Monitoring(S)]. To end threshold monitoring, click the [Stop] button in the window displayed for setting a threshold.

7.4 Evaluation Criteria for Thresholds in Threshold Monitoring

Storage thresholds

Standard storage thresholds are listed below.

                         Online response-oriented system   Batch throughput-oriented system
LogicalVolume Response   30 ms or less                     -
RAIDGroup Busy Rate      60% or less                       80% or less
CM Busy Rate             80% or less                       90% or less

For a response-oriented system such as one for online applications, keeping LogicalVolume responses within 30 ms is the standard for stress-free storage operation. To keep responses within 30 ms, hold the RAIDGroup busy rate to 60% or less and the CM busy rate to 80% or less.

In a throughput-oriented system such as one for batch applications, LogicalVolume responses can be as short as a few milliseconds because sequential access increases the cache hit ratio. However, the cache hit ratio depends heavily on how the application accesses data, and its value can change a great deal. As a result, responses may vary from a few milliseconds to 50 ms or more. Thus, for batch applications, there is no standard threshold for LogicalVolume responses.

To improve throughput for batch applications, users must optimize the use of storage resources. Note, however, that performance may deteriorate rapidly if the thresholds listed above are exceeded. For this reason, make it a standard to hold the RAIDGroup busy rate to 80% or less and the CM busy rate to 90% or less.

While advanced copy is running, the advanced copy processing itself increases the CM busy rate.
In this case, set thresholds that take the execution of advanced copy into account.
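
The standard thresholds can be pictured as a simple lookup. The following Python sketch is illustrative only and is not part of the product: the dictionary keys, the function name and the measured values are hypothetical, and only the numeric limits come from the table in this section.

```python
# Standard thresholds from section 7.4. None means no standard is defined.
# The key names are hypothetical labels chosen for this sketch.
STANDARD_THRESHOLDS = {
    "online": {"lv_response_ms": 30.0, "raidgroup_busy_pct": 60.0, "cm_busy_pct": 80.0},
    "batch": {"lv_response_ms": None, "raidgroup_busy_pct": 80.0, "cm_busy_pct": 90.0},
}


def exceeded_items(system_type, measured):
    """Return the names of measured items that exceed the standard threshold."""
    limits = STANDARD_THRESHOLDS[system_type]
    return sorted(
        name
        for name, limit in limits.items()
        if limit is not None and name in measured and measured[name] > limit
    )


# Hypothetical measurements for one device.
measured = {"lv_response_ms": 45.0, "raidgroup_busy_pct": 70.0, "cm_busy_pct": 50.0}
print(exceeded_items("online", measured))
print(exceeded_items("batch", measured))
```

The same measurements exceed two standards for an online system and none for a batch system, which is why the type of system should be decided before thresholds are chosen.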

7.5 Examples of Threshold Monitoring

This section provides an overview of threshold monitoring in the form of key examples to enable users to determine what situations require thresholds and the types of thresholds that should be set for them.

Case 1: Online application system at company A

+Material 1: System operation standard and performance requirements (excerpt)

  1. Online application service hours: 8:00 to 18:00 every day

  2. Online application busy hours: 12:00 to 15:00 every day

  3. This system requires that operator terminal operation be stress-free even with workload during the busy hours.
    Therefore, the target performance of I/O response shall be "30 ms or less," which is a general standard.
    The target I/O response performance in hours other than the busy hours shall be "10 ms or less," one third of 30 ms, according to the workload proportion (the workload in the busy hours is about three times higher than that in other hours).

  4. During the busy hours, processing for data reference, updating, and addition may occur concurrently and continue for up to 60 minutes.
    If I/O responses of 30 ms or more occur for a period equivalent to 10% (6 minutes) of that continuous execution, operators at the terminals may experience stress. Therefore, make the settings so that an alarm log is generated when this state occurs.

  5. If I/O responses during the busy hours come down to 10 ms or less, the same as the performance target in other hours, the I/O response delays that occurred previously shall be deemed as instantaneous symptoms.
    Therefore, an alarm log need not be generated when this state occurs.

  6. The event log need not be displayed every time an alarm log is generated but can be displayed only once a day.
    (This is because the system administrator checks the condition report once a day.)

+Illustration of operational status of company A's online application system (transition of LogicalVolume responses)

+An example of threshold monitoring setting for company A's online application system is shown below:

Number corresponding to material 1   Setting item                    Setting
1                                    Threshold Monitoring Time       8:00-18:00
2                                    Alarm Display Time              12:00-15:00
3                                    Target                          LogicalVolume Response
3                                    Threshold                       30 ms
4                                    Threshold monitoring Interval   60 minutes
4                                    Alarm Tolerance Level           Total time: 360 seconds
5                                    Rearm                           10 ms
6                                    Alarm Display Frequency         Day by day
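
The way the threshold, the total-time tolerance and the rearm value in Case 1 interact can be sketched as follows. This is a minimal Python illustration, not the product's algorithm: the function name, the 30-second sampling period and the rule that a sample at or below the rearm value clears the accumulated time are assumptions made for this sketch, based on items 4 and 5 of material 1.

```python
def total_time_alarm(samples_ms, threshold_ms=30.0, rearm_ms=10.0,
                     tolerance_s=360, sample_period_s=30):
    """Return True if time spent at or above the threshold reaches the tolerance.

    samples_ms holds the response times measured within one threshold
    monitoring interval (60 minutes in Case 1). sample_period_s is an assumed
    sampling period, not a value taken from this guide.
    """
    over_s = 0
    for value in samples_ms:
        if value <= rearm_ms:
            # Assumed rearm behavior: earlier delays count as instantaneous.
            over_s = 0
        elif value >= threshold_ms:
            over_s += sample_period_s
            if over_s >= tolerance_s:
                return True
    return False


# Twelve slow samples (12 x 30 s = 360 s) spread over the interval: alarm.
print(total_time_alarm([35.0, 20.0] * 12))
# Responses drop to 5 ms halfway through, so the accumulated time is cleared.
print(total_time_alarm([35.0] * 11 + [5.0] + [35.0] * 11))
```

Because the tolerance is a total time, the slow samples do not need to be consecutive to trigger an alarm in this sketch.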

Case 2: Online shopping system of company B

+Material 2: System operation standard and performance requirements (excerpt)

  1. Online application service hours: 24 hours a day for 365 days a year

  2. Online application busy hours: Cannot be specified.

  3. In this system, the number of accesses gradually increases as the number of member customers grows after the start of the production run. It is assumed that the load on storage will also increase gradually. Measures need to be taken when the busy rate of storage resources (CM and disk) exceeds 60% to 80%.

  4. This system executes credit card transactions every 5 minutes. Therefore, for the five minutes immediately before each transaction, product retrieval and order processing must be executed without stress. If a storage resource remains in a busy state (a state in which the busy rate exceeds 60% to 80%) for five minutes, transactions may be affected. Therefore, make the settings so that an alarm log is generated when this state occurs.

  5. An event log shall be displayed every time an alarm log is generated. The system administrator checks the condition report when an event log is displayed.

+Illustration of operational status of company B's online shopping system (transition of CM busy rate)

+An example of threshold monitoring setting for company B's online shopping system is shown below:

Number corresponding to material 2   Setting item                    Setting
1                                    Threshold Monitoring Time       0:00-24:00
2                                    Alarm Display Time              0:00-24:00
3                                    Target                          CM Busy Rate
3                                    Threshold                       60%
4                                    Alarm Tolerance Level           Continuous time: 300 seconds
5                                    Alarm Display Frequency         All

 

Number corresponding to material 2   Setting item                    Setting
1                                    Threshold Monitoring Time       0:00-24:00
2                                    Alarm Display Time              0:00-24:00
3                                    Target                          RAIDGroup Busy Rate
3                                    Threshold                       80%
4                                    Alarm Tolerance Level           Continuous time: 300 seconds
5                                    Alarm Display Frequency         All
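
In Case 2 the tolerance is a continuous time rather than a total time, so a single sample at or below the threshold interrupts the count. The Python sketch below illustrates that difference under stated assumptions: the function name and the 30-second sampling period are hypothetical, and the logic is a reading of item 4 of material 2, not the product's implementation.

```python
def continuous_time_alarm(samples_pct, threshold_pct, tolerance_s=300,
                          sample_period_s=30):
    """Return True if the busy rate stays above the threshold for tolerance_s.

    sample_period_s is an assumed sampling period for this sketch.
    """
    run_s = 0
    for value in samples_pct:
        if value > threshold_pct:
            run_s += sample_period_s
            if run_s >= tolerance_s:
                return True
        else:
            # Any sample at or below the threshold breaks the continuous run.
            run_s = 0
    return False


# CM busy rate above 60% for ten consecutive samples (300 s): alarm.
print(continuous_time_alarm([65.0] * 10, threshold_pct=60.0))
# The run is broken after 270 s, so no alarm is raised.
print(continuous_time_alarm([65.0] * 9 + [50.0] + [65.0] * 9, threshold_pct=60.0))
```

The same function can be applied to the RAIDGroup busy rate by passing 80 as the threshold.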

Case 3: Batch processing with multiple database servers (cluster system) of company C

+Material 3: System operation standard and performance requirements

  1. System service hours: 24 hours a day, 365 days a year

  2. Batch processing hours: 20:00 to 23:00 every night

  3. This cluster system is an Oracle RAC system consisting of three nodes. There is no problem with the batch processing performance because the amount of processed data is currently small. As the amount of data increases in the future, however, we have concerns over bottlenecks in the performance of FC path transfer between the FC switch and storage.
    If an FC path bottleneck occurs, it must be eliminated quickly.

  4. Treat a state in which the port throughput reaches about 80% of the maximum transfer capability as an FC path bottleneck, and make the settings so that an alarm log is generated when this state continues for 30 minutes or more.

  5. An event log need not be displayed every time an alarm log is generated. It can be displayed only once, even when an alarm log is generated more than once during the batch processing hours. The system administrator checks the condition report when an event log is displayed.

+Illustration of batch processing with multiple database servers (cluster system) at company C (transition of port throughputs)

+An example of threshold monitoring setting for batch processing with multiple database servers (cluster system) at company C is shown below:

Number corresponding to material 3   Setting item                    Setting
1                                    Threshold Monitoring Time       0:00-24:00
2                                    Alarm Display Time              20:00-23:00
3                                    Target                          Port Throughput
3                                    Threshold                       80%
4                                    Alarm Tolerance Level           Continuous time: 1,800 seconds
5                                    Alarm Display Frequency         Every monitoring time
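
The Case 3 threshold is expressed as a percentage of the port's maximum transfer capability, so a measured throughput must first be converted to a utilization value. The following Python sketch shows that arithmetic and the 1,800-second check. The function names, the 200 MB/S maximum and the 60-second sampling period are hypothetical values chosen for illustration, not figures from this guide.

```python
def utilization_pct(throughput_mb_s, max_mb_s):
    """Convert a measured throughput to a percentage of the port maximum."""
    return 100.0 * throughput_mb_s / max_mb_s


def longest_run_seconds(values_pct, threshold_pct=80.0, sample_period_s=60):
    """Return the longest continuous time spent at or above the threshold."""
    longest = run = 0
    for value in values_pct:
        run = run + sample_period_s if value >= threshold_pct else 0
        longest = max(longest, run)
    return longest


MAX_MB_S = 200.0  # hypothetical maximum transfer capability of the port
samples = [utilization_pct(170.0, MAX_MB_S)] * 30  # 85% for 30 samples
# An alarm log would be generated if the run reaches 1,800 seconds.
print(longest_run_seconds(samples) >= 1800)
```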

 

7.6 Condition Report and Corrective Measures for Problems

7.6.1 Delay in LogicalVolume response

Report detail:
A delay in response of LogicalVolume YYYY defined in RAIDGroup XXXX was detected.

The monitoring states of other presumably related targets are as follows:
<Monitoring state of each CM> <- (1)
[CM0x00] High load state detected
[CM0x01] Not detected
[CM0x10] Not detected
[CM0x11] Not detected

RAIDGroup XXXX monitoring state: High load state detected <- (2)

The block size of I/O in which a response delay was detected is 8K bytes. <- (3)

Related graph:
Refer to the LogicalVolume YYYY response time graph.

Guidelines for corrective measure:
1. RAIDGroup is probably in the high load state. Check the alarm of the RAIDGroup busy rate (disk utilization) and refer to the guidelines for corrective measures.
2. CM is probably in the high load state. Check the alarm of the CM busy rate and refer to the guidelines for corrective measures.
3. It is assumed that I/O processing takes time because the block size is large. Review the threshold.

(1)
Indicates the state of each CM during the same time period in which the LogicalVolume response delay was detected.

High load state detected

An alarm log indicating a CM load error was generated during the same time period. If the CM in charge of the LogicalVolume in which the response delay was detected is in the high load state, the response delay is assumed to be due to a CM bottleneck. Refer to the guidelines for corrective measures for alarms for the relevant CM.

Monitoring

The CM busy rate has exceeded the threshold several times and is being tracked for alarm detection, but no event has been detected as an alarm yet.

Not detected

[When the CM busy rate is defined as a monitoring target]
No CM bottleneck occurred during the same time period. Check the RAIDGroup in which the LogicalVolume is defined for alarms.

[When the CM busy rate is not defined as a monitoring target]
The CM busy rate is not monitored.
("Not detected" is displayed irrespective of the CM busy rate.)

(2)
Indicates the state of the RAIDGroup in which the relevant LogicalVolume is defined, during the same time period in which the LogicalVolume response delay was detected.

High load state detected

An alarm log indicating a RAIDGroup load error was generated during the same time period. The response delay is assumed to be due to a bottleneck in the disks that make up the RAIDGroup. Refer to the guidelines for corrective measures for alarms for the relevant RAIDGroup.

Monitoring

The RAIDGroup busy rate has exceeded the threshold several times and is being tracked for alarm detection, but no event has been detected as an alarm yet.

Not detected

[When the RAIDGroup busy rate is defined as a monitoring target]
No bottleneck occurred in the disks that make up the RAIDGroup during the same time period.

[When the RAIDGroup busy rate is not defined as a monitoring target]
The RAIDGroup busy rate is not monitored.
("Not detected" is displayed irrespective of the RAIDGroup busy rate.)

(3)
Indicates the I/O block size at the time the LogicalVolume response delay was detected.
If neither the CM in charge of the relevant LogicalVolume nor the RAIDGroup in which it is defined is in the high load state, the alarm is not attributable to a CM or disk bottleneck but may be attributable to an I/O block size that is unreasonably large for the threshold.

For instance, suppose that a response delay is detected while the threshold of the LogicalVolume response is set to 30 ms, no CM or disk bottleneck has occurred, and the I/O block size is 512K bytes. In this case, the most probable cause of the response delay is the large I/O block size. Generally, the larger the I/O block size, the longer the response delay. When the standard response is set to 30 ms, an I/O block size of 512K bytes or more is a rough indication that a response delay is due to the I/O block size.
Take measures such as reviewing the threshold of the LogicalVolume response or reducing the I/O block size in the application.
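
The reasoning in (1) to (3) amounts to a short decision procedure. The Python sketch below is one way to express it and is illustrative only: the function name and return strings are hypothetical, the state strings are taken from the sample report, and the 512K byte limit is the rough indication given above for a 30 ms threshold.

```python
HIGH_LOAD = "High load state detected"


def diagnose_lv_delay(owner_cm_state, raidgroup_state, block_size_kb,
                      large_block_kb=512):
    """Return the probable causes of a LogicalVolume response delay.

    owner_cm_state is the state of the CM in charge of the LogicalVolume.
    """
    causes = []
    if raidgroup_state == HIGH_LOAD:
        causes.append("RAIDGroup bottleneck: check the RAIDGroup busy rate alarm")
    if owner_cm_state == HIGH_LOAD:
        causes.append("CM bottleneck: check the CM busy rate alarm")
    # The block size is suspected only when neither bottleneck is present.
    if not causes and block_size_kb >= large_block_kb:
        causes.append("Large I/O block size: review the threshold or the block size")
    return causes


print(diagnose_lv_delay("Not detected", "Not detected", 512))
print(diagnose_lv_delay(HIGH_LOAD, HIGH_LOAD, 8))
```

An empty result means that none of the three guidelines applies, in which case the response time graph should be examined directly.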

7.6.2 RAIDGroup load error

Report detail:
RAIDGroup XXXX in high load state was detected.

Related graph:
Refer to the graph of the RAIDGroup XXXX busy rate (disk utilization).
Refer to the IOPS graph about each LogicalVolume in the RAIDGroup XXXX.

Guidelines for corrective measure:
1. I/O may be concentrated on the LogicalVolumes in the same RAIDGroup. <- (1)
Relocate the LogicalVolumes in the relevant RAIDGroup to other RAIDGroups (or newly created RAIDGroups) to distribute I/O.

(1)
Indicates the guidelines for corrective measures for disk bottlenecks.
Take measures such as distributing the I/O load by relocating the data of the LogicalVolume with the highest IOPS in the relevant RAIDGroup to a RAIDGroup with lower disk utilization or to a newly created RAIDGroup.
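
Choosing what to relocate and where can be sketched from the two related graphs. The following Python example is a minimal illustration with hypothetical names and figures: it picks the LogicalVolume with the highest IOPS and the other RAIDGroup with the lowest busy rate, and returns None as the target when no other RAIDGroup exists, which corresponds to creating a new one.

```python
def relocation_candidate(lv_iops, rg_busy_pct, source_rg):
    """Pick the LogicalVolume to move and the RAIDGroup to move it to.

    lv_iops maps LogicalVolumes in the loaded RAIDGroup to their IOPS.
    rg_busy_pct maps every RAIDGroup to its busy rate (disk utilization).
    """
    busiest_lv = max(lv_iops, key=lv_iops.get)
    others = {rg: busy for rg, busy in rg_busy_pct.items() if rg != source_rg}
    target_rg = min(others, key=others.get) if others else None
    return busiest_lv, target_rg


# Hypothetical readings taken from the IOPS and busy rate graphs.
print(relocation_candidate(
    {"LV0": 1200, "LV1": 300},
    {"RG0": 85.0, "RG1": 40.0, "RG2": 55.0},
    source_rg="RG0",
))
```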

7.6.3 CM load error

Report detail:
CM XX in the high load state was detected.

The monitoring states of other presumably related targets are as follows:
<Monitoring state of each CM> <- (1)
[CM0x00] High load state detected
[CM0x01] Not detected
[CM0x10] Not detected
[CM0x11] Not detected

Related graph:
Refer to the CM utilization graph.
Refer to the IOPS graph of each LogicalVolume. <- (2)

Guidelines for corrective measure:
1. I/O may be concentrated on the RAIDGroups under control of the same CM <- (2)
By referring to the monitoring state of each CM, distribute I/O to RAIDGroups under control of a CM with a low load.
2. If the monitoring state of every CM is "Monitoring" or "High load state detected," the number of CMs is probably too small to meet every I/O request.
Consider adding or upgrading hardware.

(1)
Indicates the states of the other CMs during the same time period in which the relevant CM detected a high load state.

High load state detected

An alarm log indicating a load error in the relevant CM was generated during the same time period.

Monitoring

The CM busy rate has exceeded the threshold several times and is being tracked for alarm detection, but no event has been detected as an alarm yet.

Not detected

The relevant CM did not cause a bottleneck during the same time period.

(2)
If a high load state was detected only in the relevant CM, I/O access is concentrated unevenly on that CM.
Refer to the IOPS graphs of RAIDGroups and LogicalVolumes and take measures such as distributing the I/O load across the CMs.
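
The two guidelines for a CM load error depend on the monitoring states of all CMs taken together. The Python sketch below illustrates that choice with hypothetical function and result names; the state strings and CM names are taken from the sample report above.

```python
LOADED_STATES = {"High load state detected", "Monitoring"}


def cm_corrective_measure(cm_states):
    """Choose a guideline from the monitoring state of every CM.

    Returns ("add_or_upgrade_hardware", []) when every CM is loaded, and
    otherwise ("redistribute", [CMs with a low load]).
    """
    if all(state in LOADED_STATES for state in cm_states.values()):
        return "add_or_upgrade_hardware", []
    low_load = sorted(cm for cm, state in cm_states.items()
                      if state not in LOADED_STATES)
    return "redistribute", low_load


states = {
    "CM0x00": "High load state detected",
    "CM0x01": "Not detected",
    "CM0x10": "Not detected",
    "CM0x11": "Not detected",
}
print(cm_corrective_measure(states))
```

Note that "Not detected" is also displayed when the CM busy rate is not a monitoring target, so a CM reported this way is only a candidate for taking on more I/O.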

7.6.4 Port throughput load error

Report detail:
Port X in the high load state was detected.
The maximum transfer rate of the relevant port is 1 Gbps.

Related graph:
Refer to the Port X throughput graph.

Guidelines for corrective measure:
1. I/O is probably concentrated on the same port.
Check the setting of the path of the relevant port, or consider adding a path switch. <- (1)


(1)
The I/O load is concentrated on the relevant port. Access is probably biased toward this port because of a change in logical paths or a setting error made during expansion. Examine the port load balance by referring to the send/receive graphs of all ports of the relevant switch.
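
Examining the port load balance means comparing each port's throughput with the total for the switch. The following Python sketch shows one way to do that arithmetic. The function name, port names and throughput figures are hypothetical, and the values stand for combined send/receive rates read from the graphs.

```python
def port_load_shares(port_mb_s):
    """Return each port's share (%) of the total switch throughput."""
    total = sum(port_mb_s.values())
    if total == 0:
        return {port: 0.0 for port in port_mb_s}
    return {port: 100.0 * rate / total for port, rate in port_mb_s.items()}


# Hypothetical send/receive throughput (MB/S) per port of one switch.
shares = port_load_shares({"port0": 90.0, "port1": 5.0, "port2": 5.0})
print(max(shares, key=shares.get), shares)
```

A port that carries most of the total while its neighbors are nearly idle suggests a path setting problem rather than a shortage of bandwidth.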

7.7 Definition File

Items for performance management can be set in the definition file perf.conf.

Refer to "Appendix D.4 perf.conf parameter" for the settings.



All Rights Reserved, Copyright(C) FUJITSU LIMITED 2008