ETERNUS SF Storage Cruiser User's Guide 13.2 - Solaris (TM) Operating System / Linux / Microsoft(R) Windows(R) -

Chapter 7 Performance Management

7.1 Overview

This product supports performance management functionality for Fibre Channel switches and ETERNUS disk array devices. This functionality enables users to obtain details about the operation and load statuses of devices. However, this product does not support performance management for ETERNUS disk array device mainframe volumes, MVVs, or SDVs.

The performance information can be referenced using Systemwalker Service Quality Coordinator. However, some performance information is not supported. For details, refer to the Systemwalker Service Quality Coordinator manual.

For details about supported devices, refer to "1.3.5 Support levels".

Do not execute performance monitoring for a single device from more than one administrative server(*) at the same time.
* The Softek Storage Cruiser administrative servers, the ETERNUS SF Storage Cruiser administrative servers or the Systemwalker Resource Coordinator administrative servers
The performance management functionality always starts when the administrative server starts. For devices for which performance management settings have been configured, performance information collection starts in the background. Performance information is therefore collected regardless of whether the GUI window is displayed. To stop performance information collection, stop performance management.

7.1.1 Performance Information Types

The following information can be managed. Performance monitoring can be set at different intervals; for the intervals available for each device, refer to "7.2.3 Setting Monitoring Intervals".

Fibre Channel switch

Unit	Performance information (unit of measure)	Fibre Channel switch
Port	Send/receive data transfer rate (MB/s)	supported
	Number of CRC errors	supported

ETERNUS disk array device

Unit	Performance information (unit of measure)	ETERNUS8000, ETERNUS4000 (except model 80/100)	ETERNUS6000	ETERNUS4000 model 80/100, ETERNUS3000 (except model 50), ETERNUS GR series (GR720 or higher)	ETERNUS2000
LUN, LogicalVolume, RAIDGroup	Read count and write count (IOPS)	supported	supported	supported	supported
	Read and write data transfer rate (MB/s)	supported	supported	supported	supported
	Average response time for read and write (msec)	supported	supported	supported	supported
	Read, pre-fetch, and write cache hit rate (%)	supported	supported	supported	supported
Disk drive	Disk busy rate (%)	supported	supported	supported	supported
CM	Load (CPU usage rate) (%)	supported	supported	supported	supported
	Copy remaining amount (GB)	supported	supported	not supported	supported
CA	Load (CPU usage rate) (%)	not supported	supported	not supported	not supported
	Read count and write count (IOPS)	supported	supported	not supported	not supported
	Read and write data transfer rate (MB/s)	supported	supported	not supported	not supported
CM Port	Read count and write count (IOPS)	not supported	not supported	not supported	supported
	Read and write data transfer rate (MB/s)	not supported	not supported	not supported	supported
DA	Load (CPU usage rate) (%)	not supported	supported	not supported	not supported
	Read count and write count (IOPS)	not supported	supported	not supported	not supported
	Read and write data transfer rate (MB/s)	not supported	supported	not supported	not supported

When RAID Consolidation is performed on an ETERNUS6000, the RAIDGroup response time is not displayed.
Performance information cannot be displayed for LogicalVolumes (LUNs) created by LUN Concatenation on an ETERNUS disk array device, or for RAIDGroups that contain such LogicalVolumes.
Performance information is not supported for ETERNUS disk array mainframe volumes or for MVVs/SDVs, and performance information values for RAIDGroups that contain SDVs cannot be guaranteed.
On ETERNUS6000 devices, CA performance information relates to the FC-CA or FC-RA port. Performance information obtained while REC or ROPC is running is displayed as Write performance information, regardless of whether the FC-RA port is set up on the Initiator side or the Target side.

7.1.2 Performance Graph Window Types

This software product also provides graph windows with the following time units:

One-hour Graph window

Based on the time selected as the performance monitoring interval, a line graph for a one-hour period is displayed.

Examples are as follows:

If the performance monitoring interval is 30 seconds, a line graph for a one-hour period with values obtained at an interval of 30 seconds is displayed.

If the performance monitoring interval is 60 seconds, a line graph for a one-hour period with values obtained at an interval of one minute is displayed.

If the performance monitoring interval is 300 seconds, a line graph for a one-hour period with values obtained at an interval of five minutes is displayed.

If the performance monitoring interval is 600 seconds, a line graph for a one-hour period with values obtained at an interval of ten minutes is displayed.

Each value displayed on the graph is the mean value over one performance monitoring interval. However, the CM copy remaining amount graph displays the value obtained at the time of collection rather than a mean.
One-day Graph window

Based on the mean values for 10-minute periods, a line graph for a one-day period is displayed.
One-week Graph window

Based on the mean values per hour, a line graph for a one-week period is displayed.
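The graph windows above differ mainly in how samples are aggregated. As a rough sketch (not the product's implementation), the following Python computes how many points a One-hour Graph plots at a given monitoring interval and averages samples into fixed windows, as described for the One-day (10-minute means) and One-week (hourly means) graphs. It assumes samples are (epoch-seconds, value) pairs and that windows are aligned to multiples of the window length; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

ONE_DAY_WINDOW_S = 10 * 60   # One-day Graph: 10-minute means
ONE_WEEK_WINDOW_S = 60 * 60  # One-week Graph: hourly means


def points_per_hour(interval_s: int) -> int:
    """Number of samples a One-hour Graph plots at the given monitoring interval."""
    return 3600 // interval_s


def window_means(samples: Iterable[tuple[int, float]], window_s: int) -> dict[int, float]:
    """Average (epoch_seconds, value) samples into fixed windows of window_s seconds.

    Windows are keyed by their start time and aligned to multiples of window_s
    (an assumption; the product's own alignment is not documented here).
    """
    sums: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for ts, value in samples:
        start = ts - ts % window_s
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


if __name__ == "__main__":
    # Hypothetical 30-second samples over 20 minutes; the value rises by 1 per sample.
    samples = [(t, float(t // 30)) for t in range(0, 1200, 30)]
    print(points_per_hour(30))                      # 120
    print(window_means(samples, ONE_DAY_WINDOW_S))  # {0: 9.5, 600: 29.5}
```

Averaging over wider windows smooths out short spikes, which is one reason the One-day and One-week views can look flatter than the One-hour view of the same period.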

If daylight saving time is enabled in the operating system, the part of the graph that shows data around the switch to daylight saving time may not be displayed correctly.

7.1.3 Threshold Monitoring Types

Threshold monitoring is supported for Fibre Channel switches and ETERNUS disk array devices.

The threshold monitoring functionality issues an alarm or report when a storage or switch performance value reaches a specified level (the threshold value) under specified conditions during daily transaction operations.

The advantage of threshold monitoring is that early signs of a storage or switch performance drop, caused by changes in data processing or transaction processing rates, can be detected automatically and reliably during daily transaction operations.

By detecting bottleneck locations early, identifying their causes, and improving the device configuration, threshold monitoring helps prevent the adverse effects of performance drops and keeps operations running in the most suitable environment.

The threshold monitoring functionality can manage the following information:

Fibre Channel switch

Port throughput (%)

Port throughput (MB/s) is monitored as a percentage (%) of the maximum transfer capability (MB/s) of the port.
ETERNUS disk array device

Response time (msec) of LUN (OLU)

Average use (busy) rate (%) of RAIDGroup (RLU, LUN_R)

CM load (CPU usage) rate (%)
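As an illustration only, the sketch below expresses port throughput as a percentage of the port's maximum transfer capability and reports which readings have reached their thresholds. The metric keys, the `Threshold` class, and the 200 MB/s capability are hypothetical, and it assumes "reaches" means greater than or equal to; the product's alarm tolerance and rearm settings are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric key, e.g. "port_throughput_pct"
    limit: float  # alarm when the reading reaches this value


def port_throughput_percent(throughput_mb_s: float, max_capability_mb_s: float) -> float:
    """Express port throughput as a percentage of the port's maximum transfer capability."""
    if max_capability_mb_s <= 0:
        raise ValueError("maximum transfer capability must be positive")
    return 100.0 * throughput_mb_s / max_capability_mb_s


def reached(readings: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the metrics whose current reading has reached (>=) its threshold."""
    return [t.metric for t in thresholds
            if t.metric in readings and readings[t.metric] >= t.limit]


if __name__ == "__main__":
    readings = {
        "port_throughput_pct": port_throughput_percent(170.0, 200.0),  # 85.0
        "lun_response_ms": 12.0,
        "cm_load_pct": 40.0,
    }
    limits = [Threshold("port_throughput_pct", 80.0),
              Threshold("lun_response_ms", 30.0),
              Threshold("cm_load_pct", 70.0)]
    print(reached(readings, limits))  # ['port_throughput_pct']
```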

The threshold monitoring functionality provides the Threshold Monitoring Alarm Log and Condition Report windows.

The Threshold Monitoring Alarm Log window lists the threshold monitoring alarms detected on each of the devices being monitored.

The Condition Report window provides the following four windows:

Logical Volume response time error

This is displayed if the Logical Volume response time of a monitored device is found to have reached the state specified in threshold settings. Guidelines for actions to be taken are also displayed.
RAID Group load error

This is displayed if the RAID Group utilization of a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.
CM load error

This is displayed if the CM load ratio of a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.
Port throughput load error

This is displayed if the send/receive usage rate of a port on a monitored device reaches the specified state. Guidelines for actions to be taken are also displayed.

The threshold monitoring functionality always starts when the administrative server starts. For devices for which performance management is specified, it starts threshold monitoring at the same time as it starts obtaining performance information. Performance information is therefore obtained and thresholds are monitored regardless of whether the GUI window is displayed. To end threshold monitoring, stop threshold monitoring from this software product.

7.2 Flow of Performance Management

When a user instructs performance management of a target device from a GUI window, the performance management unit periodically sends SNMP requests over the LAN to the device to obtain performance information, and saves that information as performance data on the administrative server. This software product displays the performance data in the Performance Management window and manages the device.

7.2.1 Checking disk space on the administrative server

Performance monitoring requires sufficient disk space on the administrative server for storing performance data. Make sure that sufficient disk space is available, referring to the Installation Guide. This software product automatically deletes performance data that has exceeded the specified retention period. The default retention period is seven days. This period can be changed; to change the number of days that performance data is stored, refer to "7.7 Definition File".
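A minimal sketch of date-based retention, assuming performance data is grouped by day and that a day's data is deleted once it is more than the retention period older than today. The function name and the exact boundary (whether data exactly seven days old is kept) are assumptions, not documented product behavior.

```python
from datetime import date, timedelta


def expired_dates(data_dates: list[date], today: date, retention_days: int = 7) -> list[date]:
    """Return the dates whose data falls outside the retention period.

    Assumes data dated on or after (today - retention_days) is kept.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in data_dates if d < cutoff)


if __name__ == "__main__":
    days = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 10)]
    print(expired_dates(days, today=date(2024, 1, 10)))  # [datetime.date(2024, 1, 2)]
```

The same check is worth running by hand before restoring old data from a backup, since restored days that are already outside the retention period would be candidates for automatic deletion.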

7.2.2 Instruction for performance management

To display the dialog for setting the monitoring state, click the target device in the SAN view of the resource view, and then select [Device(D)]-[Performance management(S)] from the menu, or right-click the device and select [Performance management] from the popup menu.

ETERNUS disk array performance management settings window

Fibre Channel switch performance management settings window

For ETERNUS disk arrays, enter the minimum and maximum numbers of the Logical Volumes (LUN_V) for which performance information is to be obtained. Limiting this range reduces the load of obtaining performance information, reduces the disk space used to save performance data, and prevents more space than necessary from being allocated for Logical Volumes. It is therefore recommended to specify only the minimum range of Logical Volumes for which performance data is needed.

Do not execute performance monitoring for a single device from more than one administrative server(*) at the same time.
* The Softek Storage Cruiser administrative servers, the ETERNUS SF Storage Cruiser administrative servers or the Systemwalker Resource Coordinator administrative servers
Because performance information is obtained internally in blocks of 64 Logical Volumes, information for neighboring Logical Volumes outside the specified range is also obtained. For example, if a range of 70 to 80 is set for Logical Volume (LUN_V) in the window, information for Logical Volumes 64 to 127 is obtained internally.
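The 64-volume block rounding can be expressed as simple integer arithmetic. This Python sketch reproduces the example above (70 to 80 becomes 64 to 127); how the product handles ranges that span several blocks is an assumption here, and the function name is hypothetical.

```python
BLOCK_SIZE = 64  # Logical Volumes are obtained internally in blocks of 64


def collected_range(first_lun_v: int, last_lun_v: int, block: int = BLOCK_SIZE) -> tuple[int, int]:
    """Widen a requested LUN_V range to the enclosing block boundaries."""
    if first_lun_v < 0 or last_lun_v < first_lun_v:
        raise ValueError("expected 0 <= first_lun_v <= last_lun_v")
    start = (first_lun_v // block) * block           # round down to a block start
    end = (last_lun_v // block + 1) * block - 1      # round up to a block end
    return start, end


if __name__ == "__main__":
    print(collected_range(70, 80))  # (64, 127)
    print(collected_range(60, 70))  # (0, 127): the range crosses a block boundary
```

A range that straddles a multiple of 64 pulls in two whole blocks, so aligning the requested range to block boundaries avoids collecting more data than intended.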

If the device configuration has changed, update the device configuration information maintained by the performance management functionality. For details about the update procedure, refer to "7.2.11 Updating configuration information".

7.2.3 Setting monitoring intervals

In the settings common to ETERNUS disk arrays and Fibre Channel switches, enter the interval at which performance information is obtained. You can specify 5, 10, 30, 60, 300, or 600 seconds. However, the intervals that can be specified depend on the device model and the number of Logical Volumes whose performance is being monitored.

Device model name	Number of Logical Volumes whose performance is monitored	Specifiable interval
ETERNUS4000 (models 80 and 100), ETERNUS3000, ETERNUS2000, ETERNUS GR740/GR820/GR840, ETERNUS GR720/GR730	128 or less	5/10/30/60/300 seconds
	129 to 2,047	30/60/300 seconds
	2,048 or more	60/300 seconds
ETERNUS6000	64 or less	10/30/60/300 seconds
	65 to 2,047	30/60/300 seconds
	2,048 or more	60/300 seconds
ETERNUS8000, ETERNUS4000 (except models 80 and 100)	256 or less	30/60/300/600 seconds
	257 to 1,024	60/300/600 seconds
	1,025 to 8,192	300/600 seconds
	8,193 or more	600 seconds
ETERNUS SN200, ETERNUS SN200 MDS	-	5/10/30/60 seconds
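The interval table above can be encoded as a lookup, for example to check a planned configuration before clicking [Start]. This is a sketch only: the model keys and group names are hypothetical labels for the table rows, and the product performs its own validation in the dialog.

```python
# Hypothetical model keys mapped to the row groups of the interval table.
MODEL_GROUP = {
    "ETERNUS4000 M80": "small", "ETERNUS4000 M100": "small",
    "ETERNUS3000": "small", "ETERNUS2000": "small",
    "GR720": "small", "GR730": "small", "GR740": "small", "GR820": "small", "GR840": "small",
    "ETERNUS6000": "eternus6000",
    "ETERNUS8000": "large", "ETERNUS4000": "large",  # ETERNUS4000 other than models 80/100
    "ETERNUS SN200": "switch", "ETERNUS SN200 MDS": "switch",
}

# Per group: (upper bound on monitored Logical Volumes, or None for no limit; intervals in seconds)
INTERVALS = {
    "small": [(128, (5, 10, 30, 60, 300)), (2047, (30, 60, 300)), (None, (60, 300))],
    "eternus6000": [(64, (10, 30, 60, 300)), (2047, (30, 60, 300)), (None, (60, 300))],
    "large": [(256, (30, 60, 300, 600)), (1024, (60, 300, 600)),
              (8192, (300, 600)), (None, (600,))],
    "switch": [(None, (5, 10, 30, 60))],
}


def specifiable_intervals(model: str, monitored_volumes: int) -> tuple[int, ...]:
    """Return the monitoring intervals (seconds) the table allows for a model and volume count."""
    for upper, intervals in INTERVALS[MODEL_GROUP[model]]:
        if upper is None or monitored_volumes <= upper:
            return intervals
    raise ValueError(f"no interval row for {model!r}")


if __name__ == "__main__":
    print(specifiable_intervals("ETERNUS2000", 129))  # (30, 60, 300)
```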

When LAN traffic is heavy, or when the performance monitoring target device and the administrative server are on different subnets (with a gateway between them), it may not be possible to obtain performance information within the monitoring interval that was set. If performance information cannot be obtained at every monitoring interval, change to a longer monitoring interval.

When you click [Start] in this dialog, an instruction to obtain performance information is issued to the performance management unit (see the figure in "Flow of Performance Management"), and the performance management unit obtains performance information of the device through the LAN and saves it as performance data. Since the performance management unit is started as a daemon of the administrative server, the unit continues obtaining performance information while the administrative server is active, even if no GUI window is displayed.

When collection starts, the logical configuration of the storage device is recognized first, and then performance information collection begins. Obtaining the logical configuration of the selected storage system takes some time (from tens of seconds to several minutes) before any performance information is actually obtained.

When performance monitoring starts, the "P" mark appears in green at the upper left of the device icon on the map display. If the Performance Management window is open and the corresponding Fibre Channel switch and storage system are displayed in the tree, the device name is displayed in the same color as that of the "P" mark.

The table below lists the "P" mark colors, their corresponding statuses, and the actions to be taken. The displayed color may not reflect the current status; click [Refresh] on the GUI window or press the F5 key to check the latest status.

"P" mark color	Status	Appropriate action
Green	Performance is being monitored. (Normal)	Performance is being monitored.
Yellow	Performance monitoring is being recovered (e.g. after a device time-out).	The administrative server cannot communicate with the device. Check the network status and the device status. If you are logged in to ETERNUSmgr/GRmgr, log off.
Red	Writing to the performance information file failed.	Check the write permission for the file and the free capacity of the file system.
	The registered GR password is incorrect.	Reregister the device with this software product, and restart performance monitoring.
	Internal error	Contact a FUJITSU maintenance engineer.

7.2.4 Starting the performance management window

Start the Performance Management window to display performance information. To open the Performance Management window, select [File] > [Performance Management Window] from the GUI menu, or right-click and select [Performance Management Window] from the popup menu.

7.2.5 Displaying performance information of the selected device

You can display the performance information of a device in the Performance Management window by dragging a device icon displayed in the resource view and dropping it in the Performance Management window. You can create multiple Performance Management windows. Also, multiple device icons can be dropped in a single Performance Management window to display information about those devices.

7.2.6 Displaying Fibre Channel switch performance information

From the device tree view in the Performance Management window, select and right-click the port number of the Fibre Channel switch whose performance information you want to display, and then select [Show Performance Graph] from the resulting popup menu.

The dialog shown below appears. In the dialog, select the items to be displayed.

Time Axis	Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.
Date Specification	Specify the date and time to be displayed at the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.
Throughput	Displays the data transfer rate (MB/s). Select from the following three information types: (1) a graph window for the send-side performance of the port, (2) a graph window for the receive-side performance, or (3) one graph window that displays send and receive performance at the same time. Send and receive can both be selected, but if combined send/receive is selected, neither the individual send performance nor the individual receive performance can be selected.
Error	Displays a graph of the number of CRC errors that occur on the port.

The window shown below displays a result of selections in the above dialog. One-hour Graph windows of send throughput and receive throughput are displayed. Using the graphs, you can determine the operating status of the port. See "B.10.5 Graph window functions".

To display performance information of multiple ports at the same time, click multiple ports in the tree while holding down the Ctrl or Shift key, and then right-click a selected port to display a graph.

In this case, "Open Window for Every Port" is added to the dialog displayed immediately before a graph is displayed. If you check the check box, one window for each port opens.

If you do not check the check box, you can select "Total of Throughput" in the dialog. If you select it, the total of the values of the selected ports is displayed in one graph. Otherwise, the values of each port are displayed in the same graph window. Note that if "Send/Receive" is selected, "Total of Throughput" must be selected.
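Conceptually, "Total of Throughput" sums the selected ports' values at each sample time, while leaving it unchecked plots each port as its own line. The sketch below assumes the per-port series are already aligned to the same sample times; the function name and port labels are hypothetical.

```python
def total_series(per_port: dict[str, list[float]]) -> list[float]:
    """Sum per-port throughput samples point by point (the "Total of Throughput" view)."""
    if not per_port:
        raise ValueError("no ports selected")
    lengths = {len(series) for series in per_port.values()}
    if len(lengths) != 1:
        raise ValueError("series must cover the same sample times")
    # zip(*...) walks the ports' samples together, one sample time at a time.
    return [sum(values) for values in zip(*per_port.values())]


if __name__ == "__main__":
    print(total_series({"port 1": [1.0, 2.0], "port 14": [10.0, 20.0]}))  # [11.0, 22.0]
```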

The window shown below is an example where neither "Open Window for Every Port" nor "Total of Throughput" is selected. To check which line in the graph corresponds to which port, select the button for a line in the graph. In this example, port 14 is heavily used.

7.2.7 Displaying storage performance information

When the ETERNUS disk array device icon is dragged and dropped into the Performance Management window, the logical configuration tree of the storage system is displayed as shown below.

"AffinityGroup" indicates a zone number used by the zoning functionality of the selected storage system.

"LUN" indicates a logical unit number as seen from the server node. Each LUN is allocated to a Logical Volume (OLU and LUN_V) whose number is unique within the device, so it is shown in the tree as "LUN X(Logical Volume X)".

"RAID Group" under "LUN" indicates the RAID Group (rank) that contains the LUN. [Disk] (= physical drive) under [RAIDGroup] or [RAIDGroup [X- X]] indicates a drive that makes up the rank. [LogicalVolume] under [RAIDGroup] or [RAIDGroup [X- X]] indicates the numbers of the other LogicalVolumes that belong to the same RAIDGroup. [RAIDGroup X- X] also contains devices that are not shown in the tree.

The properties are displayed as tool tips. For details about items that can be checked in these tool tips, refer to "B.10.3 The tree view".

Numbers beginning with "0x" are hexadecimal values. All other numbers are decimal.

7.2.7.1 Displaying LUN and RAIDGroup performance information

From the device tree in the Performance Management window, select the number of the LUN or RAID Group whose performance information you want to display, right-click to display a popup menu, and select [Show Performance Graph].

You can select multiple numbers. To do so, click the LUNs or RAID Groups while holding down the Ctrl or Shift key, right-click, and select [Show Performance Graph].

The dialog shown below appears. In the dialog, select the graph window to be displayed.

Time Axis	Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.
Date Specification	Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.
IOPS	Indicates how many times I/O is issued per second.
Throughput	Displays a data transfer rate (MB/S).
Response time	Displays an average I/O processing time (ms).
Cache hit rate	Displays a ratio (%) at which cache is hit.

When RAID Consolidation is performed on an ETERNUS6000, the RAIDGroup response time is not displayed.
Performance information cannot be displayed for LogicalVolumes (LUNs) created by LUN Concatenation on an ETERNUS disk array device, or for RAIDGroups that contain such LogicalVolumes.
Performance information is not supported for ETERNUS disk array mainframe volumes or for MVVs/SDVs. Performance information values for RAID Groups that contain SDVs also cannot be guaranteed.

* For IOPS, throughput, and response time, one of the following three can be selected: (1) a Read graph window, (2) a Write graph window, or (3) one graph window that displays Read and Write information (R/W) at the same time. Read and Write can both be selected, but if R/W is selected, the individual Read and Write graph windows cannot be selected.

* For the cache hit rate, one of the following four can be selected: (1) a Read hit rate graph window, (2) a Write hit rate graph window, (3) a pre-fetch hit rate graph window, or (4) one graph window that displays the Read, Write, and pre-fetch hit rates (R/W/P) at the same time. Read, Write, and pre-fetch can be selected at the same time, but if R/W/P is selected, the individual Read, Write, and pre-fetch graph windows cannot be selected.

If multiple logical units are selected for display in a graph, "Open Window for Every LUN" is displayed in the dialog. Select it to open one graph window for each LUN.

If it is not selected, "Total" is displayed in the dialog. If you select "Total", a graph of the total values is displayed. Otherwise, the information for the multiple units is displayed in the same graph window. If "R/W/P" or "R/W" is selected, "Total" must be selected.

7.2.7.2 Displaying disk (physical drive) performance information

From the device tree view in the Performance Management window, select the number of the disk whose performance you want to display, right-click to display a popup menu, and select [Show Performance Graph].

You can select multiple disks. To select multiple disks, click multiple disks while holding down the Ctrl or Shift key, right-click and select [Show Performance Graph].

The dialog shown below appears. In the dialog, select the graph window to be displayed.

Time Axis	Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.
Date Specification	Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.

If multiple disks are selected for display in a graph, "Open Window for Every Disk" is displayed in the dialog. If you select it, one graph window opens for each disk. Otherwise, the information for the multiple disks is displayed in the same graph window.

7.2.7.3 Module performance view

To display the performance graph, select the module (CM, CA, CM Port, or DA) in the Performance Management window, right-click to display the popup menu, and then select [Show Performance Graph]. Multiple modules can be selected by holding down the [Ctrl] or [Shift] key while clicking the modules. When DA is selected, the DA Performance Graph dialog shown below is displayed; when CA or CM Port is selected, the corresponding performance graph dialog is displayed.

From the performance graph dialog, select the options for the particular graph you wish to be displayed in the graph window.

Time Axis	Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.
Date Specification	Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.
CPU	Displays the CPU usage (%) of the DA or CA.
IOPS	Displays the number of I/Os issued per second by the DA, CA port, or CM Port.
Throughput	Displays the data transfer rate (MB/s) of the DA, CA port, or CM Port.

CAs on the ETERNUS8000 and ETERNUS4000 (except models 80 and 100) do not use CPU resources; therefore, CPU is not displayed on their CA performance graphs.

When CM is selected, the CM Performance Graph dialog shown below is displayed.

Select the graph window to be displayed in this dialog.

Time Axis	Select the time interval for a graph that you want to display. Select one hour, one day, or one week. Refer to "7.1.2 Performance Graph Window Types" for details.
Date Specification	Specify the date and time to be displayed in the center of the graph. The current time is displayed. You may select the date and time of a graph that you want to display. A period of up to 7 days can be specified.
Graph	Load	Displays the CPU usage (%) of CM module.
	Copy Residual Quantity	Displays the remaining copy volume (GB) of advanced copy (EC/OPC). When both EC and OPC are operating, a total of the remaining copy volumes of EC and OPC is displayed.

For the ETERNUS3000 (except model 50) and the ETERNUS GR series (GR720 or higher), the CM Performance Graph dialog box does not display Copy Residual Quantity data.

When multiple modules are selected, selecting "Open Window for Every Port" or "Open Window for Every CM" in the dialog displays a separate graph window for each module.

7.2.8 Operating graph windows

If the amount of performance data is large (in particular, when an ETERNUS disk array device RAIDGroup or multiple items are selected), or if the LAN load is heavy, it may take a long time to display a graph after the [Previous Hour] or [Next Hour] button is clicked. In such cases, right-click the graph window to open the popup menu, which contains a command for opening a graph window with a different time range. Use it to open the One-day Graph window, which displays hour-by-hour information. Then move the cursor to the time that you want to check, right-click to display the popup menu, and select [One-hour Graph Window] to move directly to a graph centered on that time.

To display the maximum value graph, click the [Peak] button in the One-day Graph window or the One-week Graph window. You can then move the cursor to the time of the maximum value, right-click to display the popup menu, and move directly to a graph centered on that time, in the same way as described above.
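The [Peak] navigation amounts to finding the time of the maximum value and centering a one-hour window on it. A hypothetical Python sketch, assuming samples are (epoch-seconds, value) pairs; it is not the product's implementation.

```python
def peak_and_window(samples: list[tuple[int, float]], span_s: int = 3600) -> tuple[int, int, int]:
    """Find the sample with the maximum value and a span_s-second window centered on it.

    Returns (peak_time, window_start, window_end) in epoch seconds.
    Ties resolve to the earliest sample, because max() keeps the first maximum it sees.
    """
    if not samples:
        raise ValueError("no samples")
    peak_time, _ = max(samples, key=lambda s: s[1])
    half = span_s // 2
    return peak_time, peak_time - half, peak_time + half


if __name__ == "__main__":
    hourly = [(36000, 1.0), (36600, 9.0), (37200, 3.0)]
    print(peak_and_window(hourly))  # (36600, 34800, 38400)
```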

For details, refer to "B.10.5 Graph window functions".

7.2.9 Examples of use of performance management

If an I/O delay occurs between a server node and a storage system, you can look for the cause in the storage system by using the methods described below. These are only examples; not every cause of an I/O delay can be determined with them.

Identify the time when the I/O processing delay occurred and the access path where the delay occurred.
Use this product to check the AffinityGroup number and LUN number of the ETERNUS disk array defined in the target access path.
Using performance management, display and check the target LUN performance values.
If the LUN response time is long, check the RAID Group performance. If the RAID Group response time is also long, find the other Logical Volumes that belong to the RAID Group and the LUNs to which those Logical Volumes are allocated. Check the I/O statuses of these LUNs to determine whether the RAID Group is heavily loaded. If it is, move the appropriate LUN to another RAID Group, or take other appropriate action.
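The last step above can be sketched as a lookup over the LUN-to-RAID-Group mapping shown in the device tree. In this hypothetical example, the `LunMapping` fields mirror the tree (AffinityGroup, LUN, Logical Volume, RAID Group) and IOPS values are assumed to be keyed by Logical Volume number, which is unique within the device; the product itself provides no such API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LunMapping:
    affinity_group: int  # AffinityGroup number seen from the server
    lun: int             # LUN within that AffinityGroup
    logical_volume: int  # Logical Volume number, unique within the device
    raid_group: int      # RAID Group (rank) containing the Logical Volume


def busiest_neighbors(target: LunMapping, mappings: list[LunMapping],
                      iops_by_volume: dict[int, float]) -> list[LunMapping]:
    """List the other LUNs in the target's RAID Group, highest IOPS first."""
    same_group = [m for m in mappings
                  if m.raid_group == target.raid_group
                  and m.logical_volume != target.logical_volume]
    return sorted(same_group,
                  key=lambda m: iops_by_volume.get(m.logical_volume, 0.0),
                  reverse=True)


if __name__ == "__main__":
    slow = LunMapping(0, 0, 10, 1)
    maps = [slow, LunMapping(0, 1, 11, 1), LunMapping(1, 0, 12, 1), LunMapping(0, 2, 20, 2)]
    iops = {10: 500.0, 11: 1200.0, 12: 300.0, 20: 5000.0}
    for m in busiest_neighbors(slow, maps, iops):
        print(m.affinity_group, m.lun, m.logical_volume, iops[m.logical_volume])
```

Keying by Logical Volume rather than LUN avoids confusing identical LUN numbers that appear in different AffinityGroups.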

7.2.10 Instruction for stopping performance management

Click the target device on the GUI window. Select [Device]-[Performance management(S)] from the menu, or right-click and select [Performance management] from the resulting popup menu. Then, select [Stop] in the window for setting the monitoring status.

7.2.11 Updating configuration information

Device configuration information is independently maintained in the performance management functionality.

To change the configuration of a device, follow the procedure below so that the device configuration information maintained by the performance management functionality is also updated. This update is required whenever the configuration of a device used for performance monitoring or threshold monitoring has been changed.

Until the update is performed, performance monitoring and threshold monitoring continue to use the configuration information from before the change, so performance information and threshold monitoring results cannot be guaranteed.

1. Record the performance monitoring settings (if performance monitoring is used).

   <Settings to record>
   - Interval for obtaining performance information (in seconds)
   - Performance monitoring targets (minimum LUN_V, maximum LUN_V)
2. Record the threshold monitoring settings (if threshold monitoring is used).

   <Settings to record>
   - Threshold monitoring time settings
     - Threshold monitoring time (start time, stop time)
     - Alarm display time (start time, stop time)
     - Alarm display frequency
   - Threshold monitoring settings
     - Target
     - Threshold
     - Threshold monitoring interval
     - Alarm tolerance level
     - Alarm tolerance time
     - Rearm
3. Stop threshold monitoring (if threshold monitoring is used).

   Refer to "7.3.7 Instruction for stopping threshold monitoring".
4. Stop performance monitoring (if performance monitoring is used).

   Refer to "7.2.10 Instruction for stopping performance management".
5. Change the configuration of the device.
6. In the menu bar of the Performance Management window, click [Device] > [Create Device Configuration].
7. Start performance monitoring using the settings recorded in step 1 (if performance monitoring is used).

   Refer to "7.2.2 Instruction for performance management" and "7.2.3 Setting monitoring intervals".
8. Start threshold monitoring using the settings recorded in step 2 (if threshold monitoring is used).

   Refer to "7.3.3 Setting the threshold monitoring hours" and "7.3.4 Setting the threshold monitoring information".

7.2.12 Performance data

Performance data is saved in CSV files in the following directory of the administrative server:

[Solaris OS version of Manager] /var/opt/FJSVssmgr/current/perf/

[Linux version of Manager] /var/opt/FJSVssmgr/current/perf/

[Windows version of Manager] administrative-server-work-directory\Manager\var\opt\FJSVssmgr\current\perf

You can back up these files by saving the entire directory as necessary, and you can display old information by restoring it in the same format.

Note, however, that performance data is subject to automatic deletion. Before restoring data, check the data retention period, because restored data older than the retention period may be deleted automatically. For the data retention period, refer to "D.4 perf.conf Parameter".

Example: (Solaris OS)

Backup
# cd /var/opt/FJSVssmgr/current/perf/
# tar -hcf - csv | compress -c > csv.backup.tar.Z
Restoration
# cd /var/opt/FJSVssmgr/current/perf/
# uncompress -c /var/opt/FJSVssmgr/current/perf/csv.backup.tar.Z | tar -xvf -
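On a Linux version of Manager, which keeps performance data in the same directory, an equivalent backup and restore could use GNU tar with gzip instead of compress. The following is a minimal sketch under that assumption; perf_backup, perf_restore, and the PERF_DIR override are hypothetical names, not part of this product. Give an absolute archive path, because each function changes to PERF_DIR before running tar.

```shell
# Sketch for a Linux Manager, assuming GNU tar with gzip support.
# PERF_DIR defaults to the documented location; it can be overridden.
PERF_DIR=${PERF_DIR:-/var/opt/FJSVssmgr/current/perf}

# Archive $PERF_DIR/csv into the file given as $1 (use an absolute path).
perf_backup() {
    # -h follows symbolic links, as in the Solaris example
    (cd "$PERF_DIR" && tar -chzf "$1" csv)
}

# Extract the archive given as $1 into $PERF_DIR, recreating ./csv.
perf_restore() {
    (cd "$PERF_DIR" && tar -xzf "$1")
}
```

As with the Solaris example, check the data retention period before calling perf_restore.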

7.3 Flow of Threshold Monitoring

When a user instructs threshold monitoring from the Performance Management window for a device under performance management, the performance management unit of this product periodically queries the device over the LAN to obtain its performance information. The threshold monitoring unit then analyzes this performance information sequentially.

If a problem is detected in the performance information, it is reported as an alarm in the event log of the SAN Management window and in the Threshold Monitoring Alarm Notification Log window of the Performance Management window.

When a problem is displayed as an alarm in the event log, the event level is shown as [Warning].

7.3.1 Checking disk space on the administrative server

For threshold monitoring, disk space must be available on the administrative server to save the condition reports of the threshold monitoring alarm notification log. About 4 MB is required; make sure that the server has sufficient free space. This product automatically deletes condition report data that has exceeded the configured retention period. By default, data stored for 366 days or more is deleted. Change this retention period as required. For details, see Alarm Delete Setting in "B.10.6 Threshold monitoring dialog functionality".
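Before enabling threshold monitoring, the free space can be checked with the POSIX df command. This is a sketch only: has_free_kb is a hypothetical helper, and checking /var/opt/FJSVssmgr assumes that the condition reports are kept on the same file system as the Manager's other data under that directory, which this section does not state. On a Windows version of Manager, check the drive holding the administrative server work directory instead.

```shell
# Hypothetical helper: succeed if the file system holding directory $1
# has at least $2 KB available. Uses POSIX df output (-P) in 1 KB units (-k).
has_free_kb() {
    avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$2" ]
}

# About 4 MB (4096 KB) is required for the alarm notification log.
if has_free_kb /var/opt/FJSVssmgr 4096; then
    echo "enough space"
else
    echo "less than 4 MB available (or directory not found)"
fi
```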

7.3.2 Instruction for threshold management

In the tree of the Performance Management window, select the node for the device name, and then select [Threshold Monitoring] from the menu. The threshold monitoring setup menus appear. To use these menus, however, performance information for the device must already have been obtained by performance management.

Select [Monitoring Enable/Disable(E)]. This enables threshold monitoring so that the various threshold monitoring settings can be configured.

7.3.3 Setting the threshold monitoring hours

To set a time period for threshold monitoring, select [Monitoring Time Setting]. If no time period is set, thresholds are monitored and alarms are reported at all hours, and depending on the threshold settings, a large volume of threshold monitoring alarm logs may be reported. Setting a time period is recommended when performance is a concern in a system environment where the load varies considerably depending on how the target device is used.

7.3.4 Setting the threshold monitoring information

Next, select [Threshold Setting/Start Monitoring/Stop Monitoring(S)], define the threshold setting information for the target device, and instruct the product to start monitoring. Threshold monitoring then starts. Because the threshold monitoring unit runs as a daemon of the administrative server together with the performance management unit, threshold monitoring continues while the administrative server is active, even if no GUI window is displayed. For details on the threshold setting information, refer to "B.10.6 Threshold monitoring dialog functionality".

7.3.5 Displaying threshold monitoring alarm logs

To display the threshold monitoring alarm logs, open the Performance Management window, select [Threshold monitoring] from the menu bar, and then select [Thresholds Alarm Log]. A list of the alarm logs detected by threshold monitoring is displayed. To open the Performance Management window, select [File(F)]-[Performance Management Window(S)] from the GUI menu.

7.3.6 Displaying condition reports

A condition report shows the details of an entry in the list displayed by [Thresholds Alarm Log]. From the report, users can determine the appropriate actions and guidelines for each threshold monitoring alarm. To display the details, double-click the line of the threshold monitoring alarm log entry to be referred to.

7.3.7 Instruction for stopping threshold monitoring

From the Performance Management window menu in the GUI window, select [Threshold Monitoring(T)]-[Threshold Setting/Start Monitoring/Stop Monitoring(S)]. To end threshold monitoring, click the [Stop] button in the window displayed for setting a threshold.

7.4 Evaluation Criteria for Thresholds in Threshold Monitoring

Storage thresholds

Standard storage thresholds are listed below.

	Online response-oriented system	Batch throughput-oriented system
LogicalVolume Response	30 ms or less	-
RAIDGroup Busy Rate	60% or less	80% or less
CM Busy Rate	80% or less	90% or less

For a response-oriented system such as an online application, keeping LogicalVolume responses within 30 ms is the standard for stress-free storage operation. To achieve responses within 30 ms, keep the RAIDGroup busy rate at 60% or less and the CM busy rate at 80% or less.

In a throughput-oriented system such as a batch application, LogicalVolume responses can be as short as a few milliseconds because sequential access increases the cache hit ratio. However, the cache hit ratio depends heavily on how the application accesses data and varies widely, so responses may range from a few milliseconds to 50 ms or more. For this reason, there is no standard threshold for LogicalVolume responses in batch applications.

To improve throughput for batch applications, storage resources must be used as fully as possible. Note, however, that performance may deteriorate rapidly once the thresholds above are exceeded. As a standard, therefore, keep the RAIDGroup busy rate at 80% or less and the CM busy rate at 90% or less.

While advanced copy is running, the copy processing itself increases the CM busy rate.
In this case, set thresholds that take the advanced copy load into account.
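The standard values in the storage threshold table above can be applied mechanically to a set of measured values, for example when reviewing graphs. The following sketch is illustrative only; over_standard and its arguments are hypothetical, and it accepts integer values because it uses shell arithmetic.

```shell
# Hypothetical helper: list the metrics that are above the standard
# storage thresholds of section 7.4. Integer inputs only.
#   $1  profile: "online" (response-oriented) or "batch" (throughput-oriented)
#   $2  LogicalVolume response (ms)
#   $3  RAIDGroup busy rate (%)
#   $4  CM busy rate (%)
over_standard() {
    profile=$1 resp=$2 rg_busy=$3 cm_busy=$4
    case $profile in
        online) max_resp=30 max_rg=60 max_cm=80 ;;
        batch)  max_resp="" max_rg=80 max_cm=90 ;;  # no response standard for batch
        *) echo "unknown profile: $profile" >&2; return 2 ;;
    esac
    if [ -n "$max_resp" ] && [ "$resp" -gt "$max_resp" ]; then
        echo "LogicalVolume Response"
    fi
    if [ "$rg_busy" -gt "$max_rg" ]; then
        echo "RAIDGroup Busy Rate"
    fi
    if [ "$cm_busy" -gt "$max_cm" ]; then
        echo "CM Busy Rate"
    fi
}
```

For example, `over_standard batch 45 75 92` reports only "CM Busy Rate", because batch systems have no response standard and 75% is within the 80% RAIDGroup standard.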

Thresholds of switch ports

The threshold and the lower limit value depend on the switch type (1 Gbps or 2 Gbps). For a 1 Gbps switch, the maximum value (100% utilization) is 100 MBps; for a 2 Gbps switch, it is 200 MBps. Specify the threshold as the allowable utilization (%) of this maximum value.

For instance, when 90% is specified for the threshold of a 2 Gbps type switch, the threshold of the throughput is:
200 MBps x 90% = 180 MBps (total of reception and transmission)
An alarm is given when the throughput reaches 180 MBps.

Switch port threshold monitoring is effective when the number of paths differs between the server side and the storage side.
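The arithmetic in the example above can be expressed as a small function. This is a sketch; port_threshold_mbps is a hypothetical name, only the 1 Gbps and 2 Gbps switch types described in this section are handled, and shell arithmetic truncates fractional results.

```shell
# Hypothetical helper: convert a utilization threshold (%) into a
# throughput threshold in MBps (send + receive) for a switch port.
#   $1  switch type in Gbps (1 or 2)
#   $2  allowed utilization in percent
port_threshold_mbps() {
    case $1 in
        1) max_mbps=100 ;;   # 1 Gbps switch: 100% = 100 MBps
        2) max_mbps=200 ;;   # 2 Gbps switch: 100% = 200 MBps
        *) echo "unsupported switch type: $1 Gbps" >&2; return 1 ;;
    esac
    echo $(( max_mbps * $2 / 100 ))
}

port_threshold_mbps 2 90   # prints 180
```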

7.5 Examples of Threshold Monitoring

This section provides an overview of threshold monitoring in the form of key examples to enable users to determine what situations require thresholds and the types of thresholds that should be set for them.

Case 1: Online application system at company A

+Material 1: System operation standard and performance requirements (excerpt)

1. Online application service hours: 8:00 to 18:00 every day
2. Online application busy hours: 12:00 to 15:00 every day
3. This system requires that operation at the operator terminals be stress-free even under the workload of the busy hours.
Therefore, the target I/O response performance shall be "30 ms or less," which is a general standard.
The target I/O response performance outside the busy hours shall be "10 ms or less," one third of 30 ms, in proportion to the workload (the workload during the busy hours is about three times that of other hours).
4. During the busy hours, data reference, update, and addition processing may run concurrently for up to 60 minutes.
If I/O responses of 30 ms or more occur for a total of 10% (6 minutes) of this 60-minute period, operation at the operator terminals may become stressful. Therefore, configure the settings so that an alarm log is generated when this occurs.
5. If I/O responses during the busy hours fall to 10 ms or less, the same as the performance target for other hours, the I/O response delays that occurred earlier shall be regarded as transient.
Therefore, no alarm log needs to be generated in this case.
6. The event log need not be displayed every time an alarm log is generated; displaying it once a day is sufficient.
(This is because the system administrator checks the condition report once a day.)

+Illustration of operational status of company A's online application system (transition of LogicalVolume responses)

+An example of threshold monitoring setting for company A's online application system is shown below:

Number corresponding to material 1	Setting item	Setting
1	Threshold Monitoring Time	8:00-18:00
2	Alarm Display Time	12:00-15:00
3	Target	LogicalVolume Response
3	Threshold	30 ms
4	Threshold monitoring Interval	60 minutes
4	Alarm Tolerance Level	Total time: 360 seconds
5	Rearm	10 ms
6	Alarm Display Frequency	Day by day
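The Case 1 settings can be read as the following rule: within each 60-minute threshold monitoring interval, add up the time during which the LogicalVolume response is 30 ms or more, raise an alarm when the total reaches 360 seconds, and discard the accumulated time when the response falls to the rearm value of 10 ms or less. The sketch below models that reading only; it is not the product's implementation. The function name, the input format (one integer response time per sample), and the assumption that every sample represents a fixed number of seconds are all hypothetical, and the monitoring and alarm display hours are not modeled.

```shell
# Hypothetical model of a "total time" alarm tolerance with rearm (Case 1).
# stdin: one integer LogicalVolume response time (ms) per sample.
#   $1 threshold (ms)        $2 rearm value (ms)
#   $3 tolerance, total (s)  $4 threshold monitoring interval (s)
#   $5 seconds covered by one sample
# Prints the 1-based sample numbers at which an alarm would be raised.
total_time_alarms() {
    thr=$1 rearm=$2 tol=$3 interval=$4 step=$5
    n=0 over=0 fired=0
    while read -r ms; do
        n=$((n + 1))
        # A new monitoring interval starts: discard the accumulated time.
        if [ $(( (n - 1) * step % interval )) -eq 0 ]; then
            over=0 fired=0
        fi
        if [ "$ms" -le "$rearm" ]; then
            # Rearm: earlier delays are treated as transient.
            over=0 fired=0
        elif [ "$ms" -ge "$thr" ]; then
            over=$((over + step))
        fi
        if [ "$fired" -eq 0 ] && [ "$over" -ge "$tol" ]; then
            echo "$n"
            fired=1
        fi
    done
}

# Case 1 values, assuming one sample per minute:
printf '35\n35\n35\n35\n35\n35\n' | total_time_alarms 30 10 360 3600 60   # prints 6
```

Because the tolerance is a total, samples between the rearm value and the threshold (for example 20 ms) neither add time nor reset it in this model.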

Case 2: Online shopping system of company B

+Material 2: System operation standard and performance requirements (excerpt)

1. Online application service hours: 24 hours a day, 365 days a year
2. Online application busy hours: Cannot be specified.
3. In this system, the number of accesses increases gradually as the number of member customers grows after the start of production operation, so the load on storage is also expected to increase gradually. Measures must be taken when the busy rate of a storage resource (CM or disk) exceeds 60% to 80%.
4. This system executes credit card transactions every 5 minutes. Therefore, during the five minutes immediately before each transaction, product retrieval and order processing must run without stress. If a storage resource remains busy (a busy rate above 60% to 80%) for five minutes, transactions may be affected. Therefore, configure the settings so that an alarm log is generated when this occurs.
5. The event log shall be displayed every time an alarm log is generated. The system administrator checks the condition report whenever an event log is displayed.

+Illustration of operational status of company B's online shopping system (transition of CM busy rate)

+An example of threshold monitoring setting for the company B's online shopping system is shown below:

Number corresponding to material 2	Setting item	Setting
1	Threshold Monitoring Time	0:00-24:00
2	Alarm Display Time	0:00-24:00
3	Target	CM Busy Rate
3	Threshold	60%
4	Alarm Tolerance Level	Continuous time: 300 seconds
5	Alarm Display Frequency	All

Number corresponding to material 2	Setting item	Setting
1	Threshold Monitoring Time	0:00-24:00
2	Alarm Display Time	0:00-24:00
3	Target	RAIDGroup Busy Rate
3	Threshold	80%
4	Alarm Tolerance Level	Continuous time: 300 seconds
5	Alarm Display Frequency	All
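For Case 2, the alarm tolerance level is a continuous time rather than a total: the busy rate must stay above the threshold for 300 seconds without interruption. A minimal sketch of that reading follows. continuous_alarms and its input format are hypothetical, each sample is assumed to cover a fixed number of seconds, and the sketch reports an alarm once per uninterrupted run; how the product behaves while a run continues beyond the tolerance is not described here.

```shell
# Hypothetical model of a "continuous time" alarm tolerance (Case 2).
# stdin: one integer busy rate (%) per sample.
#   $1 threshold (%)   $2 tolerance, continuous (s)   $3 seconds per sample
# Prints the 1-based sample number at which each uninterrupted run above
# the threshold first reaches the tolerance.
continuous_alarms() {
    thr=$1 tol=$2 step=$3
    n=0 run=0
    while read -r busy; do
        n=$((n + 1))
        if [ "$busy" -gt "$thr" ]; then
            run=$((run + step))
            # Report only when this sample makes the run reach the tolerance.
            if [ "$run" -ge "$tol" ] && [ $((run - step)) -lt "$tol" ]; then
                echo "$n"
            fi
        else
            run=0   # any sample at or below the threshold breaks the run
        fi
    done
}

# CM Busy Rate, 60% for 300 s, assuming one sample per minute:
printf '65\n70\n72\n68\n66\n50\n' | continuous_alarms 60 300 60   # prints 5
```

The same function applies to the RAIDGroup Busy Rate setting by passing 80 as the threshold.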

Case 3: Batch processing with multiple database servers (cluster system) of company C

+Material 3: System operation standard and performance requirements

1. System service hours: 24 hours a day, 365 days a year
2. Batch processing hours: 20:00 to 23:00 every night
3. This cluster system is an Oracle RAC system consisting of three nodes. Batch processing performance is currently not a problem because the amount of processed data is small. However, as the amount of data grows, bottlenecks in FC path transfer between the FC switch and the storage are a concern.
4. If an FC path bottleneck occurs, it must be eliminated quickly.
Regard a port throughput of about 80% of the maximum transfer capability as an FC path bottleneck, and configure the settings so that an alarm log is generated when this state continues for 30 minutes or more.
5. The event log need not be displayed every time an alarm log is generated; it can be displayed only once per batch processing period, even if alarm logs are generated more than once during that period. The system administrator checks the condition report when an event log is displayed.

+Illustration of batch processing with multiple database servers (cluster system) at company C (transition of port throughputs)

+An example of threshold monitoring settings for batch processing with multiple database servers at company C is shown below:

Number corresponding to material 3	Setting item	Setting
1	Threshold Monitoring Time	0:00-24:00
2	Alarm Display Time	20:00-23:00
3	Target	Port Throughput
3	Threshold	80%
4	Alarm Tolerance Level	Continuous time: 1,800 seconds
5	Alarm Display Frequency	Every monitoring time
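In Case 3, the alarm display frequency limits the event log to one display per alarm display time (20:00-23:00). The filter below sketches that behavior on a list of alarm timestamps. It is illustrative only: first_alarm_per_window and the "YYYY-MM-DD HH:MM" input format are hypothetical, times must be zero-padded so that text comparison works, and windows that cross midnight are not handled.

```shell
# Hypothetical display filter for Case 3 ("Every monitoring time").
# stdin: alarm times "YYYY-MM-DD HH:MM", oldest first, zero-padded.
#   $1 window start "HH:MM" (inclusive)   $2 window stop "HH:MM" (exclusive)
# Prints only the first alarm of each day that falls inside the window.
first_alarm_per_window() {
    awk -v start="$1" -v stop="$2" '
        # $1 is the date, $2 the time; zero-padded HH:MM compares correctly as text
        $2 >= start && $2 < stop && !seen[$1]++ { print }
    '
}
```

For example, alarms at 20:10 and 21:40 on the same day produce a single line for 20:10, and an alarm at 23:10 is not shown at all.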

7.6 Condition Report and Corrective Measures for Problems

7.6.1 Delay in LogicalVolume response

Report detail:
A delay in response of LogicalVolume YYYY defined in RAIDGroup XXXX was detected.

The monitoring states of other presumably related targets are as follows:
<Monitoring state of each CM> <- (1)
[CM0x00] High load state detected
[CM0x01] Not detected
[CM0x10] Not detected
[CM0x11] Not detected

RAIDGroup XXXX monitoring state: High load state detected <- (2)

The block size of I/O in which a response delay was detected is 8K bytes. <- (3)

Related graph:
Refer to the LogicalVolume YYYY response time graph.

Guidelines for corrective measure:
1. RAIDGroup is probably in the high load state. Check the alarm of the RAIDGroup busy rate (disk utilization) and refer to the guidelines for corrective measures.
2. The CM is probably in the high load state. Check the alarm of the CM busy rate and refer to the guidelines for corrective measures.
3. It is assumed that I/O processing takes time because the block size is large. Review the threshold.

(1)

Indicates the state of each CM during the time period in which the LogicalVolume response delay was detected.

High load state detected	An alarm log indicating a CM load error was generated during the same time period. If the CM in charge of the LogicalVolume with the response delay is in the high load state, a response delay due to a CM bottleneck is likely. Refer to the guidelines for corrective measures for alarms of that CM.
Monitoring	The threshold has been exceeded several times, but no alarm has been raised yet; monitoring for alarm detection continues.
Not detected	[When the CM busy rate is defined as a monitoring target] No CM bottleneck occurred during the same time period. Check the RAIDGroup in which the LogicalVolume is defined for alarms.
Not detected	[When the CM busy rate is not defined as a monitoring target] The CM busy rate is not monitored. ("Not detected" is displayed irrespective of the CM busy rate.)

(2)

Indicates the state of the RAIDGroup in which the relevant LogicalVolume is defined, during the time period in which the LogicalVolume response delay was detected.

High load state detected	An alarm log indicating a RAIDGroup load error was generated during the same time period. A response delay due to a bottleneck in the disks that make up the RAID is likely. Refer to the guidelines for RAIDGroup corrective measures for alarms of that RAIDGroup.
Monitoring	The RAIDGroup threshold has been exceeded several times, but no alarm has been raised yet; monitoring for alarm detection continues.
Not detected	[When the RAIDGroup busy rate is defined as a monitoring target] No bottleneck occurred in the disks that make up the RAID during the same time period.
Not detected	[When the RAIDGroup busy rate is not defined as a monitoring target] The RAIDGroup busy rate is not monitored. ("Not detected" is displayed irrespective of the RAIDGroup busy rate.)

(3)

Indicates the I/O block size at the time of detection of a LogicalVolume delay response.
If neither the CM in charge of the relevant LogicalVolume nor the RAIDGroup in which it is defined is in the high load state, the alarm is not attributable to a CM or disk bottleneck and may instead be attributable to an I/O block size that is too large for the threshold.

For example, suppose a response delay is detected while the LogicalVolume response threshold is set to 30 ms, no CM or disk bottleneck has occurred, and the I/O block size is 512K bytes. In this case, the most likely cause of the response delay is the large I/O block size. Generally, the larger the I/O block size, the longer the response time. When the standard response is set to 30 ms, an I/O block size of 512K bytes or more is a rough indication that the delay is caused by the I/O block size.
Take measures such as reviewing the threshold of the LogicalVolume response or reducing the I/O block size in the application.
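The guidelines in (1) to (3) amount to a simple decision order: check the RAIDGroup and CM states first, and suspect the I/O block size only when neither is in the high load state. A sketch of that order follows; the function name and the state keyword "high" (for "High load state detected"; any other value is treated as not in the high load state) are hypothetical, and the 512K-byte cut-off applies to the 30 ms standard response described above.

```shell
# Hypothetical triage helper following guidelines (1) to (3) of 7.6.1.
#   $1  CM state        ("high" = High load state detected)
#   $2  RAIDGroup state ("high" = High load state detected)
#   $3  I/O block size in KB
triage_response_delay() {
    cm_state=$1 rg_state=$2 block_kb=$3
    found=0
    if [ "$rg_state" = high ]; then
        echo "disk bottleneck: check the RAIDGroup busy rate alarm"
        found=1
    fi
    if [ "$cm_state" = high ]; then
        echo "CM bottleneck: check the CM busy rate alarm"
        found=1
    fi
    # 512K bytes is the rough indication given for a 30 ms standard response
    if [ "$found" -eq 0 ] && [ "$block_kb" -ge 512 ]; then
        echo "large I/O block size: review the threshold or the block size"
    fi
}
```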

7.6.2 RAIDGroup load error

Report detail:
RAIDGroup XXXX in high load state was detected.

Related graph:
Refer to the graph of the RAIDGroup XXXX busy rate (disk utilization).
Refer to the IOPS graph of each LogicalVolume in the RAIDGroup XXXX.

Guidelines for corrective measure:
1. I/O may be concentrated on the LogicalVolumes in the same RAIDGroup. <- (1)
Relocate the LogicalVolumes in the relevant RAIDGroup to other RAIDGroups (or newly created RAIDGroups) to distribute I/O.

(1): Indicates the guidelines for corrective measures for disk bottlenecks.
Take measures such as distributing I/O loads by relocating the LogicalVolume data with the highest IOPS in the relevant RAIDGroup to a RAIDGroup with less disk utilization or a newly created RAIDGroup.
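The measure in (1) involves two choices: the LogicalVolume with the highest IOPS in the overloaded RAIDGroup, and a destination RAIDGroup with low disk utilization. If those values are noted from the graphs into plain text, the choices can be sketched as below. The helper names and input formats are hypothetical, and a newly created RAIDGroup is not considered.

```shell
# stdin: "RAIDGroup LogicalVolume IOPS" per line (format is an assumption)
# $1: the RAIDGroup reported in the high load state
# Prints the LogicalVolume with the highest IOPS in that RAIDGroup.
busiest_volume() {
    awk -v rg="$1" '$1 == rg' | sort -k3,3nr | awk 'NR == 1 { print $2 }'
}

# stdin: "RAIDGroup busy-rate(%)" per line (format is an assumption)
# $1: the RAIDGroup to exclude (the overloaded one)
# Prints the RAIDGroup with the lowest busy rate among the others.
least_busy_raidgroup() {
    awk -v rg="$1" '$1 != rg' | sort -k2,2n | awk 'NR == 1 { print $1 }'
}
```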

7.6.3 CM load error

Report detail:
CM XX in the high load state was detected.

The monitoring states of other presumably related targets are as follows:
<Monitoring state of each CM> <- (1)
[CM0x00] High load state detected
[CM0x01] Not detected
[CM0x10] Not detected
[CM0x11] Not detected

Related graph:
Refer to the CM utilization graph.
Refer to the IOPS graph of each LogicalVolume. <- (2)

Guidelines for corrective measure:
1. I/O may be concentrated on the RAIDGroups under control of the same CM <- (2)
By referring to the monitoring state of each CM, distribute I/O to RAIDGroups under control of a CM with a low load.
2. If the monitoring state of every CM is "Monitoring" or "High load state detected," the number of CMs is probably too small to meet every I/O request.
Consider adding or upgrading hardware.

(1)

Indicates the states of the other CMs during the time period in which the relevant CM detected a high load state.

High load state detected	An alarm log indicating a load error in this CM was generated during the same time period.
Monitoring	The threshold has been exceeded several times, but no alarm has been raised yet; monitoring for alarm detection continues.
Not detected	This CM did not cause a bottleneck during the same time period.

(2)

If only the relevant CM detected a high load state, I/O access is probably concentrated unevenly on that CM.
Refer to the IOPS graphs of RAIDGroups and LogicalVolumes and take measures such as distributing I/O loads of each CM.

A CM may be put in the high load state when transaction processing contends with advanced copy.
In this case, review the operation so that advanced copy runs in a time slot with a relatively small transaction load. Alternatively, if the same CM controls both the copy source and copy target volumes (RAIDGroups), assign separate CMs to the source and target volumes.
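To judge whether I/O is concentrated on one CM, as described in (2), the IOPS of the RAIDGroups under each CM can be totaled. The sketch below assumes the values have been transcribed from the IOPS graphs into lines of the form "CM RAIDGroup IOPS"; the helper name and format are hypothetical.

```shell
# Hypothetical helper: total the IOPS handled by each CM so that an
# uneven concentration of I/O on one CM is easy to see.
# stdin: "CM RAIDGroup IOPS" per line (format is an assumption)
# Prints "CM total-IOPS share%" sorted by total, highest first.
cm_iops_share() {
    awk '{ total[$1] += $3; all += $3 }
         END {
             for (cm in total)
                 printf "%s %d %d%%\n", cm, total[cm], (all > 0 ? total[cm] * 100 / all : 0)
         }' | sort -k2,2nr
}
```

A CM holding most of the total share while the others stay low matches the uneven concentration described in (2).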

7.6.4 Port throughput load error

Report detail:
Port X in the high load state was detected.
The maximum transfer rate of the relevant port is 1 Gbps.

Related graph:
Refer to the Port X throughput graph.

Guidelines for corrective measure:
1. I/O is probably concentrated on the same port.
Check the setting of the path of the relevant port, or consider adding a path switch. <- (1)

(1): An I/O load is concentrated on the relevant port. Access is probably biased to the same port due to a change in logical paths or setting errors made during expansion. Examine the port load balance by referring to the send/receive graphs of all ports of the relevant switch.

7.7 Definition File

The following items can be set in the perf.conf configuration file for performance management:

The number of days to retain performance data
The number of generations of log files (perflog.*)

Refer to "Appendix D.4 perf.conf parameter" for the settings.
