Lately I have been running benchmarks on storage solutions that I built myself using open source file systems such as ZFS. I needed to compare the performance of the platform when I added disks or used different types of disks, changed RAID levels, added SSDs for caching, etc… With that in mind I started looking for a good benchmarking utility that could give me the information needed to evaluate a configuration change. Latency, average MB/s, and average IOPS are the three main measurements that I was concerned with. I came across Iometer, which provides this information and lets me customize the access specifications so that I can simulate different types of workloads, e.g. VDI, Web, File, SQL, Exchange, etc… There are plenty of good articles out there on how to configure Iometer, and I will cover some of the basics, but my main reason for writing this article was to aggregate the different access specifications in one place for people who want to use them, since I had to hunt them down from different sites.

The latest release of Iometer runs on Linux and Windows-based operating systems. The tool itself comes with two applications, “Dynamo.exe” and “IOmeter.exe”. To give you some background on these two programs and what each one does, I have copied the description directly from the user manual.

“Iometer is the controlling program. Using Iometer’s graphical user interface, you configure the workload, set operating parameters, and start and stop tests. Iometer tells Dynamo what to do, collects the resulting data, and summarizes the results in output files. Only one copy of Iometer should be running at a time; it is typically run on the server machine.

Dynamo is the workload generator. It has no user interface. At Iometer’s command, Dynamo performs I/O operations and records performance information, then returns the data to Iometer. There can be more than one copy of Dynamo running at a time; typically one copy runs on the server machine and one additional copy runs on each client machine.

Dynamo is multithreaded; each copy can simulate the workload of multiple client programs. Each running copy of Dynamo is called a manager; each thread within a copy of Dynamo is called a worker.”

You will notice that on startup Iometer will have a number of workers or threads. By default you should see one but you can add more by clicking the “start a new disk worker on selected manager” button. For this example I have configured mine for four.

Adding more workers lets you simulate more workloads at once and drive your storage solution to its limits. This is useful if you have shared storage that will run multiple VMs and each VM might have a different workload. You could concurrently have one worker test a database access specification, others a web server access specification, and the last one something such as VDI. This simulates different types of workloads on the same shared storage so you can see how it performs when multiple workloads are happening simultaneously.

You should experiment by starting with 1 worker and adding more until performance gets worse (higher latency, lower IOPS, etc…), at which point you will know what the limits of your storage are.
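
One way to read the results of such a sweep is to look for the worker count after which IOPS stops improving. The snippet below is only a minimal sketch of that idea: the find_saturation_point helper, the 5% threshold, and the sample figures are all hypothetical and made up for illustration, not measurements from my tests.

```python
from typing import List, Tuple

# (workers, average IOPS, average latency in ms); hypothetical sample values
Sample = Tuple[int, float, float]


def find_saturation_point(samples: List[Sample], min_gain: float = 0.05) -> int:
    """Return the last worker count that still improved IOPS by at least min_gain.

    Samples must be ordered by increasing worker count.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    best_workers, best_iops, _ = samples[0]
    for workers, iops, _latency in samples[1:]:
        if iops < best_iops * (1 + min_gain):
            # Gain is below the threshold (or negative): treat storage as saturated.
            break
        best_workers, best_iops = workers, iops
    return best_workers


# Hypothetical sweep from 1 to 4 workers (not real measurements).
sweep = [(1, 5000.0, 0.8), (2, 9500.0, 0.9), (3, 12000.0, 1.1), (4, 11800.0, 1.9)]
print(find_saturation_point(sweep))
```

In this made-up sweep the fourth worker adds latency without adding IOPS, so three workers would be the point to stop at.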

Note: It is not a good idea to have more workers than the number of physical cores in your system, since you might experience a performance hit due to thread context switching on the test system. Additionally, your results might be inaccurate if CPU utilization on your test system is high during testing. Make sure to close any programs that might drive your CPU to its limits while running Iometer.

The next setting that you should focus on is the maximum disk size, which tells Iometer how big the test file should be. The recommendation is to make the test file ten times bigger than the amount of memory in your system. The reason is that you don’t want calls to the storage subsystem to be intercepted and cached by the operating system, the host (if virtualized), or the storage cache. If you set this to a small amount, e.g. 100 MB, then chances are that you never hit the storage subsystem, since all your reads and writes might be occurring in memory, and you will get incorrect results.

Note: The maximum disk size is entered in sectors/blocks, so you will have to convert the size you want from gigabytes to sectors. This site has a good calculator for doing so. In my example 41943040 sectors (at 512 bytes per sector) translates to 20 GB.
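
If you would rather not rely on an online calculator, the conversion is simple enough to script. The sketch below assumes 512-byte sectors and binary gigabytes (1024^3 bytes), which is what makes 41943040 come out to 20 GB; the function names are my own, and the 8 GB machine is a hypothetical example.

```python
SECTOR_SIZE_BYTES = 512  # assumption: 512-byte sectors, matching the 20 GB example
GIB = 1024 ** 3          # binary gigabyte


def gigabytes_to_sectors(size_gb: int, sector_size: int = SECTOR_SIZE_BYTES) -> int:
    """Convert a test file size in GB to the sector count Iometer expects."""
    return size_gb * GIB // sector_size


def recommended_test_file_gb(ram_gb: int, multiplier: int = 10) -> int:
    """Apply the 'ten times the system memory' rule of thumb."""
    return ram_gb * multiplier


print(gigabytes_to_sectors(20))                           # 41943040
print(gigabytes_to_sectors(recommended_test_file_gb(8)))  # 167772160 (80 GB for 8 GB of RAM)
```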

The target section should be self-explanatory; it lists the drives or LUNs attached to this computer. Logical drives (a partition on a disk) are yellow, and a red slash means that the logical drive does not have a test file yet (the size of the test file is determined by the maximum disk size value). Physical drives that don’t have partitions or are not formatted are colored blue. Blue disks do not need a test file to be created, so the test can start immediately.

Note: If you use a yellow disk then delete the test file once you are done so that it doesn’t consume space.

Below is a good description for when you might want to test a disk without a file system. This came from a blog post in the VMware communities forum and has some very good information.

“When choosing the system to test in the topology frame, the “Disk Targets” tab will provide options as to the storage target. The options here include formatted disks (yellow) or unformatted disks (blue). In the former case Iometer address the storage through the OS’s file system (FS). In the latter, direct calls are made to the hardware without using a FS. Storage specialists are usually more interested in just the hardware so evaluation of unformatted LUNs (blue) is preferable. There is some cost of virtualizing the OS’s interface to the disk through the FS so formatting the disk with the correct FS and testing the yellow target can be helpful.”

The # of Outstanding I/Os is very important and should be set to something that is greater than 1 because otherwise you will not be stressing the storage subsystem. The manual has the following description:

“The # of Outstanding I/Os control specifies the maximum number of outstanding asynchronous I/O operations per disk the selected worker(s) will attempt to have active at one time.”

Assuming that you have 4 workers all testing against the same target with the # of Outstanding I/Os set to 16, as in my example, you will have 4 workers * 16 outstanding I/Os per disk * 1 disk = 64 outstanding I/Os at a time. I ran a couple of tests on an old Intel SSD, model SSDSA2M080G2GC, to show the importance of this parameter. Leaving everything else constant and modifying only the # of Outstanding I/Os, I got the following results.

[Figure: average IOs/sec for each access specification at 16, 32, and 64 outstanding I/Os]

The chart above shows the average IOs/sec achieved using different access specifications, which are detailed near the end of this article. For each access specification I tested three scenarios, shown across the bottom: 16, 32, and 64 outstanding I/Os. It is clear that increasing the number of outstanding I/Os to 32 allowed me to achieve better performance for almost all test scenarios. This tells me that 16 outstanding I/Os wasn’t pushing this SSD to its limit. At 64 outstanding I/Os my results become worse for the majority of the tests. Moving on to the latency numbers, I had the following results.

[Figure: average latency for each access specification at 16, 32, and 64 outstanding I/Os]

These numbers actually give you a better picture of why 64 outstanding I/Os is not doing much for me in this case. The latency numbers for every test at this value are double if not triple those at 32 outstanding I/Os. Going from 16 outstanding I/Os to 32 gives me a small bump in latency, which is expected, but nothing as drastic as the jump to 64.
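
A doubling of latency at 64 is roughly what queueing theory predicts once a device is saturated. Little's law says that the average number of I/Os in flight equals IOPS multiplied by average latency, so if IOPS stays flat while the outstanding I/Os double, latency doubles too. The sketch below is my own illustration of that relationship; the 20,000 IOPS figure is a hypothetical number, not one of my measurements.

```python
def expected_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's law: outstanding = IOPS * latency, so latency = outstanding / IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops * 1000.0  # seconds to milliseconds


# Hypothetical device that tops out at 20,000 IOPS.
for outstanding in (32, 64):
    print(outstanding, round(expected_latency_ms(outstanding, 20000.0), 2))
```

With these assumed numbers the average latency goes from about 1.6 ms at 32 outstanding I/Os to about 3.2 ms at 64, without a single extra I/O per second being completed.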

Lastly, the numbers for average MBs/sec mirror the results of the average IOs/sec and at 64 outstanding I/Os the performance is worse for the majority of the tests.

[Figure: average MBs/sec for each access specification at 16, 32, and 64 outstanding I/Os]

What these test runs indicate is that in order to squeeze the maximum performance out of this SSD I must use a value of 32 outstanding I/Os. You will find that most people recommend using 32 as a value for outstanding I/Os when testing storage.

If you are testing on a VM and you are running ESXi using the software iSCSI initiator, then you can verify the queue depth of a LUN by enabling SSH on your host and running the esxtop command. Press the letter ‘u’ on your keyboard to switch over to the disk device view. Once you are in the disk device view, press ‘f’ and select ‘Queue Stats’. The DQLEN column value is the queue depth of the storage device.

[Screenshot: esxtop disk device view showing the DQLEN column]

What this means is that setting the # of Outstanding I/Os per target to a value greater than 32 will have no effect, since my device/LUN queue depth is 32. Additionally, a VM also has a default queue size of 32. If you need more information on VMware storage queues, the following two articles/presentations (one and two) do a great job of explaining this information.
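
To make the arithmetic concrete, here is a small sketch that combines the workers * outstanding I/Os * disks calculation from earlier with the queue depth cap. It is a deliberately simplified model (in reality the extra I/Os wait in a queue higher up the stack rather than disappear), and the function names are my own.

```python
def total_outstanding_ios(workers: int, outstanding_per_disk: int, disks: int = 1) -> int:
    """Total I/Os Iometer tries to keep in flight: workers * per-disk setting * disks."""
    if min(workers, outstanding_per_disk, disks) < 1:
        raise ValueError("all arguments must be at least 1")
    return workers * outstanding_per_disk * disks


def ios_active_on_device(requested: int, device_queue_depth: int = 32) -> int:
    """Simplified model: the device queue caps how many I/Os are active at once."""
    return min(requested, device_queue_depth)


requested = total_outstanding_ios(workers=4, outstanding_per_disk=16, disks=1)
print(requested)                        # 64
print(ios_active_on_device(requested))  # 32, capped by a DQLEN of 32
```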

Note: If you are testing with multiple workers make sure that all the changes are made on the computer object itself and not the individual workers. Making changes on the computer object will propagate them down to the workers. If you want your workers to have different settings from each other then customize each one individually.

The last important setting here is the “Write IO Data Pattern” option. There are three different settings to choose from. The descriptions below come from the project’s change log.

Data randomization – 3 pull-down menu choice are available in IOmeter to select the type of data buffer utilized. Each will have a different effect on system performance and memory utilization foot print. Options are:

“Repeating bytes” — Each IO having a unique set of repeating bytes (ex: First IO is all A’s, second IO is all B’s). The command to command delay is slightly (~12%, but can be longer in high IO environments) longer with this configuration.

“Pseudo-Random” — A buffer being filled with random data and is reused for every write IO. Note: This was the behavior of all IOMeter versions up to and including 2006.07.27

“Full random” — A 16MB buffer is generated by IOMeter which is filled with random data. Each write IO randomly indexes into this buffer. Note: The 16MB buffer may impact performance tests that are sensitive to memory utilization. Drives that implement dedupe compression may still compress this data.

From my understanding, you would choose a write IO data pattern of full random when comparing different units, so that controllers that perform compression or inline deduplication do not have an unfair advantage over units that do not. If instead you are benchmarking a single storage unit or SSD to figure out how much it can handle, and its controller is capable of deduplication or compression, then it makes sense not to negate these features, since your production workloads will be taking advantage of them.

In the network tab I did not make any changes, but in the third tab I have created some access specifications based on information gathered from the following articles (one, two, and three).

Workload                    I/O Size  Read %  Write %  Random %  Sequential %
Windows 7                   4 KB      20      80       80        20
Web File Server 4 KB        4 KB      95      5        75        25
Web File Server 8 KB        8 KB      95      5        75        25
Web File Server 64 KB       64 KB     95      5        75        25
Decision Support System DB  1 MB      100     0        100       0
Media Streaming             64 KB     98      2        0         100
SQL Server Log              64 KB     0       100      0         100
OS Paging                   64 KB     90      10       0         100
Web Server Log              8 KB      0       100      0         100
OLTP DB                     8 KB      70      30       100       0
Exchange Server             4 KB      67      33       100       0
Video on Demand             512 KB    100     0        100       0
Backup                      256 KB    100     0        0         100
Restore                     256 KB    0       100      0         100
Max Read IOPS               512 B     100     0        0         100
Max Write IOPS              512 B     0       100      0         100
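
If you want to keep these access specifications somewhere you can script against, the table can be captured as plain data. The sketch below is just one way to do it in Python: the AccessSpec class is my own invention and is not Iometer's configuration file format. Only the read and random percentages are stored, since the write and sequential percentages are always the remainder out of 100.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class AccessSpec:
    name: str
    io_size_bytes: int
    read_pct: int
    random_pct: int

    def __post_init__(self) -> None:
        # Percentages must be valid so the derived values below stay in range.
        for value in (self.read_pct, self.random_pct):
            if not 0 <= value <= 100:
                raise ValueError(f"{self.name}: percentage out of range: {value}")

    @property
    def write_pct(self) -> int:
        return 100 - self.read_pct

    @property
    def sequential_pct(self) -> int:
        return 100 - self.random_pct


ACCESS_SPECS = [
    AccessSpec("Windows 7", 4 * KB, 20, 80),
    AccessSpec("Web File Server 4 KB", 4 * KB, 95, 75),
    AccessSpec("Web File Server 8 KB", 8 * KB, 95, 75),
    AccessSpec("Web File Server 64 KB", 64 * KB, 95, 75),
    AccessSpec("Decision Support System DB", 1024 * KB, 100, 100),
    AccessSpec("Media Streaming", 64 * KB, 98, 0),
    AccessSpec("SQL Server Log", 64 * KB, 0, 0),
    AccessSpec("OS Paging", 64 * KB, 90, 0),
    AccessSpec("Web Server Log", 8 * KB, 0, 0),
    AccessSpec("OLTP DB", 8 * KB, 70, 100),
    AccessSpec("Exchange Server", 4 * KB, 67, 100),
    AccessSpec("Video on Demand", 512 * KB, 100, 100),
    AccessSpec("Backup", 256 * KB, 100, 0),
    AccessSpec("Restore", 256 * KB, 0, 0),
    AccessSpec("Max Read IOPS", 512, 100, 0),
    AccessSpec("Max Write IOPS", 512, 0, 0),
]

for spec in ACCESS_SPECS:
    print(f"{spec.name}: {spec.io_size_bytes} B, "
          f"{spec.read_pct}/{spec.write_pct} read/write, "
          f"{spec.random_pct}/{spec.sequential_pct} random/sequential")
```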

The test setup tab has settings that deal with the run time and ramp up time of each test. Normally it is recommended to set the ramp up time to 1 minute, as this gives spinning disks a chance to warm up. As for the run time, you want each test to run long enough that any outliers are averaged out. In other words, you get better results by running each test longer. I usually do 3-5 minutes per test as this gets me good measurements. Don’t forget to set ‘record results’ to ‘all’ so that all the data gets written to a CSV file.
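
These settings add up quickly when you queue many access specifications, so it helps to estimate the total duration before you start. The sketch below assumes that every access specification runs one after the other and that each gets its own ramp up period; the function name is my own.

```python
def total_test_minutes(num_specs: int, ramp_up_min: float = 1.0, run_time_min: float = 5.0) -> float:
    """Estimate the total duration when each spec gets its own ramp up and run time."""
    if num_specs < 0 or ramp_up_min < 0 or run_time_min < 0:
        raise ValueError("arguments must not be negative")
    return num_specs * (ramp_up_min + run_time_min)


# All 16 access specifications from the table above.
print(total_test_minutes(16, ramp_up_min=1, run_time_min=3))  # 64 minutes
print(total_test_minutes(16, ramp_up_min=1, run_time_min=5))  # 96 minutes
```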

The ‘Results Display’ tab shows results live as the tests are executing. The ‘update frequency’ controls how often the results refresh. You generally do not want to set this lower than 3 seconds, as the refresh consumes CPU cycles on your PC and might skew the results.

With regards to the different measurements, there are three that matter. ‘Total I/Os per Second’ is the number of I/O operations completed per second; the higher it is, the better the performance you will see. ‘Total MBs per Second’ is the product of I/Os per second and the transfer request size, and it measures throughput, that is, how many megabytes are transferred per second for a specific transfer request size. Lastly, the ‘Average I/O Response Time’ is the average of how long each I/O operation takes from start to end.
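
The relationship between the first two measurements is easy to check by hand. The sketch below assumes a megabyte of 1024 * 1024 bytes (some tools report decimal megabytes instead, which gives slightly larger numbers), and the IOPS figures are hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def throughput_mb_per_sec(iops: float, transfer_size_bytes: int) -> float:
    """Throughput is I/Os per second multiplied by the transfer request size."""
    return iops * transfer_size_bytes / MB


print(throughput_mb_per_sec(10000, 4 * 1024))  # 39.0625 MB/s at a 4 KB request size
print(throughput_mb_per_sec(1000, 64 * 1024))  # 62.5 MB/s at a 64 KB request size
```

This is why a workload with a small transfer request size can post a high IOPS number and still show modest throughput, while a large-block workload does the opposite.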

I think that this is a good point to wrap up this article. In the next parts of this series I will be looking at other disk benchmarks that are specific to platforms such as Microsoft and VMware. I will also look at ways to create access specifications based on the current workloads in your environment, to help you better estimate how well your current workload will perform on another storage subsystem. As usual, thank you for reading this article and feel free to leave any comments.
