IOPS Testing in a View Environment

IOPS – Input/Output Operations Per Second – is a key factor when designing a View solution (arguably even more important than storage capacity, given the various tools that deal with storage consolidation). That’s old news. What I’ve been having fun doing recently is trying to determine just what kind of IO I can expect from the various LUNs that have been carved from my SAN. I’m using pretty standard techniques here that would work for any server analysis; however, I’m applying them to my virtual desktops in order to ensure that my storage can support the number of desktops that I plan to load up.

We’ve probably all seen per-disk IOPS figures presented in a table like this (taken from Wikipedia):

Device                    Type   IOPS                 Interface
7,200 rpm SATA drives     HDD    ~75-100 IOPS [2]     SATA 3 Gb/s
10,000 rpm SATA drives    HDD    ~125-150 IOPS [2]    SATA 3 Gb/s
10,000 rpm SAS drives     HDD    ~140 IOPS [2]        SAS
15,000 rpm SAS drives     HDD    ~175-210 IOPS [2]    SAS

Those tables are always marked as “ballpark” numbers and can certainly be good for getting an off-the-cuff estimate. That said, when it comes to real-world application, they’re rubbish. IOPS is a hugely variable metric that depends highly on the nature of the transactions. For example, on one LUN, just by varying the Read/Write ratio my resulting IOPS varied by a factor of 10 (the array’s cache certainly contributed to this variance). Varying the randomness (vs. sequential data access) can have similar results. Changing the transaction size is another way to game the metric. When it comes down to it, you’ve got to monitor your environment and set up tests that are similar to the actual transactions that you observe. That means that you have two distinct challenges – discover something that can roughly be called a “normal” IO load, and find a tool that can give you feedback about your storage.

To discover what is “normal” for your desktops, I’d recommend using a tool like Liquidware Labs Stratusphere. It’ll give you all sorts of awesome baselines, including IO, application startup time and compute resources, and it will even do some basic analysis. Failing that, at least set up a test desktop and perform some monitoring of a real live user. Make sure that it has a common set of baseline applications and see if you can get a user to actually use it for a day. I advocate testing a virtual desktop (rather than the user’s actual physical desktop) because vCenter 5 has some great metric reporting tools built into it, although perfmon can be used with a physical desktop in a pinch (use the PhysicalDisk: Disk Reads/sec and Disk Writes/sec counters). I prefer to use an actual View linked clone (assuming that that’s going to be the production solution) because linked clones can exhibit different IO characteristics than a physical desktop would.
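If you do end up on the perfmon route, you don’t have to babysit the counters. Here’s a minimal sketch of automating the collection – it assumes a Windows test desktop with the built-in typeperf utility, and the 5-second interval over 60 samples (5 minutes) is an arbitrary choice, not a recommendation:

```python
# Sketch: sample the perfmon counters mentioned above and average them.
# Assumes a Windows test desktop with the built-in typeperf utility;
# the 5-second interval and 60 samples (5 minutes) are arbitrary choices.
import csv
import subprocess

COUNTERS = [
    r"\PhysicalDisk(_Total)\Disk Reads/sec",
    r"\PhysicalDisk(_Total)\Disk Writes/sec",
]

# typeperf emits CSV: a timestamp column followed by one column per counter.
output = subprocess.run(
    ["typeperf", *COUNTERS, "-si", "5", "-sc", "60"],
    capture_output=True, text=True, check=True,
).stdout

samples = []
for row in csv.reader(output.splitlines()):
    if len(row) != 3:
        continue  # skip typeperf's status messages
    try:
        samples.append((float(row[1]), float(row[2])))
    except ValueError:
        continue  # skip the header row and any blank first sample

avg_reads = sum(r for r, _ in samples) / len(samples)
avg_writes = sum(w for _, w in samples) / len(samples)
print(f"Average read IOPS:  {avg_reads:.1f}")
print(f"Average write IOPS: {avg_writes:.1f}")
print(f"Observed Read:Write ratio ~ 1:{avg_writes / avg_reads:.1f}")
```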

To discover this data in vCenter, look at your test machine’s Disk: “Average read requests per second” and “Average write requests per second” metrics during an appropriate test period. I would not recommend looking at a full day’s average, as we’re only interested in periods of activity for that machine. Get an idea of the ratio between Reads and Writes (at this customer, I noticed a 1:2 Read:Write ratio on the test machine, which is wildly different from the literature’s 9:1 Read:Write examples). I examine those metrics on the LUN object under the VM’s metrics, as that should also capture the read hits from the replica disk. You can sanity check this data by looking at the LUN itself (as an aggregate of all VM activity) to ensure that you’re not missing anything or wildly off base with your numbers. That gives you your IOPS information, as well as your read:write ratio. Adding your “Average read requests…” and your “Average write requests…” together will give you your total IOPS required for a single user.

So, how do you determine your average transaction size (so that you can run accurate tests/benchmarks)? You’ll have to do a little math. In vCenter, that can be discovered by looking at your Read rate and Write rate metrics. If you want to find the average size of a read transaction, simply divide the Read rate by the Read IOPS. Do the same for the Write rate and the Write IOPS, then take a weighted average based on the observed Read:Write ratio. There you have it – you’ve discovered your total IOPS, your read:write ratio and an average transaction size. With those bits of data in hand, you’re ready to move on to the next challenge: figuring out just what you can do with your storage and whether it actually meets your needs.
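To make the arithmetic concrete, here’s a minimal sketch of that calculation. The input numbers are hypothetical placeholders (chosen to match a 1:2 Read:Write ratio), not measurements from my environment:

```python
# Sketch: derive per-desktop IOPS, read:write ratio, and average transaction
# size from vCenter's per-VM disk metrics. The input numbers below are
# hypothetical placeholders - substitute your own observations.
read_iops = 4.0         # "Average read requests per second"
write_iops = 8.0        # "Average write requests per second"
read_rate_kbps = 64.0   # "Read rate" (KBps)
write_rate_kbps = 96.0  # "Write rate" (KBps)

total_iops = read_iops + write_iops
read_pct = read_iops / total_iops
write_pct = write_iops / total_iops

# Average size of a single read or a single write, in KB
avg_read_kb = read_rate_kbps / read_iops
avg_write_kb = write_rate_kbps / write_iops

# Weighted average transaction size, weighted by the observed read:write ratio
avg_xfer_kb = read_pct * avg_read_kb + write_pct * avg_write_kb

print(f"Total IOPS per desktop: {total_iops:.0f}")
print(f"Read/Write split: {read_pct:.0%} / {write_pct:.0%}")
print(f"Average transaction size: {avg_xfer_kb:.1f} KB")
```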

The first tool that I use for this task is an array calculator (calculating by hand is pretty complex when RAID gets involved). It helps me to do some predictive work, figuring out RAID levels and spindle count. I like to use the array calculator at http://www.wmarow.com/strcalc/. I enter my disk information, set my read/write ratio and transaction size, and then I play around with different RAID configurations. You’ll have to talk to your SAN guys to get an idea about appropriate entries for the Read/Write cache hit percentages, but this can at least help you to understand the IO impact of various RAID configurations and spindle counts. Once you have an idea about what sort of LUN you’re going to need and have it created, you can set up some test VMs on it and see how it plays out in reality.
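If you want a rough feel for why the RAID level matters so much before you even open the calculator, the usual rule-of-thumb write penalties (RAID 10 = 2, RAID 5 = 4, RAID 6 = 6) can be applied by hand. A quick sketch, reusing the hypothetical per-desktop numbers from above and ignoring array cache entirely:

```python
# Sketch: estimate backend (spindle) IOPS required for a pool of desktops,
# using the common rule-of-thumb RAID write penalties and ignoring array
# cache entirely. All figures here are hypothetical placeholders.
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

desktops = 50
iops_per_desktop = 12.0   # from the earlier per-desktop measurement
read_pct = 1 / 3          # observed 1:2 read:write ratio

frontend_iops = desktops * iops_per_desktop
read_iops = frontend_iops * read_pct
write_iops = frontend_iops * (1 - read_pct)

for raid, penalty in RAID_WRITE_PENALTY.items():
    backend = read_iops + write_iops * penalty
    # ~140 IOPS per 10k SAS spindle is the ballpark figure from the table above
    spindles = backend / 140
    print(f"{raid}: ~{backend:.0f} backend IOPS, roughly {spindles:.1f} spindles")
```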

In order to stress my LUNs and see what I can actually get from them, I use Iometer. I’ll usually put together a few VMs and add their results, just to ensure that I’m not running into any bottlenecks with a single VM. When setting up the test in Iometer (under the “Access Specifications” tab), you’ll want to be sure to specify the “Transfer Request Size” that was found earlier and set the “Percent Read/Write Distribution” as well. I like to test the worst-case scenario, so I use 100% random requests. When you’ve got 50+ desktops on a single LUN, even data access from a single VM that would be sequential is pretty much random by the time it gets to the LUN, because many VMs are requesting transactions at the same time (unless you’re using some tool to optimize your transactions).

With your test configured in Iometer, you’ll need to go to the Results Display tab. I’d recommend changing the “Update Frequency” to about 10 seconds – the default of infinity means that Iometer will wait until the test ends before updating the display, so you won’t see any results until you stop your test (although the displayed average will be more accurate). Once it’s configured, start your test by pressing the green flag and wait for the results to roll in. The first time that Iometer runs, it has to do some prep work that can take a while – just let it do its thing. Eventually, you’ll get to load up some awesome stress on the system, and that will give you actual, real-life metrics about what your LUN is capable of providing under load. Because the load pattern is based on an actual user’s usage, it should reflect what you will see in production. By comparing the IO capacity of the LUN (as revealed by Iometer) to the usage of that original user, you can get an idea of the number of users that can be successfully supported by that LUN.
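That last step is simple division, though it’s worth leaving yourself some headroom for things like boot and login storms. A quick sketch, with hypothetical numbers for both the Iometer result and the per-user load:

```python
# Sketch: estimate how many desktops a LUN can support, given the sustained
# IOPS that Iometer measured and the per-user IOPS observed earlier.
# Numbers are hypothetical; the 20% headroom is an arbitrary safety margin.
lun_capacity_iops = 1800.0   # sustained IOPS from the Iometer run
iops_per_user = 12.0         # from the per-desktop measurement
headroom = 0.20              # reserve some capacity for boot/login storms

usable_iops = lun_capacity_iops * (1 - headroom)
users_supported = int(usable_iops // iops_per_user)
print(f"Approximate desktops per LUN: {users_supported}")
```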
