TA3461 - IO DRS: Tech Preview for VM Performance Isolation
This is a very new area of research at VMware, started only about 2 years ago. Since this is a Tech Preview, there is no roadmap for when it will be available.
The Problem:
Many different workloads hit the same set of disks/arrays/spindles etc. Low priority processes that run ad hoc or at odd times will impact higher priority systems. What you want to see is the low priority VM getting less performance than the higher priority systems. The question is: how can you do this?
A solution: Resource Controls
Assign shares for disk performance, just like the CPU/Memory shares from the original ESX days. A host with a higher total of shares gets higher priority on that shared VMFS volume.
To configure this you'd go into the VM and set the shares. Fairly straightforward. There are two settings: the shares, and a limit expressed in IOPS. Interesting idea.
The first case study covers two separate hosts running the same workload levels, with and without IO DRS turned on, and showed a pretty significant difference in IOPS and latency. With it turned off both VMs ran at 20 ms and 1500 IOPS. With it on, latency changed to 16 ms and 31 ms, with a similar spread for IOPS. Nice.
Case study two is a more serious one with SQL Server running. The shares were 4:1, but the performance did not come out at that ratio. What they are seeing is that load time matters significantly. Overall throughput is working well, though the loads make a big difference.
The demo showed the shares and the IOPS limit being changed on the fly, and the IOMeter machines adjusted immediately. When one VM's IOPS were limited, the other systems picked up the slack and got more performance.
After showing the demo the presenters asked if anyone in the packed room (and I do mean PACKED) would find value in this. Everyone immediately raised their hands.
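To make the "picked up the slack" behaviour concrete, here is a minimal Python sketch of shares plus an IOPS limit. It is my own illustration, not VMware's algorithm: the VM class, the allocate_iops function, and the idea of a single fixed pool of IOPS to divide are all assumptions, and shares are assumed to be positive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VM:
    name: str                           # hypothetical VM name
    shares: int                         # disk shares, assumed > 0
    iops_limit: Optional[float] = None  # None means no IOPS cap


def allocate_iops(total_iops: float, vms: List[VM]) -> Dict[str, float]:
    """Split total_iops by shares; capped VMs hand their excess to the rest."""
    alloc: Dict[str, float] = {}
    active = list(vms)
    remaining = float(total_iops)
    while active:
        total_shares = sum(vm.shares for vm in active)
        # VMs whose proportional share would exceed their configured limit.
        capped = [
            vm for vm in active
            if vm.iops_limit is not None
            and remaining * vm.shares / total_shares > vm.iops_limit
        ]
        if not capped:
            # Nobody hits a limit: plain proportional split of what is left.
            for vm in active:
                alloc[vm.name] = remaining * vm.shares / total_shares
            break
        # Pin capped VMs at their limit and redistribute the rest next pass.
        for vm in capped:
            alloc[vm.name] = float(vm.iops_limit)
            remaining -= vm.iops_limit
            active.remove(vm)
    return alloc


print(allocate_iops(3000, [VM("a", 1000), VM("b", 1000)]))
print(allocate_iops(3000, [VM("a", 1000), VM("b", 1000, iops_limit=500)]))
```

With equal shares both VMs get half of the pool. Once "b" is limited to 500 IOPS, the IOPS it can no longer use go to "a", which is the effect the demo showed.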
The tech approach is first to detect congestion: if latency is above a threshold, IO DRS kicks in. If it isn't borked, don't fix it. IO DRS works by controlling the IOs issued per host. The sum of the shares of the VMs on a host with IO DRS enabled is compared with the other hosts to determine share priority. So first the host's priority is worked out, then the VMs' shares on that host are prioritized, and then it is back to the host level. The share control applies across all hosts using that same VMFS volume.
IO slots are filled based on the shares on each host. There are only so many IO slots per host. This is how the IOs are controlled when there is congestion.
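A rough Python sketch of those two steps, as I understood them: check latency against a threshold, then weigh each host by the sum of its VMs' shares. The function names, the data layout, and the 30 ms threshold are all my assumptions; the session did not give a threshold value.

```python
from typing import Dict

# Assumed value for illustration only; no number was given in the session.
LATENCY_THRESHOLD_MS = 30.0


def is_congested(observed_latency_ms: float,
                 threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Share control only kicks in when latency is above the threshold."""
    return observed_latency_ms > threshold_ms


def host_share_fractions(hosts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """hosts maps host name -> {vm name: shares} for one shared VMFS volume.

    Returns each host's fraction of the volume's IO, based on the sum of
    the shares of the VMs running on it.
    """
    host_totals = {host: sum(vms.values()) for host, vms in hosts.items()}
    grand_total = sum(host_totals.values())
    return {host: total / grand_total for host, total in host_totals.items()}


hosts = {
    "esx1": {"vm1": 1000, "vm2": 2000},  # hypothetical hosts and VMs
    "esx2": {"vm3": 1000},
}
if is_congested(35.0):
    print(host_share_fractions(hosts))
```

In this example esx1 carries 3000 of the 4000 total shares, so it would be entitled to three quarters of the IO on that volume while the congestion lasts.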
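Here is one way the slot filling could look in Python. This is a sketch under my own assumptions: the fill_slots name, the 32-slot figure, and the largest-remainder rounding are mine, not something the presenters described.

```python
from typing import Dict


def fill_slots(slots: int, vm_shares: Dict[str, int]) -> Dict[str, int]:
    """Divide a host's IO slots among its VMs in proportion to their shares."""
    total = sum(vm_shares.values())
    exact = {vm: slots * shares / total for vm, shares in vm_shares.items()}
    # Start from the whole-number part of each VM's proportional slice.
    alloc = {vm: int(quota) for vm, quota in exact.items()}
    leftover = slots - sum(alloc.values())
    # Give the remaining slots to the VMs with the largest fractional parts.
    by_remainder = sorted(vm_shares, key=lambda vm: exact[vm] - alloc[vm],
                          reverse=True)
    for vm in by_remainder[:leftover]:
        alloc[vm] += 1
    return alloc


# Hypothetical host with 32 slots and two VMs at a 4:1 share ratio.
print(fill_slots(32, {"sql": 2000, "batch": 500}))
```

Slots are whole numbers, so a 4:1 share ratio will not always come out as an exact 4:1 split; here 32 slots divide into 26 and 6.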
There are two major performance metrics in the storage industry: bandwidth (MB/s) and throughput (IOPS). Each has its pros and cons. Bandwidth is the better measure for workloads with large IO sizes, and IOPS is the better measure for workloads with lots of small IOs. IO DRS controls the array queue among the VMs. So if a VM has lots of small IOs, it can keep issuing them and get high IOPS. Conversely, if it is doing large IOs, it will get high bandwidth and low IOPS under the same share control system.
Case studies and test runs have shown that device-level latency stays the same as workloads change. Some tests have shown that with IO DRS, IOPS can go up simply due to the workloads involved. Controlling the IOs lets every VM get work done, though depending on the workload a VM can accomplish more.
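The relationship between the two metrics is just bandwidth = IOPS x IO size. A quick Python check, with IO sizes and IOPS figures that I made up for illustration:

```python
def bandwidth_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Bandwidth in MB/s for a given IOPS rate and IO size in KB."""
    return iops * io_size_kb / 1024.0


# Illustrative numbers only, not from the session.
small_io = bandwidth_mb_per_s(iops=4000, io_size_kb=4)    # many small IOs
large_io = bandwidth_mb_per_s(iops=500, io_size_kb=256)   # few large IOs
print(small_io, large_io)
```

The small-IO workload posts eight times the IOPS but moves far less data per second, which is why the same share of the queue can show up as high IOPS for one VM and high bandwidth for another.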
The key understanding is that IO DRS really helps when there is congestion. When things are good and latency is not high enough to trigger the system, the shares are not used. If a high IO share system is not using its slots, they are reassigned to other VMs in the cluster.
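The reassignment of unused slots could be sketched like this in Python. The reassign_idle_slots name, the per-VM demand figures, and the policy of handing out spare slots one at a time, highest entitlement first, are all my assumptions; the session only said that idle slots go to other VMs.

```python
from typing import Dict


def reassign_idle_slots(entitled: Dict[str, int],
                        demand: Dict[str, int]) -> Dict[str, int]:
    """entitled and demand map VM name -> slots. Returns slots granted."""
    # Nobody is granted more than they are asking for.
    granted = {vm: min(entitled[vm], demand[vm]) for vm in entitled}
    spare = sum(entitled.values()) - sum(granted.values())
    while spare > 0:
        hungry = [vm for vm in entitled if demand[vm] > granted[vm]]
        if not hungry:
            break  # every VM is satisfied, the rest stays idle
        # One spare slot per hungry VM per round, highest entitlement first.
        for vm in sorted(hungry, key=lambda v: entitled[v], reverse=True):
            if spare == 0:
                break
            granted[vm] += 1
            spare -= 1
    return granted


# Hypothetical VMs: "sql" has the most shares but is nearly idle.
entitled = {"sql": 24, "web": 6, "batch": 2}
demand = {"sql": 4, "web": 10, "batch": 30}
print(reassign_idle_slots(entitled, demand))
```

Here "sql" only uses 4 of its 24 slots, so "web" gets everything it asks for and "batch" soaks up the rest, ending at 4, 10 and 18 slots respectively.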
The overall gain is the ability to do performance isolation among VMs based on disk IO.
In the future they are looking to tie this into more vStorage APIs, VMotion and Storage VMotion, and potentially IOPS reservations.
Rocking cool and can't wait for this to come out.