Cloud Computing Solution Provider – VMware

The recent Zimbra acquisition by VMware threw me for a bit of a loop initially. Then I started chewing on it and read the good post by Rodney Haywood. Very shortly afterwards I had a classic Homer “D’oh” moment.

VMware aims to build, from the ground up, the best cloud computing solution it can sell. That is taking into account that the definition of cloud computing today is about as vague as a real cloud in the sky. Today that cloud is fluffy and in 5 mins that cloud is shaped like a rabbit. So far they have built a pretty strong infrastructure level for customers with vSphere, vCenter and various add-on tools. They have picked up SpringSource to ultimately offer a platform for services and a closer understanding of how the JVM interacts with the hypervisor. Now they are getting into the services space with Email/Calendaring.

  • IAAS -> vSphere/vCenter
  • PAAS -> SpringSource
  • SAAS -> Zimbra

Each of these areas is really focused on a different customer base at the end of the day. Sure, you can say the customer is IT, but that’s like saying your customer base is the TV viewing audience. It is too vague, and there is a better, more definable end customer grouping.

  • IAAS -> Server/Storage/PC/Hardware Teams – Ground Level System Admins
  • PAAS -> Development Teams making solutions up – Architects/Developers
  • SAAS -> Back Office management/utilities – Often more visible to the CxOs.

So where are they going next and what areas are missing for the full suite for all the different customer bases they are aiming for?

Limits have their limits

I’ve been chewing on this post by Duncan at Yellow Bricks for the past month and a half. It covers some complicated issues that one has to deal with in an enterprise-size environment, with many assumptions about what gets you into this mess in the first place. The best thing to do is downscale and upscale as needed based on good performance monitoring and bottleneck research. Thankfully I’ve built good enough relationships with most teams where I work that this has become the standard operating procedure, though sometimes we just can’t. At the end of the day the issue boils down to a simple goal:

“As the VMware environment administrator, how can I make better use of what I have available to me?”

For my environment I run into a variety of political reasons going from..

  • “I am going to need that extra 2 CPUs someday in the future so I can’t give them up now.”
  • “The vendor docs say I really do need 8 CPUs and 128G of RAM for my 3 users even though 126G is unused.”
  • “Someone on your team said I really do need that 8G of RAM so I won’t give it up”
  • “Oh come on.. what’s another 2G of RAM”
  • “I gave up my budget for a physical to do this as a virtual even though I’m still spending less in the grand scheme.  Gimme more resources.”

to the begging

  • “Pleaseeee.  I think it’ll help my issues.  It might even make me look better to my co-workers.”

I have two distinct use cases that really showcase how hard this kind of capability can be to use well.

Case #1:  The poorly written VBscript

Back in the early Windows 3.1 days when VB was a novel concept, some developers made this groundbreaking app that would pull data from a remote system, massage the data a bit and put it into a centralized Btrieve database. Well, the script they wrote goes to sleep for a minute once the remote system’s queue it checks is empty. The script’s sleep function works by checking the clock to see if a minute has passed. It checks the clock constantly, which consumes 100% of the CPU all the time. This wasn’t much of an issue when each one of these systems was on its own old PC. We virtualized them since 16 XP workstations in the datacenter is a management headache. Now that’s 16 high-power cores, multiple generations newer, being used 100% all day long for no good reason.

We, VMware Admins, have discovered that on the old PCs these systems would easily take 5-10 mins to work through their work queues.   On the newest hardware we have with these as VMs, it takes under 15 seconds to do the same work.   So for 60 seconds it is doing nothing except checking the hardware clock.

Solution #1:  CPU limits good

We implemented a CPU limiting resource pool for these VBscript VMs.   They are still running mega fast in comparison to where they were a year ago.   Now they are using no more than 8 cores worth at any given time.  A big improvement until the app developers decide if they are going to replace all that code with sleep 60 or recode the entire app.
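
The clock-polling wait from Case #1 versus a real `sleep 60` can be sketched in Python. This is only an illustration, not the original VBScript; the function names are made up for the example and the waits are shortened from a minute to a fraction of a second.

```python
import time

def busy_wait(seconds):
    """Poll the clock in a tight loop until `seconds` have passed (burns a core)."""
    deadline = time.monotonic() + seconds
    checks = 0
    while time.monotonic() < deadline:
        checks += 1  # every pass is just another look at the clock
    return checks

def blocking_wait(seconds):
    """Hand the CPU back to the OS until `seconds` have passed."""
    time.sleep(seconds)

def cpu_seconds_used(wait, seconds):
    """CPU time (not wall-clock time) this process spends inside `wait`."""
    start = time.process_time()
    wait(seconds)
    return time.process_time() - start

if __name__ == "__main__":
    print("busy wait CPU seconds:", cpu_seconds_used(busy_wait, 0.5))
    print("sleep CPU seconds:    ", cpu_seconds_used(blocking_wait, 0.5))
```

Both waits take the same wall-clock time. Only the first one keeps a core pegged while it waits, which is the usage the CPU limit on the resource pool is capping.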

Case #2:  vCenter SQL Server Memory Limits

Due to a feature in vCenter 4.0U1 and ESX 3.5 Hosts, when I increased the RAM on my vCenter dedicated SQL Server from 4G to 8G, a memory limit of 4G was set. When I would go onto the SQL instance, SQL Server.exe would only be using about 3600 Megs yet all 8G was consumed/used. This screamed to me an issue with the OS instance. After close to 10 days of beating my head and not understanding why my brand new vCenter 4.0U1 system was running so poorly, a co-worker with a fresh set of eyes noticed this setting on the SQL Server instance.

Solution #2:  Memory limits bad

This is obvious. We disabled the limit and the SQL Server performance went through the roof instantly. We simply couldn’t easily tell that the balloon driver was using 4G of RAM, as it wasn’t a process. Nobody noticed the ballooning happening.
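
The arithmetic behind Case #2 is simple once you know the limit is there. A rough Python sketch using the sizes from this post; the function name is hypothetical, and it assumes the worst case where the guest touches all of its configured memory.

```python
def memory_to_reclaim_mb(configured_mb, limit_mb=None):
    """Memory the host has to take back from the guest (ballooning and/or
    swapping) when the limit is lower than the configured RAM.
    Worst case only: assumes the guest is using everything it was given."""
    if limit_mb is None:  # no limit set on the VM
        return 0
    return max(0, configured_mb - limit_mb)

print(memory_to_reclaim_mb(8192, 4096))  # 4096: the SQL Server VM after the RAM bump
print(memory_to_reclaim_mb(8192))        # 0: after the limit was removed
```

That 4G shows up inside the guest as used memory that no process owns, which matches what we saw.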

At the end of the day there are pros and cons to having this level of capability. This is why I like ESX and the general approach of VMware: give you everything they can in terms of options, configurations and rope to hang yourself and two of your friends, while attempting to automate and hide as much of it as they can. The vendor will never know all the situations we, the people in the field, are going to run into, so let them give us all the options they can. Use that rope with caution.


New vmware.com HomePage Layout & View 4 is released

New vmware.com HomePage is now live.   I had an “anonymous internet tipster” give the heads up last night.   Looks good and a bit more sleek in fitting with the branding of the new vmware logo.

Along with that, View 4 is finally released. I’ve been playing with some beta bits for a while now and PCoIP is a pretty impressive catch-up to the ICA protocol. I’m looking forward to the mass quantity of comparisons that are going to come out now between ICA & PCoIP.

VI Client protects itself nicely

Ok.. Follow me on this one. 

I am connecting from a laptop via the View client to a Virtual Workstation running XP. On that workstation I launch the VI Client and open the console of that same Virtual Workstation.

VI Client Console

In the old days VirtualCenter would just lose its little mind and crash horribly or do some really funky things with feedback loops. I like it.

VMworld 2009 – Day 2 Wrapup

Day two at VMworld ended up being quite a bit more exciting than yesterday. The keynote by Steve Herrod was much more what I expected from the keynotes. He covered some of the “cool” stuff coming down the pipes in both the short term and longer term. The PCoIP demo showing Google Earth zooming up and down while connected to a machine in Portland, OR from the Moscone Center rocked. I want to have that to use while I’m sitting in a hotel room with its “blazing fast” speeds, attempting to do something useful on one of my machines at home, instead of using RDP with SSL.

I went to the IO DRS Tech Preview and got the same excitement I’ve had in previous years, where you know you’re seeing something innovative. Several of the other sessions I hit were really partner style presentations that did not say much. So a good 25/75 day for sessions which is pretty good.

Now that the Self Service Labs were finally working properly I gave a shot at the vCenter Orchestrator product offering.  The Lab was responsive and well documented.  It was pretty nice and really hinted at the power this system can offer for DataCenter Automation.  The theory is this is free with vSphere 4 so I’m going to have to really look into that and find out.

During my open time slots during the day I had some good meetings: with some VMware employees to discuss some of the vStorage & vCloud directions, with HP folks around OpenView and Virtualization tools, with AMD & Intel on their functionality futures, and with Hitachi around their multipathing technology for VMware (still no roadmap).

The Party was fun.   Foreigner still knows how to rock and I can actually climb rock walls.   The nice thing about the party this year is it was right at the Moscone Center.

A very productive and long day.

EA3196 – Virtualizing BlackBerry Enterprise on VMware

Once again.. another session I didn’t sign up for and had zero issues getting into.

To start off, RIM & VMware have been working together for 2 years and BES is officially supported on VMware. Together RIM & VMware have done numerous successful engagements running BES on VMware. The interesting thing is RIM has been running their own BES on VMware for over 3 years now.

Today the BES best practice is no more than 1k users per server, and BES is not very multi-core friendly. It is not cluster aware, nor does it have any HA built in. The new 5.0 version of BES is coming with some high availability via replication at the application layer. One thing that has been seen in various engagements is that if you put the BES servers on the same VMware hosts as virtualized Exchange, there are noticeable performance improvements.

The support options for BES do clearly state that they support on VMware ESX.  

One of the big reasons to virtualize BES is that, since it cannot use multiple cores effectively, it can only use a fraction of today’s big 32-core boxes. By virtualizing, BES can get significant consolidation. On top of that, BES gets all the usual advantages of running virtual, such as Test/Dev deployments, server consolidation, HA, etc. Things that are well known and talked about already.

BES encourages template use to do rapid deployments. The only gotcha is what your company policies and rules allow; templates can potentially save quite a bit of time. This presentation is really trying to show how to use VMware/Virtualization with BES for change management improvements, server maintenance, HA, component failures and other base vSphere technologies. VMware is looking towards using Fault Tolerance for their own BES servers.

BES is often not considered Tier 1 for DR events, even though email is often the biggest thing needed to get communications going again after a DR event. The reason generally seen is the complexity and cost of DR.

The performance testing with the Alliance Team from VMware has been done successfully numerous times over the past couple of years. They have done testing at both RIM & VMware offices. The main goal of these efforts was to generate white papers and reference architectures that are known to work. The testing used Exchange LoadGen & the PERK load driver (BES testing driver). Part of this is how to scale out with more VMs, as the scale up is known.

The hardware was 8 CPUs (Intel E5450, 3 GHz), 16G RAM and a NetApp FAS3020, on vSphere 4 & BES 4.1.6. In the 2k user test with 2 Exchange systems, the results were 23% CPU utilization on 2 vCPU BES VMs. Latency numbers were under 10 ms. Nothing majorly wrong was seen in the testing metrics. Going from ESX 3.5 to vSphere 4 was a 10-15% CPU reduction in the same workload tests. Adding in Hardware Assist for Memory saw what looks like another 3-5% reduction in CPU usage. In their high load testing, doing a VMotion caused a small hiccup of about a 10% increase in CPU utilization during the cut over period. This is well within the capacity available on the host and in the Guest OS.

Their recommendation is to do no more than 2k users on a 2 vCPU VM. If you need more, then add more VMs. BES scales and performs well in this scale out architecture. Be sure you give the storage the number of spindles needed, the standard statement when talking about virtualization management.
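
The scale out rule above turns into a one-line sizing calculation. A minimal Python sketch; the function and constant names are hypothetical, and the user counts are just sample inputs.

```python
import math

USERS_PER_BES_VM = 2000  # recommendation quoted in the session, per 2 vCPU VM

def bes_vms_needed(users, users_per_vm=USERS_PER_BES_VM):
    """Number of production BES VMs when scaling out instead of up."""
    if users < 0:
        raise ValueError("users must not be negative")
    return math.ceil(users / users_per_vm)

for users in (1500, 2000, 2001, 6500):
    print(users, "users ->", bes_vms_needed(users), "BES VM(s)")
```

This only counts the production BES VMs. Standby, attachment and database VMs would come on top of that.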

 The presenter then went into a couple of reference architecture designs.  Small Business & Enterprise with a couple different varieties. 

BES @ VMware.   3 physical locations, 6,500 Exchange users.   1k of them have 5G mailboxes and the default for the rest are 2G.   BES has become pretty common.   They run Exchange 2007 & Windows 2003 for AD & the Guest OS.   Looks fairly straight forward. 

4 prod BES VMs, 1 standby BES VM, 1 attachment BES VM and 1 dedicated BES database VM. This is done on 7 physical servers, with 40 additional VM workloads on the cluster.

TA3461 – IO DRS: Tech Preview for VM Performance Isolation

This is a very new area of research at VMware, only about 2 years old. Since this is a Tech Preview, it has no roadmap for when it will be available.

The Problem:

Many different workloads hit the same set of disks/arrays/spindles etc.   Low priority processes that run ad-hoc or other times will cause higher priority systems to experience an impact.  What you want to see is that the low priority VM gets less performance than the higher priority systems.   The question is how you can do this?

A solution:  Resource Controls

Assign shares for disk performance, just like the CPU/Memory shares of the original ESX days. A host with a higher share total gets higher priority on that shared VMFS volume.

To configure this you’d go into the VM and set the shares.   Fairly straight forward.   The setting is shares and then the limiting factor is IOPS.   Interesting idea. 

The first case study covers two separate hosts running the same workload levels. With IO DRS turned off, both VMs ran at 20 ms & 1500 IOPS. With it turned on, there was a pretty significant difference in IOPS & latency: the latency changed to 16 ms and 31 ms, with a similar spread for IOPS. Nice..

Case study two is a more serious one with SQL Server running. The shares were 4:1, but the performance ratios did not match that. What they are seeing is that load time matters significantly. Overall throughput is working right and good, though the loads make a big difference.

The demo showed changing the shares and the IOPS limit on the fly while the IOMeter machines adjusted immediately. When limiting the IOPS on one system, the other systems picked up the slack and got more performance.

After showing the demo the presenters asked if anyone in the packed room (and I do mean PACKED) would find value in this. Everyone immediately raised their hands.

The tech approach is first to detect congestion. If latency is above a threshold, IO DRS is triggered. If it isn’t borked, don’t fix it. IO DRS works by controlling the IOs issued per host. The sum of the shares of the VMs on a host with IO DRS enabled is compared with other hosts to determine share priority. So first the host is picked, then the VMs’ shares on that host are prioritized, and then it is back to the host comparison. The share control goes against all hosts using that same VMFS volume.

IO slots are filled based on the shares on each host. There are only so many IO slots per host. This is how the IOs are controlled when shares are enforced under congestion.
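
No algorithm details were given beyond the description above, so the following Python sketch is only one reading of it, not how IO DRS is actually implemented. It assumes a latency threshold that switches share enforcement on, a fixed pool of IO slots for the shared volume, and a proportional split first across hosts and then across the VMs on each host. All names and numbers are hypothetical.

```python
def split_by_shares(shares, slots):
    """Divide `slots` proportionally to each entry's shares (rounded down)."""
    total = sum(shares.values())
    return {name: slots * value // total for name, value in shares.items()}

def io_slots_per_vm(hosts, array_slots, latency_ms, threshold_ms):
    """`hosts` maps host name -> {vm name: disk shares}.

    Returns None when latency is at or below the threshold (no congestion,
    so shares are not enforced); otherwise the slots per VM on each host.
    """
    if latency_ms <= threshold_ms:
        return None
    # Step 1: each host's weight is the sum of the shares of its VMs.
    host_totals = {host: sum(vms.values()) for host, vms in hosts.items()}
    host_slots = split_by_shares(host_totals, array_slots)
    # Step 2: within a host, its slots are split across that host's VMs.
    return {host: split_by_shares(vms, host_slots[host])
            for host, vms in hosts.items()}

hosts = {
    "esx1": {"sql": 2000, "web": 1000},
    "esx2": {"batch": 500, "test": 500},
}
print(io_slots_per_vm(hosts, array_slots=64, latency_ms=20, threshold_ms=30))
print(io_slots_per_vm(hosts, array_slots=64, latency_ms=45, threshold_ms=30))
```

With these sample numbers the uncongested call returns None, and the congested call gives esx1 48 of the 64 slots (32 for sql, 16 for web) and esx2 the remaining 16 (8 each). Rounding down can leave a slot or two unassigned, which a real scheduler would have to deal with.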

There are two major performance metrics in the storage industry: Bandwidth (MB/s) and Throughput (IOPS). Each has its pros and cons. Bandwidth matters for workloads with large IO sizes and IOPS matters for workloads with lots of small IOs. IO DRS controls the array queue among the VMs. So if a VM has lots of small IOs, it can continue to do things and have high IOPS. Conversely, if it is doing large IOs, it will get high bandwidth and low IOPS using the same share control system.
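
The two metrics are tied together by the IO size: bandwidth is IOPS multiplied by the size of each IO. A small Python sketch of that relationship; the function name and the 4 KB / 64 KB sizes are just illustrative choices.

```python
def bandwidth_mb_s(iops, io_size_kb):
    """Bandwidth in MB/s for a given IO rate and IO size (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024

# The same 1500 IOPS looks very different depending on the IO size.
print(bandwidth_mb_s(1500, 4))   # small IOs:  5.859375 MB/s
print(bandwidth_mb_s(1500, 64))  # large IOs: 93.75 MB/s
```

This is why a control system based on IOs issued can serve both kinds of workload: a small-IO VM turns its slots into IOPS, a large-IO VM turns the same slots into bandwidth.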

Case studies and test runs have shown that Device level latency stays the same as workloads change.  Some tests have shown that with IO DRS IOPS can go up simply due to the workloads involved.  Control of the IOs allows all to work though depending on the workload a VM can accomplish more.  

The key understanding is that IO DRS really helps when there is congestion.   When things are good and latency is not high enough to trigger the system, the shares are not used.   If a high IO share system is not using its slots, they are reassigned to other VMs in the cluster. 

The gain overall is the ability to do performance isolation among VMs based on Disk IO.

In the future they are looking to tie this into more vStorage APIs and VMotions and Storage VMotions, IOP reservation potentially etc.

Rocking cool and can’t wait for this to come out.

VMworld 2009 – Keynote P4

vSphere is the basis of all the improvements and technology over the years. The keynote compared it to the Software Mainframe (for those of you over 40) and the Cloud (for the under 40 crowd), and decided the best idea is to call it The Giant Computer. The reason this all works is VMotion. It is the basis of all that has happened.

The reason for the success of VMotion is Maturity, Breadth, Automated Use.

Maturity of VMotion – Estimates (fun or not) put the total at around 360 million VMotions around the world since VMotion started. That is about 2 VMotions a second. VMotion is 6 years old. (Wow I feel old)
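
The "about 2 a second" figure checks out against the other two numbers. A quick Python calculation, assuming 365-day years and an even spread over the 6 years (which is certainly not how the real growth looked):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def vmotions_per_second(total_vmotions, years):
    """Average rate if the VMotions were spread evenly over the period."""
    return total_vmotions / (years * SECONDS_PER_YEAR)

print(round(vmotions_per_second(360_000_000, 6), 2))  # 1.9
```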

Breadth of VMotion – Storage  & Network VMotioning.   Across protocols and soon across Datacenters.   High performance computing systems are starting to look at using VMware.  

Automation of VMotion – DRS is the initial version that made this work.   DRS has been shown to average 96% of a perfect performance environment compared to a manually setup cluster in a perfect world.     Future will include IO DRS shares and configuration based on IOPS.    DPM allows for power optimization across the datacenter.   Or as has been said a Server Defrag capability.  

vSphere is still driving ahead.. more next post.

VMworld 2009 – Keynote P3

View also includes the Mobile Technology discussion. Mobile Technology is a longer term effort that is still working towards functionality. Someone from Visa Product Development is up on the stage. He sees this space as a huge area of innovation going forward. Current development is significantly complicated, so easing development is extremely interesting for Visa.

The Visa demo uses Windows Mobile on a developer version of a phone (kinda big) running an Atom CPU.   The presentation shows some alerting from Visa transactions and finding local ATMs.   The impressive zing is that the Visa demo application is actually an Android app running on the Atom CPU.   Wow.  

Next..

VMworld 2009 – Keynote P2

A major goal of the View initiative is to have the same image while providing the best experience possible at WAN, LAN and direct machine speeds. For WAN/LAN the solution will be PCoIP. The performance claims are very impressive, though no actual numbers were given. This protocol has shown some excellent capabilities over WAN connections.

The other piece for local machine usage is Employee Owned machines.   Hosted Virtualization is being highly developed.    Deals with Intel have gone the next step with Bare Metal Virtualization for Corporate owned machines.   

Demo of the Bare Metal Virtualization (type 1 hypervisor).   Direct3D works fairly well during the demo.   A presentation of OpenGL using the Google Earth demo over PCoIP and over the LAN was very nice.  The WAN demo back to Portland simply rocked.  

Wyse has an iPhone application that makes the iPhone act as a thin client connecting over PCoIP to the same virtual machine, which ruled. It was quick and effective to scroll around the screen and do what you normally would. Which is appropriate, having seen well over 2 out of 10 people here with iPhones, more than with Blackberrys.

More to come.. next post…