vSphere 4 – The Next Great Thing

It’s official. There’s about a zillion blog postings and news articles coming out about the next generation of ESX.

Watching the press conference yesterday, the one thing that really hit me is that this is the next game changer. Cloud computing has been stuck for years in lock-in approaches. Amazon EC2, Google App Engine and Microsoft Azure are environments you have to develop specifically for before your apps will work. The designs and setup have to be extremely customized to function. This is the main reason cloud computing hasn’t taken off full scale. No cross-vendor solutions. No ability to take my application out of the box, deploy it to the cloud system and have it just work. No way for me to honestly develop in house and then move it easily to the cloud.

vSphere 4 (or Cloud Infrastructure) is the first Cloud Computing solution I have seen that doesn’t lock you into a specific vendor. I can run Microsoft Azure on top of vSphere 4 (whoa!). It can run on my infrastructure of commodity parts (yeah.. that white box I have can be a cloud computing solution). I can set it up, test it, run it in my basement, then deploy it up to a cloud provider with more bang and power than I have. I can develop on Linux, SUSE, FreeBSD, Windows, or Solaris and have the whole thing packaged up as a deliverable tool. No middleman. This is a powerful concept. This is the game changer.

The reason Microsoft OSes took off in the early ’90s is that they were simply the easiest and most accessible development environment out there. Today Linux/Java/Web is taking most of that development energy by storm. It costs me nothing to develop solutions on those products. If I can set up a development environment of my own without having to pay thousands of dollars just to get started with the tools, I can make the next Facebook/Twitter/eBay. I don’t have to be a corporation to develop a solution.

VMware gets that and Paul Maritz was a key component of that understanding at Microsoft. Welcome to the next great thing.

Cisco joins the Server Market

Today’s big news:

Cisco has come out with a one-stop, full-solution rack that puts CPU, Disk & Network in one package, using the best virtualization technology for Storage, Server & Networking. Tight VMware integration, all Cisco hardware & lots of virtualization technology at 10G.

Over the past several years I’ve kinda figured that Cisco had lost their way. Floundering a bit. Now I know someone at Cisco has a brain. VMware has blazed the path into the Enterprise Datacenter in recent history. The gotcha has been trying to integrate with networking and storage. In theory, a fully integrated product line delivered as one solution resolves this.

So how do HP, IBM & Dell respond to this?

OpenSolaris 2008.11 & ESX (Security Part 2)

If you are using OpenSolaris NFS shares as datastores for ESX, you need to share out your ZFS filesystems with anon=0, since ESX wants to write to the NFS datastore as root (anon=0 maps anonymous requests, which by default include a remote root user, to UID 0).

zfs set sharenfs=anon=0 usbpool/virtuals

I wouldn’t mind having stuff like this if I could figure out how to properly get logging of the issues/connections in OpenSolaris. Anybody know how to increase logging for the NFS services in OpenSolaris?

Cisco’s now a full service stop

Cisco’s throwing their hat into the ring for server hardware. Now they can offer Server Hardware, Datacenter Experts (strong in Server Virtualization with their investments into VMware) and Networking Hardware. All they need now is to offer some Storage Virtualization Appliance on their Server Hardware and they can start to offer the whole datacenter in a shipping container.

Next Up:  Commodity Datacenters.   Everything the Cloud wishes it could be.

root == Bad

I’ve been reading a bunch of posts that have been covering the idea of “making another root account”.   I read this KB article from VMware when it first came out and said “how dumb, no thanks”.   I didn’t realize it would cause such a stir.

When I design new systems and deploy new applications and processes at my work, a large part of the discussion in my mind is how much support work will this new procedure cause.   Initial deployment is typically all a project planner thinks about.  That is a small cost in the overall picture from my experience.

Adding an additional root level account introduces the following support issues:

  • An account that has to be audited
  • An account that has to have a regular password update which means tracking that password
  • An account that needs to have the password distributed to various individuals
  • An attack vector that must be considered or contained

Anyone that immediately says these are trivial has never had to maintain this for thousands of accounts from the top down to the actual account. It might be simple to do once, but when you start adding this to every procedure you have, it adds up.

Now why would you need another root account?  I’m doing everything I can to get rid of all usage of it.  sudo does 99.99% of everything I need to do with root level privileges in ESX.   If I could add a host into VirtualCenter using a user account instead of root I’d be happy to disable logins using root.  There is only one situation I can think of that I need root for and if the host is that screwed up, I’ll most likely be rebooting it anyways.

Not much use for root honestly.  Fight the power.   As a fellow blogger says so elegantly, “Just cause you can doesn’t mean you should”.

vSphere here we come

Ever since VMworld 2008 I’ve been waiting on the official word on what the new VI4 version name will be. I figured it’d be changed from VMware VI4, which was the latest name. Just wasn’t sure what it would change to.

Du Du DAAHAHAAAAA

VMware vSphere

This according to vmblog.com

Makes sense.   Just curious when the official announcement will come.

Powershell speed – Get-VM vs. Get-View -ViewType

I’ve been starting to look at using the VI Toolkit, which uses Powershell. In doing this, many of the command formats tend to be “Get-VM | Get-View” or “Get-VMHost | Get-View”. So I’m off and figuring this out and I run a small script and say “Geez that took a long time to run”. I’m talking to my co-worker (a pretty smart cookie) and he says “Why don’t you just use Get-View -ViewType VirtualMachine and skip the middle man?” Good point. Didn’t know about that command. Well this is just a tad bit faster.

Get-VM | Get-View timing in my script takes 1 minute and 37 seconds.

Get-View -ViewType VirtualMachine takes an amazing 5.12 seconds.

The VI Toolkit developers have identified this as a serious issue and are working on ways to speed this up and retain backwards compatibility.

So the lesson today: if you need to do a Get-View immediately after fetching a collection, look at using Get-View -ViewType instead. It isn’t as readable, but it gets the job done well.

Citrix vs VDI/View – Round 1

Recently my company has gone through this huge architectural discussion/debate around using Citrix versus using VDI. It has been rather entertaining to say the least. I’ve met with the architect that’s attempting to put together a comparison document on a direction to go.

Some background… We’ve worked to get Citrix to run our critical homegrown apps 3 times now. All because Citrix is going to save us some untold amount of money in the long run. Each attempt to Citrix-ize our homegrown apps has been completely and horribly unsuccessful. So great.. We have Citrix running some 2 dozen applications (still not the important ones) and still have to maintain a separate application deployment system for applications outside of Citrix (let’s talk about the management, labor and support effort of maintaining two separate systems). The current Citrix environment gets at least a dozen new tickets

When we attempted to get Citrix up and running the first time 4+ years ago we spent some 200 hours trying to get one homegrown business critical application up and running on Citrix. That failed horribly. We then looked at running this “Virtual Machine” concept with XP workstations, which was completely new at the time. It was this or putting some 200 desktops into our datacenter. Ewwww. We went and tried the Virtual Workstations out and had the application, entire environment and all systems up and running in about 100 hours of effort. It worked. The environment then grew organically as more and more teams heard about it and found that it just kept working. We are up around 1200 VDI instances (we call it Virtual Workstations) now.

Sooo.. Back to architectural discussion..

Citrix has some 12 pages in this document covering the things we’d have to fix, all the unknowns, the estimated effort to Citrix-ize all these apps, and the possible, maybe cost savings if we are lucky and can get some of these business critical applications up and working. Virtual Workstations has 1/2 a page that says basically.. “It works and will work for the foreseeable future.”

What is upper management thinking about doing?  Citrix.   Why?  Beats the tar out of me as the possible cost savings just don’t appear to be there.     *sigh*

Why the concept of VMotion during the day is OK for the enterprise

In an earlier post I mentioned Microsoft’s claim that the concept of VMotion/LiveMigration wasn’t that important since clients don’t change when they do work on hardware even with VMotion.

I’ll cover why this is such a flawed notion for a serious enterprise. I’ve also had an excellent comment that goes along with these same assumptions. To start off, I believe I would reach the same conclusions if I started from the same assumptions.

The assumption Microsoft makes is that the only time you ever work on anything is during these distinct change windows. So you don’t make any changes at all outside of these windows. You don’t optimize performance. You don’t replace failed hard drives in a RAID array. You don’t launch new web servers to enhance performance as load grows. You don’t turn off servers that are no longer in use. You don’t turn on new servers.

Now that’s obviously significantly silly for any serious enterprise. Any company with more than say 20 servers is going to be doing things all the time. You are introducing an increasing risk to your company the longer you wait to do any of those activities. You’ll follow ITIL with the change control rules. You’ll try to optimize your team’s efforts so the work is spread out and gets done with minimal or ideally no impact to the business (accidents happen). In doing this you’ll likely follow the ITIL categorization of different activities. This is a key thought here.

ITIL has the concept of Routine Maintenance in Service Operations. Any type of IT Operational process which has been performed enough times with near zero impact can be considered viable to perform at most any time. Every change, no matter how small, has risk. The objective is to weigh the risk of making the change against the impact of not making it. Your mileage will vary depending on your company and procedures and experiences obviously.

An example of a Process is the replacement of a failed hard drive in a RAID 5/6 Array. This has been done some 500 times by my team over the past year. We know the impact when we do this and if we don’t do this. We have had one issue, two years ago, when we replaced a hard drive and another drive failed just as the RAID rebuild started. My company has ruled that this activity has a low enough risk level that it can stay at Routine Maintenance and be done at any time we can do it (preferably as soon after notification of the issue as possible, as not doing this replacement introduces a new, increasing risk over time of complete system failure).

Often the only way to get a Process to Routine Maintenance is a promotion path:

  1. Do the change that you want to do during very strictly watched and managed change windows.
  2. Do this a couple dozen times.
  3. Get approval to start doing this change more loosely during more change windows.
  4. Do this a couple hundred times.
  5. If this has worked 100% of the time, submit this change to be considered Routine Maintenance. If approved, celebrate. If not, start over at step one and keep working at it if you believe it to be worth the effort.
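
The promotion path above can be sketched as a small decision function. This is only an illustration: the thresholds of 24 and 200 are my stand-ins for "a couple dozen" and "a couple hundred", the names are made up, the approval steps are not modelled, and treating any failure as a reset to step one is an assumption rather than an ITIL rule.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the post only says "a couple dozen" and
# "a couple hundred", so 24 and 200 are illustrative assumptions.
STRICT_RUNS_NEEDED = 24
LOOSE_RUNS_NEEDED = 200


@dataclass
class ChangeHistory:
    strict_window_runs: int = 0  # runs inside tightly managed change windows
    loose_window_runs: int = 0   # runs inside the looser change windows
    failures: int = 0            # runs that caused an impact


def next_step(history: ChangeHistory) -> str:
    """Return the next step on the promotion path for one type of change."""
    if history.failures > 0:
        # Assumption: without a 100% success record the change is not
        # eligible, so the track record has to be rebuilt from step 1.
        return "start over at step 1"
    if history.strict_window_runs < STRICT_RUNS_NEEDED:
        return "keep running in strict change windows"
    if history.loose_window_runs < LOOSE_RUNS_NEEDED:
        return "run in looser change windows"
    return "submit for Routine Maintenance"


print(next_step(ChangeHistory(strict_window_runs=30, loose_window_runs=50)))
```

The point of keeping the counts explicit is that the case for Routine Maintenance rests on a track record you can show to whoever approves it.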

So.. After all this build up.. Here’s the punch line. If I don’t have a technology available to me I can’t start doing it during change windows in the first place. My ultimate goal is to have technology available to me so it can prove its worth and become part of my day to day activities. Microsoft Marketing has turned this on its head and is approaching it from a self-serving viewpoint, since Microsoft doesn’t have LiveMigration yet. Going by its hypocritical history, when Microsoft has LiveMigration the Marketing will start saying they are “Helping Change the Way IT Does Its Business With LiveMigration”.

In my environment VMotion has been performed over 50,000 times in the past year. It is considered Routine Maintenance and is used on a daily basis to provide better service to my clients. It took me close to a year to get VMotion to Routine Maintenance and it was worth every bit of the effort. Because VMotion has such a fantastic history at my company, VMware DRS is now enabled too, and I no longer have to manage a cluster by hand since DRS handles quite a bit of the performance issues for my team.

So Yes, I do use VMotion daily at all hours. Yes, I would update my hypervisor layer with a critical patch like an equivalent of MS08-067, starting immediately upon emergency Change Authorization. Finally, Yes, it would be done faster than with a product like QuickMigration since I can do it without impacting my company’s services (no interruption to my network connections). This is my goal at the end of the day: maintain my company’s uptime running Windows Server while providing the most risk-free environment at the best cost possible.

So how important is LiveMigration/VMotion now?

One of Microsoft’s big marketing statements I’ve heard several times is that LiveMigration wasn’t that important since clients don’t change when they do work on hardware even with LiveMigration. I’ll cover in depth why this is a flawed thought for an enterprise company in a future blog entry.

Along comes a critical use case this past week. MS08-067 came out and threw most companies I know of into some serious chaos while they rolled this patch out ASAP. Now this one does impact any Windows OS including Server Core. Anyone that would be using Hyper-V would obviously be affected right now. Let’s walk through trying to deploy this for 120 Hyper-V hosts with Quick Migration (which causes a service interruption) as fast as humanly possible with business buy-off to do this ASAP outside of Maintenance Zones. Let’s assume we are talking about a patch that ONLY affects Virtualization Hosts (I know I know.. not realistic with Hyper-V, bear with me).

Hyper-V Scenario with Quick Migration:

Assumptions to setup:

  • Standard business Day is 6am to 10pm. So with the business units’ agreement we can do this patch from 10pm to 6am every day, which is 8 hours of work.
  • Applying the patch takes 30 mins including reboot and checkout time. Need to apply to both the primary and secondary Host since any real enterprise has HA set up and configured since it’s “Defined as Free by Microsoft” right?
  • Each Hyper-V Host has 20 Servers on it and there are 120 Host Pairs.
  • There’s a fail over host available for each of these 120 Hosts (Let’s not talk about amazingly wasted resources. This is being generous to Microsoft here and we fail over only once each cluster.)
  • Each Host takes about 30 mins to “Quick Migrate” all 20 Server Virtual Machines and one person can do 4 hosts at a time without incurring other unplanned outages.
  • All Hands on Deck for part of our team and some folks are awake during normal work hours for support. Let’s say reasonably 6 people are working on this.

6 people * 4 Hosts per person gives us roughly 24 hosts, and their fail-over pairs, getting updated in each pass of an hour to an hour and a half. 240 hosts divided by 24 gives us 10 passes, so at least 10 hours (closer to 15 at an hour and a half per pass) to do all these migrations at a rush, with patch start times staggered by about 7 mins for each server. That stagger is not unreasonable considering connect time to console and login times. This also assumes each person is perfect in their execution.
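
To make the arithmetic easy to rerun with your own numbers, here is a rough sketch of the estimate. It follows the same 240 / 24 division as above and tries both one hour and an hour and a half per pass. The function and constant names are mine, and it ignores staggering, mistakes and failed patches.

```python
import math

PEOPLE = 6
HOSTS_PER_PERSON = 4   # hosts one person handles at a time
TOTAL_HOSTS = 240      # 120 hosts plus their 120 fail-over partners
WINDOW_HOURS = 8       # 10pm to 6am


def rollout_estimate(hours_per_pass: float) -> tuple[int, float, int]:
    """Return (passes, total hours, nights needed) for a given pass duration."""
    hosts_per_pass = PEOPLE * HOSTS_PER_PERSON
    passes = math.ceil(TOTAL_HOSTS / hosts_per_pass)
    total_hours = passes * hours_per_pass
    nights = math.ceil(total_hours / WINDOW_HOURS)
    return passes, total_hours, nights


for hours_per_pass in (1.0, 1.5):
    passes, total, nights = rollout_estimate(hours_per_pass)
    print(f"{hours_per_pass}h per pass: {passes} passes, {total}h, {nights} night(s)")
```

Either way the work does not fit into a single 8 hour window, so the exposure to the vulnerability stretches across at least two nights.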

This doesn’t take into account the issues with the business units that are dependent on your services:

  • Apps that don’t work right with a Quick Migration and don’t check out right
  • Hit to your team’s morale.
  • Hit to your team’s reputation for using this Virtualization Solution.
  • People aren’t perfect and make mistakes, patches don’t always apply right.

VMware Scenario with DRS & VMotion:

  • Put the Host that needs the patch into Maintenance Mode.   If the cluster is large enough do two at the same time.
  • Apply the patch to the Host and reboot it.
  • Check the Host out and Release it for usage.  Take it out of Maintenance Mode.
  • Repeat until every Host is finished.
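
As a rough sketch, the loop above looks something like this. The four callbacks are hypothetical placeholders for whatever tooling you use and are not VMware API calls. Leaving a host that fails checkout in Maintenance Mode is my own choice for the example.

```python
from typing import Callable, Iterable, List


def rolling_patch(
    hosts: Iterable[str],
    enter_maintenance: Callable[[str], None],
    apply_patch_and_reboot: Callable[[str], None],
    check_out: Callable[[str], bool],
    exit_maintenance: Callable[[str], None],
    batch_size: int = 1,
) -> List[str]:
    """Patch hosts one batch at a time; return the hosts that passed checkout."""
    pending = list(hosts)
    patched: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            enter_maintenance(host)  # the VMs get moved off the host here
        for host in batch:
            apply_patch_and_reboot(host)
        for host in batch:
            if not check_out(host):
                continue  # failed host stays in Maintenance Mode for a human
            exit_maintenance(host)
            patched.append(host)
    return patched


# Dry run with stub callbacks that only record what would happen.
log = []
done = rolling_patch(
    ["esx01", "esx02", "esx03"],
    enter_maintenance=lambda h: log.append(("enter", h)),
    apply_patch_and_reboot=lambda h: log.append(("patch", h)),
    check_out=lambda h: True,
    exit_maintenance=lambda h: log.append(("exit", h)),
    batch_size=2,  # "if the cluster is large enough do two at the same time"
)
print(done)
```

Nothing in the loop needs a maintenance window, because the Servers have already been moved off each host before it is touched.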

I have been able to do a full rushed patch deployment like this in my environment with an average of 30 Servers per Host in about 6 hours by myself.

We would start this patch application immediately upon notification since VMotion does not cause a network outage or service interruption.   The window of potential infection is incredibly small at this point as I don’t wait for a maintenance zone and start the update immediately on the Hosts.

So the question for a real enterprise is: how much is this worth? For me it’s pretty obviously worth it. No downtime. No service impact. Just a continuously available service for my clients who don’t have to care about the latest patch.