Why the concept of VMotion during the day is OK for the enterprise

In an earlier post I covered the argument that the

concept of VMotion/LiveMigration wasn’t that important since clients don’t change when they do work on hardware even with VMotion

I’ll cover why this is such a flawed notion for a serious enterprise. I’ve also had an excellent comment that goes along with these same assumptions. To start off, I believe I would reach the same conclusions if I started from the same assumptions.

The assumption Microsoft makes is that the only time you ever work on anything is during distinct change windows. So you don’t make any changes at all outside of these windows. You don’t optimize performance. You don’t replace failed hard drives in a RAID array. You don’t launch new web servers to enhance performance as load grows. You don’t turn off servers that are no longer in use. You don’t turn on new servers.

Now that’s obviously silly for any serious enterprise. Any company with more than, say, 20 servers is going to be doing things all the time. The longer you wait to do any of those activities, the more risk you introduce to your company. You’ll follow ITIL with its change control rules. You’ll try to spread work out so your team’s efforts go as far as possible with minimal or, ideally, no impact to the business (accidents happen). In doing this you’ll likely follow the ITIL categorization of different activities. This is the key thought here.

ITIL has the concept of Routine Maintenance in Service Operations. Any IT operational process that has been performed enough times with near-zero impact can be considered viable to perform at almost any time. Every change, no matter how small, has risk. The objective is to weigh the risk of making the change against the impact of not making it. Your mileage will vary depending on your company, procedures and experience, obviously.

An example of such a Process is the replacement of a failed hard drive in a RAID 5/6 array. My team has done this some 500 times over the past year. We know the impact when we do this and when we don’t. We had one issue, two years ago, when a second drive failed just as the RAID rebuild started after we replaced the first. My company has ruled that this activity carries a low enough risk that it can stay at Routine Maintenance and be done any time we can do it (preferably as soon after notification of the issue as possible, since not replacing the drive introduces a risk of complete system failure that increases over time).

Often the only way to get a Process to Routine Maintenance is a promotion path:

  1. Do the change that you want to do during very strictly watched and managed change windows.
  2. Do this a couple dozen times.
  3. Get approval to start doing this change more loosely during more change windows.
  4. Do this a couple hundred times.
  5. If this has worked 100% of the time, submit this change to be considered Routine Maintenance.  If approved, celebrate.  If not, start over at step one and keep working at it if you believe it to be worth the effort.
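
For anyone who thinks better in code, the promotion path above can be sketched roughly as a small state machine. This is only an illustration under my own assumptions, not how any ITIL tool actually works: the names `Stage`, `ChangeProcess`, `record_run` and `review` are made up, the thresholds of 24 and 200 stand in for “a couple dozen” and “a couple hundred”, and I’m assuming that any failed run sends the process back to step one.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    STRICT_WINDOW = 1   # steps 1-2: tightly watched change windows
    LOOSE_WINDOW = 2    # steps 3-4: more windows, looser oversight
    ROUTINE = 3         # step 5: approved as Routine Maintenance


# Assumed thresholds for "a couple dozen" and "a couple hundred" runs.
REQUIRED_RUNS = {Stage.STRICT_WINDOW: 24, Stage.LOOSE_WINDOW: 200}


@dataclass
class ChangeProcess:
    name: str
    stage: Stage = Stage.STRICT_WINDOW
    clean_runs: int = 0  # consecutive successful runs at the current stage

    def record_run(self, success: bool) -> None:
        """Log one execution. Assumption: any failure restarts at step one."""
        if success:
            self.clean_runs += 1
        else:
            self.stage = Stage.STRICT_WINDOW
            self.clean_runs = 0

    def eligible_for_promotion(self) -> bool:
        """True once enough clean runs have accumulated at this stage."""
        needed = REQUIRED_RUNS.get(self.stage)  # None once already ROUTINE
        return needed is not None and self.clean_runs >= needed

    def review(self, approved: bool) -> Stage:
        """Apply the change board's decision on a promotion request."""
        if not self.eligible_for_promotion():
            return self.stage  # nothing to review yet
        if approved:
            self.stage = Stage(self.stage.value + 1)
        elif self.stage is Stage.LOOSE_WINDOW:
            # Rejected at step 5: start over at step one.
            self.stage = Stage.STRICT_WINDOW
        # Assumption: a rejection at step 3 keeps the stage and just
        # restarts the count. Either way the run counter resets.
        self.clean_runs = 0
        return self.stage
```

The point of the sketch is that the only input that moves a Process forward is a track record, and you can’t build a track record with a technology you don’t have.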

So, after all this build-up, here’s the punch line. If I don’t have a technology available to me, I can’t start using it during change windows in the first place. My ultimate goal is to have the technology available so it can prove its worth and become part of my day-to-day activities. Microsoft Marketing has turned this on its head and is approaching it from a self-serving viewpoint, since Microsoft doesn’t have LiveMigration yet. Its hypocritical history says that once Microsoft has LiveMigration, Marketing will start saying they are “Helping Change the Way IT Does Its Business With LiveMigration”.

In my environment VMotion has been performed over 50,000 times in the past year. It is considered Routine Maintenance and is used on a daily basis to provide better service to my clients. It took me close to a year to get VMotion to Routine Maintenance, and it was worth every bit of the effort. Because VMotion has such a fantastic history at my company, VMware DRS is now enabled as well, and I no longer have to manage the cluster by hand since DRS handles quite a bit of the performance issues for my team.

So yes, I do use VMotion daily at all hours. Yes, I would update my hypervisor layer with a critical patch, the equivalent of an MS08-067, starting immediately upon emergency Change Authorization. Finally, yes, it would be done faster than with a product like QuickMigration, since I can do it without impacting my company’s services (no interruption to my network connections). This is my goal at the end of the day: maintain my company’s uptime running Windows Server while providing the most risk-free environment at the best cost possible.