Scale Up or Scale Out™
Duncan over at Yellow-Bricks.com brings up the great discussion once again. Every time a brand new piece of hardware comes out with more RAM possible or better, faster CPUs I have the “Scale Up or Scale Out™” Discussion with many people. I have this discussion every 9-12 months on average. We end up covering all sorts of criteria on what to compare and what is acceptable and what is not.
Our conversation usually goes something like this:
The hot new badness just came out and we need to order more hardware.
Awesome. So how much does this puppy have in it? RAM? CPUs? Slots for HBAs & NICs?
Did you know the new motherboard comes with 4 NICs now so our standard config can go from 4U to 2U and gobs of RAM with 6 core CPUs now.
Awesome! *pause* You know with that much RAM I can put 100 Win7 VDI systems on there. Umm.. What about when it goes down?
Oh.. Hrm. That wouldn’t be so good. ….
That being said we generally end up breaking it down to a couple of factors.
- What is the current capacity configuration we run with today?
- What is our current pain points in CPU, Memory, Network or Storage?
- Is there any new architecture changes coming that will impact this design? Is there a new switch fabric that needs to be plugged into? Is there changes to storage that need to be addressed?
- How much does this new hardware configuration cost?
- How will this change affect DRS’s Chaos Theory? The more hosts, the more DRS can do for you due to Chaos theory.
- What is our Risk level for number of eggs in a single basket?
The point is most corporation’s environments aren’t starting from scratch. In my case we have a known configuration today to use as a baseline and adjust the environment and design every hardware order to make it better.
In our most recent order we had this discussion all over again. This time we had some architectural changes needed to prevent some false positive HA events from happening in a 2 time a year strange events. So we are going to a 3 switch connectivity solution to enable network beacon for NIC teamed connections. We started with the following information:
- Baseline: HP DL585 G5, 4 sockets w/ quad cores, 128G of RAM, 3 Dual 1G NICs, 2 Emulex LPe11000 HBAs
- Cluster: 10 Host Clusters with ~30 per Host in Servers and ~65 Workstations per Host in View
- Pain Points: CPU starvation, Licensing Issues with 10 Host sized clusters
- Risk Level: Politically we are getting pretty touchy about more than 30 Servers going down in a single blow even if HA works on bringing them up in under 15 mins automatically.
We compared 3 different models of newer, faster, badder and more wicked hardware from HP since the DL585 G5s are not really on the manufacturing line anymore. So we looked at the BL495cG6, DL585 G6, DL385 G6 and DL580 G6.
DL585 G6:
- Pros
- Proven and comfortable AMD based stable platform with a good price/performance cost.
- Gain more CPU resources with the additional 2 cores per socket. 6 core systems.
- Can build 5 Host clusters to address licensing issues. Issues with HA support for the density involved.
- Cons
- Same Risk Level as before.
Push
- Same architectural solution today with maybe another NIC card to enable the NIC Beaconing
DL580 G5:
- Pros
- Fastest individual cores out there. Lots of good press about the Intel.
- Should get better CPU resources with higher performing CPUs.
- Can build 5 Host clusters to address licensing issues. Issues with HA support for the density involved.
- Cons
- Significant premium in cost for speed. See easily a 25% premium for a 10% faster performance.
- Same Risk Level as before.
- Push
- Same architectural solution today with maybe another NIC card to enable the NIC Beaconing.
DL385 G6:
- Pros
- Lowers the risk level without lowering performance
- Best price/performance cost for 6 core systems
- Has enough slots to move to the newer network layout to enable NIC Beaconing
- Gain more CPU resources with the additional 2 cores per socket. 6 core systems.
- Put 64G of RAM into them and build 5 host clusters for licensing problematic applications.
- Cons
- More physical hosts to deal with (cabling, power, rack space, cooling, management)
BL495c G6:
- Pros
- Blades reduce the amount of cabling
- Gain more CPU resources with the additional 2 cores per socket. 6 core systems
- Cons
- Firmware Management is an issue
- Increases our Risk Level with more eggs in the same basket unless we get multiple chassis to spread the blades across
- New solution from the ground up running ESX on blades
- Not ready to support Flex10 and because of this we have limited NIC capabilities to fit our requirements
We decided to go with the DL385 G6s based on these criteria. We will dedicate a specific 5 Host cluster for problem children applications with licensing issues. The RAM size of the hosts will limit the number of VMs we can end up putting in a cluster which addresses the Risk Level of number of VMs per Host. We are still way ahead of the game using VMware so having to have a couple more physicals for all these improvements is not an issue.
In your company or solution something else may be more appropriate. The key in an ongoing improvement mentality is have things you can measure and then criteria on what to change along with why. There is no one size fits all answer which is why VMware works so well for so many different folks. We don’t have to change how we do things to gain a lot of flexibility in the Datacenter while not changing how we ultimately end up managing these systems.
In: VMware · Tagged with: capacity, hypervisor, ScaleOut, ScaleUp, VMware

on 21 March 2010 at 6:49 am
Permalink
Hi Ian,
You put fimware management is an issue with BL495, can you elaborate?
I’ve read Duncan’s post as well, my though is that since virtualization came into the game, the notion of ‘all eggs in a single basket’ is blured, the boudaries are dynamic and expanding out (vs stretching in). Yesterday you would be flamed down if you say that you run 3 VMs on a single host. Today people would say that’s a waste of resources.
We could get a math geek in and compute the probability, my bet is that scaling out at the same time you scale up decreases significantly the risk associated with all in one basket, don’t you ?
on 21 March 2010 at 9:07 am
Permalink
1 think you may need to compare with are the HA capability. Generally before virtualization in place, all the standalone machine do not entitle the cluster features which allow them to prevent hardware failure. A part replacement for standalone system could take up few hours easily before it could up and running again. Even you have 30 VM down at 1 time, and fully restarted within next 15 mins, you are still provide a better SLA to the business. If they need more uptime, you can upsell the FT, or operating system cluster on top of the OS layer in the virtual machine. Just my 2 cents.
on 22 March 2010 at 9:30 am
Permalink
@PiroNet Firmware management on the blade chassis is a major challenge if you have a variety of systems running on there, say 8 or 16 or 32 depending on your blades in it. In order to update the Onboard Administrator or Virtual Connect Firmwares you must first update the BIOS on all of the blades, then you can update the iLOs and then the OA and then the VC. In an enterprise environment, you can’t easily get a single maintenance zone to take down 8 or 32 systems so this can take months at best with scheduling and organization. So you encounter an issue with the VC or OA (which has happened several times to me in the past year) which is fixed in an update of the firmware. Yet that update doesn’t support the iLO 1.70 that I’m running currently and if I update that iLO to 1.78 for supported version of the OA that I’m going to fix the issue in VC, I have known issue with the BIOS version so I have to update that. It is a complete domino system that is very difficult to work with.
Now if I ran VMware on all the blades in a single chassis it wouldn’t be nearly as big of a deal as long as I design to prevent too many eggs in a single basket. Regardless of how redundant something is, it will go down. Not a matter of if, just a matter of when. Hopefully you never see that day, yet when that day comes are you willing to take the fall for that design? So to do a serious VMware cluster you want to spread a given cluster across multiple chassis at a bare minimum. That’s not exactly cheap at the end of the day unless your buying 16 BL495s at the get go and don’t mind having the risk of 1/2 of an 8 host cluster going down at any given time. This would address the single point of failure discussion.
The “Eggs in a Basket” is not blurred. It is still very clear. Virtualization does not suddenly add more capabilities to the guest OSes. Just because it is virtual doesn’t mean you don’t have single points of failure. The virtual is just another layer in the discussion. VMs run in a single location at any given time. They are the point of impact for an IaaS environment concept.
My math and statistics have shown so far that scaling out does reduce our overall system impact at time of outage event. It does cost more though in the grand scheme if you don’t calculate in the outage impact to your business or clients. That can often even the cost discussion pretty quickly when a given VM is down for more than 15 mins cause it takes 30 mins to come or something and that runs the business $200k/hour in lost sales. Then well.. having 2 more hosts out there instead of 1 big host, sure can make that discussion moot quickly.
on 22 March 2010 at 9:37 am
Permalink
@Craig HA just means you can return to service automatically and hopefully faster. We do provide better uptime with VMware products. We have proven this time and time again. This however doesn’t negate the level of overall impact when I loose a host with 60 VMs on it. It takes time to bring them back up, to get the applications working properly again etc. I would love to upsell FT and will for some apps once it can support 2 vCPUs at a time.
As for adding a tack on Operating System Cluster on top of that actually has caused more outages than it helps for most of these apps. It adds extra complexity to a running application for what? Return to service faster? VMware HA is what we use for that.
This is often resolved by finding the reason a given application can not handle a hard crash and working on getting that fixed so it can restart cleaner in a crash state.
The thing to think about is when 60 business facing critical applications go down at one time, that’s not just a single Severity Priority 1 ticket, its 60. That’s a huge hit on the reputation. Not a good thing. We can stomach 30 a little easier. Especially when you talk about the fact that I saved $500k by sticking all these critical apps into VMware. We can do the numbers and compare outage costs for these applications versus the capital and operational costs of the Infrastructure. If an app costs the company $100k/hour in lost sales and we only saved $20k per year…. Not a really good trade off if it goes down. We have to balance the impact.
on 24 March 2010 at 8:50 am
Permalink
[...] ages-old discussion of scale up vs. scale out is revisited again in this blog post. I guess the key takeaway for me is the reminder that while VMware HA does restart workloads [...]
on 3 April 2010 at 2:17 am
Permalink
[...] caught Steve’s eye: The ages-old discussion of scale up vs. scale out is revisited again in this blog post. I guess the key takeaway for me is the reminder that while VMware HA does restart workloads [...]
on 6 April 2010 at 4:30 pm
Permalink
[...] caught Steve’s eye: The ages-old discussion of scale up vs. scale out is revisited again in this blog post. I guess the key takeaway for me is the reminder that while VMware HA does restart workloads [...]
on 16 April 2010 at 6:42 am
Permalink
[...] scaling up due to the release of the new Intel 5600 series that has six cores. Which set off a blog posting by Ian Koenig at http://itsjustanotherlayer.com/ titled scale up or scale out in which he brings up [...]