Scale Up or Scale Out™
Duncan over at Yellow-Bricks.com brings up this great discussion once again. Every time a brand new piece of hardware comes out with more RAM capacity or better, faster CPUs, I have the "Scale Up or Scale Out™" discussion with many people, on average every 9 to 12 months. We end up covering all sorts of criteria: what to compare, what is acceptable, and what is not.
Our conversation usually goes something like this:
The hot new badness just came out and we need to order more hardware.
Awesome. So how much does this puppy have in it? RAM? CPUs? Slots for HBAs & NICs?
Did you know the new motherboard comes with 4 NICs now? Our standard config can go from 4U to 2U, with gobs of RAM and 6-core CPUs.
Awesome! *pause* You know, with that much RAM I can put 100 Win7 VDI systems on there. Umm... what about when it goes down?
Oh.. Hrm. That wouldn't be so good. ....
That being said, we generally end up breaking it down to a handful of factors:
- What is the current capacity configuration we run with today?
- What are our current pain points in CPU, memory, network, or storage?
- Are there any new architecture changes coming that will impact this design? Is there a new switch fabric we need to plug into? Are there changes to storage that need to be addressed?
- How much does this new hardware configuration cost?
- How will this change affect DRS's Chaos Theory? The more hosts in a cluster, the more options DRS has for placing and balancing workloads.
- What is our risk level for the number of eggs in a single basket? (A rough sketch of this trade-off follows this list.)
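To make those last two factors less hand-wavy, here is a minimal back-of-the-napkin sketch in Python. The numbers (300 VMs of total load and the candidate host counts) are made-up placeholders, not our actual figures; the point is simply that for a fixed amount of work, fewer and bigger hosts mean a bigger blast radius per failure and a larger slice of the cluster held back for N+1, while more and smaller hosts give HA and DRS more room to maneuver.

```python
# Toy comparison of "scale up" vs "scale out" for a fixed workload.
# All inputs are illustrative placeholders, not real capacity data.

def cluster_profile(total_vms, hosts_per_cluster):
    """Per-host failure impact and the N+1 capacity held in reserve."""
    vms_per_host = total_vms / hosts_per_cluster
    blast_radius = vms_per_host          # VMs down until HA restarts them
    ha_overhead = 1 / hosts_per_cluster  # cluster fraction reserved to absorb one host failure
    return blast_radius, ha_overhead

for hosts in (5, 10, 20):
    radius, overhead = cluster_profile(total_vms=300, hosts_per_cluster=hosts)
    print(f"{hosts:>2} hosts: ~{radius:.0f} VMs hit per host failure, "
          f"{overhead:.0%} of capacity reserved for N+1")
```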
The point is that most corporate environments aren't starting from scratch. In my case, we have a known configuration today to use as a baseline, and with every hardware order we adjust the environment and the design to make it better.
In our most recent order we had this discussion all over again. This time we needed some architectural changes to prevent the false-positive HA events that were cropping up during strange network events a couple of times a year, so we are going to a three-switch connectivity solution to enable beacon probing for our NIC-teamed connections.
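As a quick aside on why that design wants three uplinks across three switches: the snippet below is only a toy Python illustration of the general idea behind beacon probing, not VMware's actual algorithm or configuration. Each uplink sends beacons and listens for the others; the uplink that goes quiet while its peers still hear each other is the one that gets flagged, and with only two uplinks there is no majority to break the tie.

```python
# Toy model of beacon-probing failure detection (illustrative only).
# heard maps each uplink to the set of peers it still receives beacons from.

def suspect_uplinks(heard):
    """Flag uplinks that hear fewer peers than the best-connected uplink."""
    best = max(len(peers) for peers in heard.values())
    return {nic for nic, peers in heard.items() if len(peers) < best}

# Three uplinks, the path behind vmnic2 is broken: vmnic0 and vmnic1 still
# hear each other, vmnic2 hears nothing, so it is singled out.
print(suspect_uplinks({"vmnic0": {"vmnic1"}, "vmnic1": {"vmnic0"}, "vmnic2": set()}))
# -> {'vmnic2'}

# With only two uplinks, a broken path leaves both sides hearing silence,
# and there is no way to tell which uplink actually failed.
print(suspect_uplinks({"vmnic0": set(), "vmnic1": set()}))
# -> set()
```

We started with the following information: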
- Baseline: HP DL585 G5, 4 sockets with quad-core CPUs, 128 GB of RAM, 3 dual-port 1 GbE NICs, 2 Emulex LPe11000 HBAs
- Clusters: 10-host clusters with ~30 server VMs per host, or ~65 workstations per host in View (see the quick math after this list)
- Pain points: CPU starvation, and licensing issues with 10-host clusters
- Risk level: Politically, we are getting pretty touchy about more than 30 servers going down in a single blow, even if HA brings them back up automatically in under 15 minutes.
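For a rough sense of what that baseline works out to per VM and per cluster, here is the quick math. This is just arithmetic on the figures above, ignoring memory overcommit, the service console, and virtualization overhead, so treat the per-VM numbers as ballpark rather than measured.

```python
# Back-of-the-napkin math on the DL585 G5 baseline above.
# Ignores memory overcommit and per-VM/host overhead, so these are rough ratios.

HOST_RAM_GB = 128
HOSTS_PER_CLUSTER = 10

server_vms_per_host = 30   # typical server cluster density
view_vms_per_host = 65     # typical View workstation density

print(f"Server clusters: ~{HOST_RAM_GB / server_vms_per_host:.1f} GB RAM per VM, "
      f"~{server_vms_per_host * HOSTS_PER_CLUSTER} VMs per 10-host cluster")
print(f"View clusters:   ~{HOST_RAM_GB / view_vms_per_host:.1f} GB RAM per desktop, "
      f"~{view_vms_per_host * HOSTS_PER_CLUSTER} desktops per 10-host cluster")
```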
We compared four different models of newer, faster, badder, and more wicked hardware from HP, since the DL585 G5s are not really on the manufacturing line anymore. We looked at the BL495c G6, DL585 G6, DL385 G6, and DL580 G5.
DL585 G6:
- Pros
- Proven, comfortable, and stable AMD-based platform with good price/performance.
- Gain more CPU resources with the additional 2 cores per socket (6-core CPUs).
- Can build 5-host clusters to address the licensing issues, though there are concerns about HA support for the density involved.
- Cons
- Same Risk Level as before.
- Push
- Same architectural solution as today, with maybe another NIC card to enable NIC beaconing
DL580 G5:
- Pros
- Fastest individual cores out there, with lots of good press about the Intel platform.
- Should get better CPU resources with the higher-performing CPUs.
- Can build 5-host clusters to address the licensing issues, though there are concerns about HA support for the density involved.
- Cons
- Significant price premium for the speed: easily a 25% premium for roughly 10% better performance.
- Same Risk Level as before.
- Push
- Same architectural solution as today, with maybe another NIC card to enable NIC beaconing.
DL385 G6:
- Pros
- Lowers the risk level without lowering performance
- Best price/performance for 6-core systems
- Has enough slots to move to the newer network layout and enable NIC beaconing
- Gain more CPU resources with the additional 2 cores per socket (6-core CPUs)
- Can put 64 GB of RAM into them and build 5-host clusters for applications with problematic licensing
- Cons
- More physical hosts to deal with (cabling, power, rack space, cooling, management)
BL495c G6:
- Pros
- Blades reduce the amount of cabling
- Gain more CPU resources with the additional 2 cores per socket (6-core CPUs)
- Cons
- Firmware management is an issue
- Increases our risk level by putting more eggs in the same basket, unless we buy multiple chassis to spread the blades across
- A brand new, ground-up solution for us: running ESX on blades
- We are not ready to support Flex10, and without it we have limited NIC capacity to meet our requirements
Based on these criteria, we decided to go with the DL385 G6s. We will dedicate a specific 5-host cluster to the problem-child applications with licensing issues. The RAM size of the hosts will limit the number of VMs we end up putting in a cluster, which addresses the risk level around the number of VMs per host. We are still way ahead of the game using VMware, so needing a few more physical hosts for all these improvements is not an issue.
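To put a rough number on how the smaller hosts cap the failure domain, here is the same back-of-the-napkin arithmetic applied to the 64 GB, 5-host licensing cluster. Reusing the ~4.3 GB-per-server-VM budget implied by the old DL585 G5 baseline is an assumption for illustration, not a measured figure.

```python
# Rough sizing sketch for the dedicated 64 GB DL385 G6 licensing cluster.
# The per-VM RAM budget is borrowed from the old baseline (128 GB / ~30 VMs)
# purely for illustration; real densities will vary by workload.

OLD_PER_VM_GB = 128 / 30      # ~4.3 GB per server VM on the DL585 G5 baseline
NEW_HOST_RAM_GB = 64
NEW_HOSTS_PER_CLUSTER = 5

vms_per_host = NEW_HOST_RAM_GB / OLD_PER_VM_GB
print(f"~{vms_per_host:.0f} server VMs per host, "
      f"~{vms_per_host * NEW_HOSTS_PER_CLUSTER:.0f} VMs per 5-host cluster, "
      f"and only ~{vms_per_host:.0f} VMs exposed to any single host failure")
```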
In your company or solution, something else may be more appropriate. The key to an ongoing-improvement mentality is having things you can measure, and then criteria for what to change and why. There is no one-size-fits-all answer, which is why VMware works so well for so many different folks. We gain a lot of flexibility in the datacenter without having to change how we ultimately end up managing these systems.