Ivan has recently written a couple of posts that have inspired me to put on paper some thoughts about broadcast domain sizing.
We all intuitively know that a "too big" broadcast domain is a problem. But how big is too big, and what are the relevant metrics?
There was a time when servers did lots of irresponsible broadcasting, but in my experience, stuff that's installed in today's virtual data centers is much better behaved than the stuff of a decade ago.
...and just what is a "broadcast storm" anyway? Most descriptions I've read describe something that's better categorized as a "bridging loop". If dozens or hundreds of servers are producing broadcast frames, I benevolently assume it's because the servers expect us to deliver those frames. Either that, or the servers are broken and should be fixed.
I have a background in supporting trading systems that regularly produce aggregate broadcast/multicast rates well in excess of 1Gb/s, and that background probably informs my thinking on this point. The use of broadcast and multicast suppression mechanisms seems heavy-handed. What is this traffic, exactly? Why is the server sending it? Why don't I want to deliver it? QoS is a much more responsible way to handle this problem if/when there really is one.
Whew, I'm well off track already!
The central point here is that I believe we talk about the wrong things when discussing the size and scope of our L2 networks. Here's why.
Subnet size is irrelevant.
I used to run a network that included a /16 server access LAN. The subnet held a shocking number of addresses, but didn't really need a full /16: a /18 would have worked, though a /19 would have been too small. The L2 topology consisted of Catalyst 2900XL switches (this was a while ago). Was it "too big"?
No. It worked perfectly.
There were only about 100 nodes on this network, and no server virtualization. Each node had around 100 IP addresses configured on its single NIC.
The scheme here was that each server had the potential to run each of 100 different services. I used the third octet to identify the service, and the fourth octet to identify the server. So service 25 on server 27 could be found at 10.0.25.27 (for example).
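To make the scheme concrete, here's a minimal sketch of how those addresses map out. The 10.0.0.0/16 base prefix is inferred from the 10.0.25.27 example, and the 100-service/100-server bounds are approximate, not exact figures from that network:

```python
import ipaddress

# Illustrative sketch of the addressing scheme described above:
# third octet = service number, fourth octet = server number.
BASE_NET = ipaddress.ip_network("10.0.0.0/16")  # assumed base prefix

def service_address(service: int, server: int) -> ipaddress.IPv4Address:
    """Return the IP for a given service on a given server."""
    assert 1 <= service <= 100 and 1 <= server <= 100  # illustrative bounds
    return BASE_NET.network_address + service * 256 + server

print(service_address(25, 27))  # -> 10.0.25.27

# Rough capacity check: ~100 servers x ~100 services = ~10,000 addresses.
needed = 100 * 100
for prefix in (16, 18, 19):
    usable = ipaddress.ip_network(f"10.0.0.0/{prefix}").num_addresses - 2
    print(f"/{prefix}: {usable} usable hosts -> {'fits' if usable >= needed else 'too small'}")
```

Running the capacity check confirms the earlier claim: a /18 (16,382 usable addresses) covers the roughly 10,000 addresses in use, while a /19 (8,190) does not.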
Broadcast frames in this environment were limited to the occasional ARP query.
Host count is irrelevant.
I expect to make a less convincing case here, but my point is that when talking about a virtualized data center, I don't believe we should care about how many (virtual) hosts (IP stacks?) share a VLAN.
The previous example used lots of IP addresses, but only had a small number of hosts present. Now I want to flip the proportions around. Let's imagine instead that we have 10,000 virtual machines in a VLAN, but are somehow able to run them all on a single pair of impossibly large servers.
Is there any problem now? Our STP topology is ridiculously small at just two ports. If the ESX hosts and vSwitches are able to handle the traffic pushed into them, why would we be inclined to say that this broadcast domain is "too big?"
Broadcast domain sizing overlooks the impact on shared resources.
Almost all discussions of broadcast domain sizing overlook the fact that we need to consider what one VLAN will do to another when they share common resources.
Obvious points of contention are 802.1Q trunks, which share bandwidth among all VLANs. Less obvious points of contention are shared ASICs and buffers within a switch. If you've ever noticed how one 100Mb/s server can hurt seven of its neighbors on a shared-ASIC switching architecture, you know what I'm getting at.
Splitting clients into different VLANs doesn't help if the switching capacity isn't there to back it up, but discussions of subnet sizing usually overlook this detail.
How are the edge ports configured?
Let's imagine that we've got a VMware vSwitch with 8 pNICs connected to 8 switch ports.
If those 8 switch ports are configured with a static aggregation (on the switch end) and IP hash balancing (on the VMware end), then we've got a single 8Gb/s port from spanning tree's perspective. If the environment has background "noise" consisting of 100Mb/s of garbage broadcast traffic, then the ESX host gets 100Mb/s of garbage representing 1.25% of its incoming bandwidth capacity. Not great, but not the end of the world.
If those 8 ports are configured with the default "host pinning" mechanism, then the switches have eight 1Gb/s ports from spanning tree's perspective. The 100Mb/s of garbage is multiplied eight times. The server gets 800Mb/s of garbage representing 10% of its incoming bandwidth capacity. Yuck.
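Here's a quick back-of-the-envelope sketch of the two cases. The 8 x 1Gb/s links and the 100Mb/s of background broadcast "noise" come straight from the example above:

```python
# Rough comparison of the two edge-port configurations described above.
PNICS = 8
LINK_MBPS = 1000
NOISE_MBPS = 100

# Static aggregation + IP hash: one logical STP port, broadcasts delivered once.
aggregated_garbage = NOISE_MBPS
# Default host pinning: eight independent STP edge ports, broadcasts flooded to each.
pinned_garbage = NOISE_MBPS * PNICS

capacity = PNICS * LINK_MBPS
print(f"aggregated: {aggregated_garbage} Mb/s = {aggregated_garbage / capacity:.2%} of capacity")
print(f"pinned:     {pinned_garbage} Mb/s = {pinned_garbage / capacity:.2%} of capacity")
# -> 1.25% vs 10.00%
```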
This distinction is important, and completely overlooked by most discussions of broadcast domain size.
So, what should we be looking at?
Sheesh, you want answers? All I said was "we're doing it wrong", not "I have answers!"
First, I think we should be looking at the number of spanning tree edge ports. This metric represents both the physical size of the broadcast domain and the impact on our ESX hosts.
Second, I think we should be talking about density of VLANs on trunks. Where possible, it might be worth splitting up an ESX domain so that only certain VLANs are available on certain servers. If the environment consists of LOTS of very sparsely populated VLANs, then an automagic VLAN pruning scheme might be worth deploying.
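To make the "density" idea concrete, here's a toy sketch of spotting pruning candidates: any VLAN allowed on a trunk with no active hosts behind that trunk is a candidate. The trunk names, VLAN numbers, and the active-VLANs mapping are all hypothetical:

```python
# Hypothetical example of measuring VLAN density on trunks.
allowed_on_trunk = {
    "trunk-to-pod1": {10, 20, 30, 40},
    "trunk-to-pod2": {10, 20, 30, 40},
}
active_vlans_behind = {
    "trunk-to-pod1": {10, 20},  # pod1 hosts only live in VLANs 10 and 20
    "trunk-to-pod2": {30, 40},
}

for trunk, allowed in allowed_on_trunk.items():
    prune = allowed - active_vlans_behind.get(trunk, set())
    density = len(allowed - prune) / len(allowed)
    print(f"{trunk}: prune VLANs {sorted(prune)}, useful density {density:.0%}")
```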
Third, I think we need to look at what the servers are doing. Lots of broadcast traffic? Maybe we should have disabled NetBEUI? Maybe we shouldn't mingle the trading floor support system with the general population?
I don't have a clear strategy about how to handle discussions about broadcast domain sizing, but I'm absolutely convinced that discussions of host count and subnet size miss the point. There's risk here, but slicing the virtual server population into lots of little VLANs obviously doesn't fix anything that a network engineer cares about.
Do you have an idea about how to better measure these problems so that we can have more useful discussions? Please share!