Tuesday, December 6, 2011

Thinking about sizing the broadcast domain

Ivan has recently written a couple of posts that have inspired me to put on paper some thoughts about broadcast domain sizing.

We all intuitively know that a "too big" broadcast domain is a problem.  But how big is too big, and what are the relevant metrics?

There was a time when servers did lots of irresponsible broadcasting, but in my experience, stuff that's installed in today's virtual data centers is much better behaved than the stuff of a decade ago.

...and just what is a "broadcast storm" anyway?  Most descriptions I've read are describing something that can be much better categorized as a "bridging loop".  If dozens or hundreds of servers are producing broadcast frames, I benevolently assume that it's because the servers expect us to deliver the frames.  Either that, or the servers are broken, and should be fixed.

I have a background in supporting trading systems that regularly produce aggregate broadcast/multicast rates well in excess of 1Gb/s, and that background probably informs my thinking on this point.  The use of broadcast and multicast suppression mechanisms seems heavy-handed.  What is this traffic exactly?  Why is the server sending it?  Why don't I want to deliver it?  QoS is a much more responsible way to handle this problem if/when there really is one.

Whew, I'm well off track already!

The central point here is that I believe we talk about the wrong things when discussing the size and scope of our L2 networks.  Here's why.

Subnet size is irrelevant.
I used to run a network that included a /16 server access LAN.  The subnet was shockingly full of addresses, though it didn't really need a full /16: a /18 would have worked, but a /19 would have been too small.  The L2 topology consisted of Catalyst 2900XL switches (this was a while ago).  Was it "too big"?

No.  It worked perfectly.

There were only about 100 nodes on this network, and no server virtualization.  Each node had around 100 IP addresses configured on its single NIC.

The scheme here was that each server had the potential to run each of 100 different services.  I used the third octet to identify the service, and the fourth octet to identify the server.  So service 25 on server 27 could be found at 10.0.25.27 (for example).
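The scheme above can be sketched as a trivial function. This is a minimal illustration of the numbering convention only; the function name and range checks are mine, and the 10.0.0.0/16 base comes from the example in the post.

```python
# Sketch of the addressing scheme described above: third octet identifies
# the service, fourth octet identifies the server, inside 10.0.0.0/16.
def service_address(service: int, server: int) -> str:
    """Return the IP for a given service on a given server: 10.0.<service>.<server>."""
    if not (1 <= service <= 254 and 1 <= server <= 254):
        raise ValueError("service and server must each fit in one octet (1-254)")
    return f"10.0.{service}.{server}"

# Service 25 on server 27, as in the example:
print(service_address(25, 27))  # 10.0.25.27
```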

Broadcast frames in this environment were limited to the occasional ARP query.

Host count is irrelevant.
I expect to make a less convincing case here, but my point is that when talking about a virtualized data center, I don't believe we should care about how many (virtual) hosts (IP stacks?) share a VLAN.

The previous example used lots of IP addresses, but only had a small number of hosts present.  Now I want to flip the proportions around.  Let's imagine instead that we have 10,000 virtual machines in a VLAN, but are somehow able to virtualize them onto a single pair of impossibly-large servers.

Is there any problem now?  Our STP topology is ridiculously small at just two ports.  If the ESX hosts and vSwitches are able to handle the traffic pushed into them, why would we be inclined to say that this broadcast domain is "too big?"

Broadcast domain sizing overlooks the impact on shared resources.
Almost all discussions of broadcast domain sizing overlook the fact that we need to consider what one VLAN will do to another when they share common resources.

Obvious points of contention are 802.1Q trunks, which share bandwidth between all VLANs.  Less obvious points of contention are shared ASICs and buffers within a switch.  If you've ever noticed how a 100Mb/s server can hurt seven of its neighbors on a shared-ASIC switching architecture, you know what I'm getting at.

Splitting clients into different VLANs doesn't help if the switching capacity isn't there to back it up, but discussions of subnet sizing usually overlook this detail.

How are the edge ports configured?
Let's imagine that we've got a VMware vSwitch with 8 pNICs connected to 8 switch ports.

If those 8 switch ports are configured with a static aggregation (on the switch end) and IP hash balancing (on the VMware end), then we've got a single 8Gb/s port from spanning tree's perspective.  If the environment has background "noise" consisting of 100Mb/s of garbage broadcast traffic, then the ESX host gets 100Mb/s of garbage representing 1.25% of its incoming bandwidth capacity.  Not great, but not the end of the world.

If those 8 ports are configured with the default "host pinning" mechanism, then the switches have eight 1Gb/s ports from spanning tree's perspective.  The 100Mb/s of garbage is multiplied eight times.  The server gets 800Mb/s of garbage representing 10% of its incoming bandwidth capacity.  Yuck.
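The arithmetic behind those two scenarios is worth making explicit. The figures (8 x 1Gb/s uplinks, 100Mb/s of flooded "noise") are the ones from the post; nothing else is assumed.

```python
# Back-of-the-envelope math for the two uplink configurations above.
noise_mbps = 100   # flooded broadcast "garbage" in the VLAN
uplinks = 8        # pNICs / switch ports
link_mbps = 1000   # 1Gb/s per link

total_mbps = uplinks * link_mbps  # 8Gb/s of aggregate capacity

# Static aggregation + IP-hash balancing: one logical port from STP's
# perspective, so the noise is delivered to the host exactly once.
agg_pct = noise_mbps / total_mbps * 100            # ~1.25%

# Default host pinning: eight independent edge ports, so the flooded
# traffic is delivered on every uplink.
pinned_mbps = noise_mbps * uplinks                 # 800 Mb/s
pinned_pct = pinned_mbps / total_mbps * 100        # ~10%

print(f"aggregated: {agg_pct}% of capacity; pinned: {pinned_pct}% of capacity")
```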

This distinction is important, and completely overlooked by most discussions of broadcast domain size.

So, what should we be looking at?
Sheesh, you want answers?  All I said was "we're doing it wrong", not "I have answers!"

First, I think we should be looking at the number of spanning tree edge ports.  This metric represents both the physical size of the broadcast domain and the impact on our ESX hosts.

Second, I think we should be talking about density of VLANs on trunks.  Where possible, it might be worth splitting up an ESX domain so that only certain VLANs are available on certain servers.  If the environment consists of LOTS of very sparsely populated VLANs, then an automagic VLAN pruning scheme might be worth deploying.

Third, I think we need to look at what the servers are doing.  Lots of broadcast traffic?  Maybe we should have disabled NetBEUI?  Maybe we shouldn't mingle the trading floor support system with the general population?

I don't have a clear strategy for handling discussions about broadcast domain sizing, but I'm absolutely convinced that discussions of host count and subnet size miss the point.  There's risk here, but slicing the virtual server population into lots of little VLANs doesn't obviously fix anything that a network engineer cares about.

Do you have an idea about how to better measure these problems so that we can have more useful discussions?  Please share!

6 comments:

  1. Do you have a feel for transient conditions, as when a switch clears its L2 table and starts flooding all traffic until it learns addresses again? Very large L2 domains tend to suffer more, simply because they have more switches.

    Switch software tends to dump the L2 table when the configuration changes in a way which would be difficult to handle more gracefully, for example when a VLAN is removed from a port. The developers try to flush only affected entries when it is straightforward to do so, but will take the easy way out and flush the whole table when convenient.

    ReplyDelete
  2. Yuck, what switches behave this way? Please name and shame :-)

    I can't say that I've ever noticed this behavior, and I stare at lots of packet captures. Given that ESX servers run in promiscuous mode, unnecessary flooding leads to lots of extra software-based frame processing. Ugly.

    There's a related area that's an easy fix, and often overlooked: Set access ports to STP edge type. Skip this step and server reboots can cause L2 filtering tables to age out quickly.

    If my switches are subject to the table purge problem, I'll never notice, because I don't add VLANs to production switches during business hours :-)

    BTW, Denton, I enjoyed your Jumbo frames article. My current customer (an IT department) has customers (business units) coming at them from all sides, insisting they need jumbos. Your article was timely backup to the arguments I'd been making.

    ReplyDelete
  3. Hi Chris,

    When it comes to broadcast domains, it's all about how many eggs you're comfortable putting into a single basket. One factor, of course, is how many eggs you have relative to your chosen basket size - for example, resiliency rules like "the maximum number of end-users that can be affected by a single outage shall be no more than 50,000" are not unheard of.

    But let's have a look at what may cause these eggs to crack.

    Broadcasts have a couple of interesting properties: (1) bridges are obliged to replicate a broadcast frame/packet (choose your poison - even IEEE seems unable to make up its mind on that one) to all ports attached to the same broadcast domain, except the one it came from; and (2) broadcast frames are in many cases punted to the CPU, in case they are destined for the bridge's control or management plane.

    Now, there are a couple of scenarios worth considering: (a) "normal" operation; and (b) a "broadcast storm", a.k.a. bridging loop.

    Let's look at (b) first. When a bridging loop happens (note I'm saying "when", not "if" ;), a single broadcast frame/packet travels round and round the loop, and at each hop it is multiplied by each bridge and sent to all of its ports in the same broadcast domain. If your switches are fast and your links are low latency, this single frame alone will drive traffic volume quite high. At this point interesting things will likely start to happen - your control plane can become unstable (remember, the CPU that runs your control plane *has* to process all this garbage), which in turn causes further instability (STP or routing protocols start to flap), and ultimately your network likely melts down.

    What about (a)? The answer is "it depends". If we look at your original scenario, with 100 servers and 100 IP addresses on each, each broadcast frame/packet has to be processed by each server only once, despite its 100 IP addresses, because they are all handled by a single copy of the networking stack. Now, if we turn to the hypothetical scenario where those 10,000 VMs sit on a pair of hosts, the CPU on each of those hosts will have to deal with every single broadcast frame/packet 5,000 times, because as it arrives, it will be replicated by the vSwitch to all 5,000 vNICs attached to the same port group. In this case your innocent 100 Mbit/s of broadcast traffic suddenly becomes 500,000 Mbit/s (that's 500 Gbit/s, which could be more than your server's total bus bandwidth). Oops.
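    To put numbers on that replication effect (a quick arithmetic sketch; the 100 Mbit/s and 5,000-vNICs-per-host figures are the ones from the scenario above):

```python
# Each broadcast arriving at a host is replicated by the vSwitch to every
# vNIC in the port group, so internal delivery bandwidth scales with vNIC count.
broadcast_mbps = 100     # broadcast traffic arriving at the host
vnics_per_host = 5000    # 10,000 VMs split across two hosts

internal_mbps = broadcast_mbps * vnics_per_host
print(internal_mbps, "Mbit/s =", internal_mbps / 1000, "Gbit/s")  # 500000 Mbit/s = 500.0 Gbit/s
```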

    Hope this makes sense. :)

    Cheers,

    -- Dmitri

    ReplyDelete
  4. Hey Dmitri,

    You make perfect sense. I hope I did too. Note that I called my virtualization platform "impossibly large" for a reason :-)

    My main point here is that we don't seem to have settled on useful metrics with which to have this discussion. Even Ivan (for whom I have tons of respect) illustrates that point in one of the posts I linked. Talking about the size of a broadcast domain doesn't tell the whole story. There's so much more to consider. But we don't.

    Why do you say "when" about bridging loops? I admit that I've never seen one that wasn't deliberately caused. Have I just been lucky? The bridging loop case studies that I'm familiar with were all preventable with modern safety features and sane administrative practices (STP *guard, CoPP, L2 design, etc...)

    ReplyDelete
  5. Chris, I am glad to see this post and hope to see more comments on what metrics should be measured to calculate when one might expect to experience performance or stability issues in large L2 domains.

    Most servers today generate far less broadcast traffic than they did 10 years ago, and bandwidth and CPU speeds have improved significantly.

    I ran a network with some large L2 domains, running 6000 VMs across hundreds of VLANs, everything trunked to all servers.

    Neither broadcast traffic, CPU usage, nor stability was ever a problem.

    ReplyDelete
  6. Chris,

    Regarding bridging loops, I think it stems from the differences between our worlds - I come from the service provider world, where constant configuration changes to the production network during operating hours are the norm (run-of-the-mill moves/adds/changes, not network re-design, of course). And where many changes are happening, made by people, mistakes are bound to happen. Things are much better now, with proper carrier-grade, largely MPLS-based kit replacing traditional VLAN bridges, but the memory remains. :)

    Traditional enterprise networks are much more stable and usually don't require as many changes, but "going cloud" may be changing that. This is why automation is so important - to prevent fat-fingering and cut-n-paste type mistakes.

    Back to the point of sizing - in my original post I was trying to say that ultimately the size of a broadcast domain that is "right" for your particular circumstances is determined by (a) the size of the fault domain your risk profile permits (which in many cases is the same as the broadcast domain - if it is flooded, everybody on it suffers pretty much the same); and (b) how much broadcast traffic the CPUs in your bridges and servers can handle before things start to get unstable.

    ReplyDelete