Wednesday, December 28, 2011

Powering Network Gear | NetworkingNerd Follow-Up

Tom Hollingsworth published a great post about some of the common NEMA and IEC power connectors that you're likely to encounter when working on network gear in North America. Tom's post inspired me to throw together a short list of gotchas about powering network gear.

North America Power Cord
One of the many power cord choices you'll encounter when ordering a Cisco switch is this one:

CAB-N5K6A-NA -- Power Cord, 210/220V 30A North America

This is a post about providing power in North America, so it sounds like a good cable, right? Wrong. The male end of this cable features a NEMA 6-15 plug. It would have been nice of Cisco to mention that detail in the product description, huh? Also, I'm a little puzzled by the "30A" reference in the description. A NEMA 6-15 outlet should be backed by a 15A circuit breaker, and the C13 connector is rated for only 10A, as the sketch below indicates.

CAB-N5K6A-NA with NEMA 6-15 connector.
Drawing copied from Nexus 5000 Install Guide.
It's such a weird cable that Tom didn't bother to mention it in his cable rundown post, and I don't think I've ever seen one of these outlets in the wild. I have, however, seen a customer place a large order, and specify these cables (two of them, actually!) for every top-of-rack component.

Cable Length and Where Is The Inlet?
Consider the following power cord choices for the Nexus 5020 switch (note that this is just 1/3 of the choices for this platform):

CAB-N5K6A-NA -- Power Cord, 210/220V 30A North America
CAB-AC-250V/13A -- North America, NEMA L6-20 250V/20A plug, IEC320/C13 receptacle
CAB-C13-C14-2M -- Power Cord Jumper, C13-C14 Connectors, 2 Meter Length
CAB-9K12A-NA -- Power Cord, 125VAC 13A NEMA 5-15 Plug, North America
CAB-C13-CBN -- Cabinet Jumper Power Cord, 250 VAC 10A, C14-C13 Connectors

It's pretty common to find power strips with IEC C13 outlets in server racks these days, and the Nexus 5000 has C14 power inlets. A C13-C14 cable seems like the obvious choice. But there are two of them listed here! That's because CAB-C13-C14-2M is 2m long (as indicated), and CAB-C13-CBN is only 0.7m long. Again, it would have been nice of Cisco to mention this detail in the product description.

For a Top-Of-Rack Nexus 5010/5020, CAB-C13-CBN is probably the right choice. The power inlets are on the back (hot aisle side) of the switch, right where we're probably going to find a power outlet. 2m would be way too much power cord for this application.

But what about a Top-Of-Rack Nexus 5500, Nexus 2000, or fixed-configuration Catalyst switch? Those units tend to have the power inlet on the opposite side. A 0.7m cable would be too short, so we should order the CAB-C13-C14-2M for those.

What Voltage?
When talking about NEMA outlets, it's easy to know what voltage you're going to find. NEMA 5-xxx indicates 110V, and NEMA 6-xxx indicates 220V. But the IEC outlets can be unpredictable. They're probably going to be 220V, in spite of the fact that the C13 cords powering the system I'm sitting in front of right now deliver only 110V.

Most data center gear includes auto-ranging power supplies, so it's no problem, but things that weren't intended to live in the data center might have a voltage selector switch. I've seen Sun and Dell workstations blown up by connecting them to 220V power without setting the manual voltage selector switch.

Along those same lines, I always carry one of these adapter cords so that I can charge my laptop in the data center.
It's super-handy, but creates a situation where you might accidentally plug a 110V device into 220V power. Be careful.

Power Supplies: Bigger Isn't Better
Many of my customers will automatically select the biggest available power supply when specifying their chassis-based switches. Hey, bigger is better, and they only cost a little bit more, right?

Large power supplies generally get big power through one of two options:
  • Multiple power inputs
  • High current power inputs
The high current power inlet approach can be a problem. The largest removable power cords we have available are the C19 connector type, rated for up to 16A. Power supplies with 30A inlets like the 4000W unit for Catalyst 6500 and the 7500W unit for Nexus 7000 have fixed cords that can't be disconnected from the power supply.

4KW PSU for Catalyst 6500
7.5KW PSU for Nexus 7000

When most power supplies fail, you simply unplug the cable from the PSU, replace the PSU, and reattach the cable. With fixed cords, you'll need to fish the whole cord, along with its huge connector, out of the rack. Sometimes pulling that cord out of the rack isn't possible because too many new cables have been run through the penetration in the floor/rack/whatever since the switch was installed. Now what? Chopping off the end of the dead cord is a reasonable option, but how do you get the new PSU installed?

Power supplies with removable cords are often preferable because they simplify operations.

2 PDUs; 2 PSUs; 4 cords -- Now what?
Unfortunately, multiple-input power supplies can complicate the initial design. Cisco refers to this set of issues using the phrases "input source redundancy" and "power supply redundancy." The issue boils down to this: how will you connect the four power cables between two PSUs and two PDUs?

You probably want to run both power cords from each PSU to the same circuit/PDU (assuming the PDU can deliver the required current), but people tend to split each device between multiple PDUs. If/when a PDU fails, each power supply suddenly drops to half of its previous capacity. If you've configured the switch for full power redundancy mode, the switch will begin shutting down line cards until the allocated power falls below the threshold represented by half of one power supply. It's ugly, and I've seen it happen more than once.
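The arithmetic behind that failure mode is easy to sketch. All wattage figures below are invented for illustration; the point is the difference between splitting each PSU's cords across PDUs and homing each PSU to a single PDU:

```python
# Hypothetical numbers: a chassis with dual-input PSUs in full power
# redundancy mode, where losing one input halves a PSU's output.
PSU_WATTS = 6000   # PSU capacity with both inputs live
ALLOCATED = 5500   # power budget committed to installed line cards

def available_after_pdu_loss(cords_split_across_pdus):
    """Usable watts in full-redundancy mode after one PDU dies."""
    if cords_split_across_pdus:
        # Both PSUs lose one input: each drops to half capacity, and
        # redundancy mode only counts one (now degraded) PSU.
        return PSU_WATTS // 2
    # Each PSU homed to a single PDU: one PSU dies outright, but the
    # survivor keeps both inputs and full capacity.
    return PSU_WATTS

print(available_after_pdu_loss(True) >= ALLOCATED)   # False: line cards shut down
print(available_after_pdu_loss(False) >= ALLOCATED)  # True: chassis rides it out
```

The counterintuitive result: wiring both of a PSU's cords to the same PDU survives the PDU failure, while "diversifying" the cords does not.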

The same thing comes up when configuring 3750X stackable switches with StackPower. Think carefully about how the power cords align with available circuits. Assume you're going to lose a circuit. Will the stack survive? If you configure every switch with dual power supplies, then it'll be fine, but you might as well skip StackPower at that point because you've just made each switch individually redundant.

To solve this problem the StackPower way, you need to make sure that the stack power pool has sufficient capacity even when you're down a circuit. Unfortunately, there's no way to validate power-cord-to-circuit mapping remotely. The only way to make sure this is done correctly is to visit the closet and trace out the power cords.

Monday, December 26, 2011

Frequently Bought Together

A ridiculously small backup battery for Blackberry and a 9-pin serial gender changer.

I have an idea why Amazon might have linked these two products, but it kind of blows my mind. I guess nobody else is buying those little batteries.

Tuesday, December 20, 2011

Pricing and Trading Networks: Down is Up, Left is Right

My introduction to enterprise networking was a little backward. I started out supporting trading floors, backend pricing systems, low-latency algorithmic trading systems, etc... I got there because I'd been responsible for UNIX systems producing and consuming multicast data at several large financial firms.

Inevitably, the firm's network admin folks weren't up to speed on matters of performance tuning, multicast configuration and QoS, so that's where I focused my attention. One of these firms offered me a job with the word "network" in the title, and I was off to the races.

It amazes me how little I knew in those days. I was doing PIM and MSDP designs before the phrases "link state" and "distance vector" were in my vocabulary! I had no idea what was populating the unicast routing table of my switches, but I knew that the table was populated, and I knew what PIM was going to do with that data.

More incredible is how my ignorance of "normal" ways of doing things (AVVID, SONA, Cisco Enterprise Architecture, multi-tier designs, etc...) gave me an advantage over folks who had been properly indoctrinated. My designs worked well for these applications, but looked crazy to the rest of the network staff (whose underperforming traditional designs I was replacing).

The trading floor is a weird place, with funny requirements. In this post I'm going to go over some of the things that make trading floor networking... Interesting.

Redundant Application Flows
The first thing to know about pricing systems is that you generally have two copies of any pricing data flowing through the environment at any time.  Ideally, these two sets originate from different head-end systems, get transit from different wide area service providers, ride different physical infrastructure into opposite sides of your data center, and terminate on different NICs in the receiving servers.

If you're getting data directly from an exchange, that data will probably be arriving as multicast flows. Redundant multicast flows. The same data arrives at your edge from two different sources, using two different multicast groups.

If you're buying data from a value-add aggregator (Reuters, Bloomberg, etc...), then it probably arrives via TCP from at least two different sources. The data may be duplicate copies (redundancy), or be distributed among the flows with an N+1 load-sharing scheme.

Losing One Packet Is Bad
Most application flows have no problem with packet loss. High performance trading systems are not in this category.

Think of the state of the pricing data like a spreadsheet. Each row represents a security -- something that traders buy and sell. The columns represent attributes of that security: bid price, ask price, daily high and low, last trade price, last trade exchange, etc...

Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million cells. Every message that rolls in from a multicast feed updates one of those cells. You just lost a packet. Which cell is wrong? Easy answer: All of them. If a trader can't trust his data, he can't trade.
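A minimal sketch of that receiver-side logic, with invented class and method names and a toy sequence-number scheme, shows why one gap poisons the whole cache:

```python
# Sketch (names invented): a feed handler tracking the feed's sequence
# numbers. One gap means some unknown cell is stale, so the entire
# cache becomes untrustworthy until recovery completes.
class PriceCache:
    def __init__(self):
        self.cells = {}      # (symbol, field) -> value
        self.next_seq = 1
        self.trusted = True

    def on_update(self, seq, symbol, field, value):
        if seq != self.next_seq:
            # We don't know which (symbol, field) the lost message
            # carried -- every cell is now suspect.
            self.trusted = False
        self.cells[(symbol, field)] = value
        self.next_seq = seq + 1

cache = PriceCache()
cache.on_update(1, "CSCO", "bid", 18.01)
cache.on_update(3, "IBM", "ask", 183.50)  # seq 2 was lost
print(cache.trusted)  # False: even the CSCO bid can no longer be trusted
```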

Reconvergence Is Bad
Because we've got two copies of the data coming in, there's no reason to fix a single failure. If something breaks, you can let it stay broken until the end of the day.

What's that? You think it's worth fixing things with a dynamic routing protocol? Okay cool, route around the problem. Just so long as you can guarantee that "flow A" and "flow B" never traverse the same core router. Why am I paying for two copies of this data if you're going to push it through a single device? You just told me that the device is so fragile that you feel compelled to route around failures!

Don't Cluster the Firewalls
The same reason we don't let routing reconverge applies here. If there are two pricing firewalls, don't tell them about each other. Run them as standalone units. Put them in separate rooms, even.  We can afford to lose half of a redundant feed. We cannot afford to lose both feeds, even for the few milliseconds required for the standby firewall to take over. Two clusters (four firewalls) would be okay; just keep the "A" and "B" feeds separate!

Don't team the server NICs
The flow-splitting logic applies all the way down to the servers. If they've got two NICs available for incoming pricing data, those NICs should be dedicated per-flow. Even if there are NICs aplenty, the teaming schemes are all bad news because, like flows, application components are also disposable. It's okay to lose one. Getting one back? That's sometimes worse. Keep reading...

Recovery Can Kill You
Most of these pricing systems include a mechanism for data receivers to request retransmission of lost data, but the recovery can be a problem. With few exceptions, the network applications in use on the trading floor don't do any sort of flow control. It's like they're trying to hurt you.

Imagine a university lecture where a sleeping student wakes up, asks the lecturer to repeat the last 30 minutes, and the lecturer complies. That's kind of how these systems work.

Except that the lecturer complies at wire speed, and the whole lecture hall full of students is compelled to continue taking notes. Why should every other receiver be penalized because one system screwed up? I've got trades to clear!

The following snapshot is from the Cisco CVD for trading systems. It shows how aggressive these systems can be. A nominally 5Mb/s trading application regularly hits wire speed (100Mb/s) in this case.

The graph shows a small network when things are working right. A big trading backend at a large financial services firm can easily push that green line into the multi-gigabit range. Make things interesting by breaking stuff, and you'll easily overrun even your best 10Gb/s switch buffers (6716 cards have 90MB per port).

Slow Servers Are Good
Lots of networks run with clients deliberately connected at slower speeds than their server. Maybe you have 10/100 ports in the wiring closet and gigabit-attached servers. Pricing networks require exactly the opposite. The lecturer in my analogy isn't just a single lecturer. It's a team of lecturers. They all go into wire-speed mode when the sleeping student wakes up.

How will you deliver multiple simultaneous gigabit-ish multicast streams to your access ports? You can't. I've fixed more than one trading system by setting server interfaces down to 100Mb/s or even 10Mb/s. Fast clients, slow servers is where you want to be.

Slowing down the servers can turn N*1Gb/s worth of data into N*100Mb/s -- something we can actually handle.

Bad Apple Syndrome
The sleeping student example is actually pretty common. It's amazing to see the impact that can arise from things like:
  • a clock update on a workstation
  • ripping a CD with iTunes
  • briefly closing the lid on a laptop
The trading floor is usually a population of Windows machines with users sitting behind them. Keeping these things from killing each other is a daunting task. One bad apple will truly spoil the bunch.

How Fast Is It?
System performance is usually measured in terms of stuff per interval. That's meaningless on the trading floor. The opening bell at NYSE is like turning on a fire hose. The only metric that matters is the answer to this question: Did you spill even one drop of water?

How close were you to the limit? Will you make it through tomorrow's trading day too?

I read on twitter that Ben Bernanke got a bad piece of fish for dinner. How confident are you now? Performance of these systems is binary. You either survived or you did not. There is no "system is running slow" in this world.

Routing Is Upside Down
While not unique to trading floors, we do lots of multicast here. Multicast is funny because it relies on routing traffic away from the source, rather than routing it toward the destination. Getting into and staying in this mindset can be a challenge. I started out with no idea how routing worked, so had no problem getting into the multicast mindset :-)

Almost every network protocol relies on data receivers ACKnowledging their receipt of data. But not here. Pricing systems only speak up when something goes missing.

QoS Isn't The Answer
QoS might seem like the answer to make sure that we get through the day smoothly, but it's not. In fact, it can be counterproductive.

QoS is about managed un-fairness... Choosing which packets to drop. But pricing systems are usually deployed on dedicated systems with dedicated switches. Every packet is critical, and there's probably more of them than we can handle. There's nothing we can drop.

Making matters worse, enabling QoS on many switching platforms reduces the buffers available to our critical pricing flows, because the buffers necessarily get carved up so they can be allocated to different kinds of traffic. It's counterintuitive, but 'no mls qos' is sometimes the right thing to do.

Load Balancing Ain't All It's Cracked Up To Be
By default, CEF doesn't load balance multicast flows. CEF load balancing of multicast can be enabled and enhanced, but doesn't happen out of the box.

We can get screwed on EtherChannel links too: sometimes these quirky applications intermingle unicast data with the multicast stream. Perhaps a latecomer to the trading floor wants to start watching Cisco's stock price.  Before he can begin, he needs all 100 cells associated with CSCO. This is sometimes called the "Initial Image." He ignores updates for CSCO until he's got that starting point loaded up.

CSCO has updated 9000 times today, so the server unicasts the initial image: "Here are all 100 cells for CSCO as of update #9000: blah blah blah...". Then the price changes, and the server multicasts update #9001 to all receivers.

If there's a load balanced path (either CEF or an aggregate link) between the server and client, then our new client could get update 9001 (multicast) before the initial image (unicast) shows up. The client will discard update 9001 because he's expecting a full record, not an update to a single cell.

Next, the initial image shows up, and the client knows he's got everything through update #9000. Then update #9002 arrives. Hey, what happened to #9001?

Post-mortem analysis of these kinds of incidents will boil down to the software folks saying:
We put the messages on the wire in the correct order. They were delivered by the network in the wrong order.
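The client behavior described above can be sketched like this (class and method names are invented; the sequence numbers come from the CSCO story):

```python
# Sketch of the client logic: updates that arrive before the initial
# image are discarded, so a reordered delivery across a load-balanced
# path leaves a permanent, unfillable gap.
class FeedClient:
    def __init__(self):
        self.have_image = False
        self.last_seq = None
        self.gap = False

    def on_initial_image(self, seq):
        self.have_image = True
        self.last_seq = seq

    def on_update(self, seq):
        if not self.have_image:
            return               # discard: we need a full record first
        if seq != self.last_seq + 1:
            self.gap = True      # the discarded update is missed forever
        self.last_seq = seq

c = FeedClient()
c.on_update(9001)         # multicast beats the unicast image: discarded
c.on_initial_image(9000)  # "everything through update #9000"
c.on_update(9002)         # hey, what happened to #9001?
print(c.gap)  # True
```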
ARP Times Out
NACK-based applications sit quietly until there's a problem. So quietly that they might forget the hardware address associated with their gateway or with a neighbor.

No problem, right? ARP will figure it out... Eventually. Because these are generally UDP-based applications without flow control, the system doesn't fire off a single packet, then sit and wait like it might when talking TCP. No, these systems can suddenly kick off a whole bunch of UDP datagrams destined for a system it hasn't talked to in hours.

The lower layers in the IP stack need to hold onto these packets until the ARP resolution process is complete. But the packets keep rolling down the stack! The outstanding ARP queue is only 1 packet deep in many implementations. The queue overflows and data is lost. It's not strictly a network problem, but don't worry. Your phone will ring.
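A toy model of that drop (the real queue depth varies by OS and stack; 1 is the classic worst case described above):

```python
# While ARP resolution is pending, only what fits in the outstanding
# ARP queue survives a sudden burst of datagrams.
ARP_QUEUE_DEPTH = 1

def send_burst(n_datagrams, arp_resolved=False):
    """Return (delivered, dropped) for a burst to a host with no ARP entry."""
    if arp_resolved:
        return n_datagrams, 0
    queued = min(n_datagrams, ARP_QUEUE_DEPTH)
    return queued, n_datagrams - queued

print(send_burst(50))        # (1, 49): 49 datagrams lost waiting on ARP
print(send_burst(50, True))  # (50, 0)
```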

Losing Data Causes You to Lose Data
There's a nasty failure mode underlying the NACK-based scheme. Lost data will be retransmitted. If you couldn't handle the data flow the first time around, why expect to handle wire speed retransmission of that data on top of the data that's coming in the next instant?

If the data loss was caused by a Bad Apple receiver, then all his peers suffer the consequences. You may have many bad apples in a moment. One Bad Apple will spoil the bunch.

If the data loss was caused by an overloaded network component, then you're rewarded by compounding increases in packet rate. The exchanges don't stop trading, and the data sources have a large queue of data to re-send.

TCP applications slow down in the face of congestion. Pricing applications speed up.
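A back-of-envelope simulation makes the feedback loop visible. All rates below are invented; the model is simply a link that drops whatever exceeds its capacity, feeding a NACK-based source that re-sends every drop on top of the live feed:

```python
# NACK-based retransmission: every drop gets re-sent at full rate on
# top of the live feed, instead of the sender backing off like TCP.
CAPACITY = 1000   # link capacity, Mb/s

def simulate(live_mbps, initial_loss_mbps, ticks):
    """Offered load per tick after a loss event of initial_loss_mbps."""
    backlog = initial_loss_mbps   # dropped data awaiting retransmission
    history = []
    for _ in range(ticks):
        offered = live_mbps + backlog
        dropped = max(0, offered - CAPACITY)
        backlog = dropped         # everything dropped is re-sent next tick
        history.append(offered)
    return history

print(simulate(900, 300, 4))   # [1200, 1100, 1000, 900]: drains, barely
print(simulate(1000, 300, 4))  # [1300, 1300, 1300, 1300]: never recovers
```

With headroom, the backlog drains; without it, the retransmissions themselves keep the link saturated forever.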

Packet Decodes Aren't Available
Some of the wire formats you'll be dealing with are closed-source secrets. Others are published standards for which no Wireshark decodes are publicly available. Either way, you're pretty much on your own when it comes to analysis.

Responding to Will's question about data sources: The streams come from the various exchanges (NASDAQ, NYSE, FTSE, etc...). Because each of these exchanges uses its own data format, there are usually some layers of processing required to get them into a common format for application consumption. This processing can happen at a value-add data distributor (Reuters, Bloomberg, Activ), or it can be done in-house by the end user. Local processing has the advantage of lower latency, because you don't have to have the data shipped from the exchange to a middleman before you see it.

Other streams come from application components within the company. There are usually some layers of processing (between 2 and 12) between a pricing update first hitting your equipment, and when that update is consumed by a trader. The processing can include format changes, addition of custom fields, delay engines (delayed data can be given away for free), vendor-switch systems (I don't trust data vendor "A", switch me to "B"), etc...

Most of those layers are going to be multicast, and they're going to be the really dangerous ones, because the sources can clobber you with LAN speeds, rather than WAN speeds.

As far as getting the data goes, you can move your servers into the exchange's facility for low-latency access (some exchanges actually provision the same length of fiber to each colocated customer, so that nobody can claim a latency disadvantage), you can provision your own point-to-point circuit for data access, you can buy a fat local loop from a financial network provider like BT/Radianz (probably MPLS on the back end so that one local loop can get you to all your pricing and clearing partners), or you can buy the data from a value-add aggregator like Reuters or Bloomberg.

Responding to Will's question about SSM:  I've never seen an SSM pricing component. They may be out there, but they might not be a super good fit. Here's why: Everything in these setups is redundant, all the way down to software components. It's redundant in ways we're not used to seeing in enterprises. No load-balancer required here. The software components collaborate and share workload dynamically. If one ticker plant fails, his partner knows what update was successfully transmitted by the dead peer, and takes over from that point. Consuming systems don't know who the servers are, and don't care. A server could be replaced at any moment.

In fact, it's not just downstream pricing data that's multicast. Many of these systems use a model where the clients don't know who the data sources are. Instead of sending requests to a server, they multicast their requests for data, and the servers multicast the replies back. Instead of:
<handshake> hello server, nice to meet you. I'd like such-and-such.
it's actually:
hello? servers? I'd like such-and-such! I'm ready, so go ahead and send it whenever...
Not knowing who your server is kind of runs counter to the SSM ideal. It could be done with a pool of servers, I've just never seen it.

The exchanges are particularly slow-moving when it comes to changing things. The modern exchange feed, particularly ones like the "touch tone" example I cited are literally ticker-tape punch signals wrapped up in an IP multicast header.

The old school scheme was to have a ticker tape machine hooked to a "line" from the exchange.  Maybe you'd have two of them (A and B again). There would be a third one for retransmit. Ticker machine run out of paper? Call the exchange, and here's more-or-less what happens:

  • Somebody at the exchange cuts the section of paper containing the updates you missed out of their spool of ticker tape.  Actual scissors are involved here.
  • Next, they grab a bit of header tape that says: "this is retransmit data for XYZ Bank".
  • They tape these two pieces of paper together, and feed them through a reader that's attached to the "retransmit line"
  • Every bank in New York will get the retransmits, but they'll know to ignore them because of the header.
  • XYZ Bank clips the retransmit data out of the retransmit ticker machine, and pastes it into place on the end where the machine ran out of paper.
These terms -- "tick," "line," "retransmit," etc. -- all still apply with modern IP-based systems. I've read the developer guides for these systems (to write Wireshark decodes), and it's like a trip back in time. Some of these systems are still so closely coupled to the paper-punch system that you get chads all over the floor and paper cuts all over your hands just from reading the API guide :-)

Nerd Humor - Naming Software Projects

There's a long tradition of using clever and humorous names for open source software projects. I've recently been introduced to a particularly striking example, and it's got me thinking about some of the funny language games played by open source software folks.

In the 1960s there was the Basic Combined Programming Language (BCPL), which begat a stripped-down derivative called B. B gave way to a new language: C, the next letter both in the alphabet and in "BCPL". At this point, BCPL began to be referred to by the backronym "Before C Programming Language." Next, C was followed up not by D, but by C++, because ++ is how you increment something in C. Then Microsoft gave us C sharp, whose musical notation means roughly "raised by one," and whose symbol kind of looks like two "++" operators stacked up. Har har.

In the 1980's, rms decided the world needed a truly free UNIX-like operating system.  He named his project GNU, which of course stood for "GNU's Not Unix", leading to much recursive acronym hilarity.  GNU's Hurd kernel stands for "Hird of Unix Replacing Daemons", and Hird stands for "Hurd of Interfaces Representing Depth." Oh my.

The first email client I ever used was Elm (ELectronic Mail), which I later abandoned for Pine (backronymed: Pine Is Not ELM).  Due to licensing restrictions, the University of Washington stopped development of Pine, and shifted their effort to an Apache Licensed version of Pine:  Alpine.  Plus, they're in the Northwest corner of the United States of America.  Lots of evergreen trees up there from what I understand.  Also mountains.  Trees on mountains, even.

I briefly experimented with some Instant Messaging server software called TwoCan.  I can't find their logo anymore, but it consisted of two cans and a string. Chat technology at its finest! Also, it reminds me a lot of Jeff Fry's avatar on the twitter.

The NoCat project is a wifi sharing scheme somewhat like a free/open version of what iPass offers. The project's logo makes it perfectly clear that there are no cats involved. They explain the name this way:

Albert Einstein, when asked to describe radio, replied:
"You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles. Do you understand this? And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that thereis no cat."

The project that got me thinking about clever names and logos is the Linux Pacemaker component of the Linux-HA server clustering project. Pacemaker is an add-on to the heartbeat daemon. Get it? Their logo is a set of stylized rabbit ears, so they've got both the EKG/heartbeat thing, as well as the "set the pace for high performance" rabbit thing going on.  Clever stuff.

And then there's STONITH, Pacemaker's awesome dual-active / split-brain remediation mechanism. STONITH stands for "Shoot The Other Node In The Head."
Shoot him in the head? Well, that should take care of any dual-active problems, all right! The implementation is just as dramatic as the name suggests: Basically, it boils down to each node in the cluster being logged into the other node's power strip. Misbehave and I'll cut your power!

Split-brain / dual-active detection and remediation is something with which we're familiar in the networking department.  I'm a little bit disappointed that we don't have anything as crazy / awesome as STONITH in our toolbox...
STONITH as imagined by Tim Serong

These were just a few examples off the top of my head. What funny / clever / layered meaning project names have I missed?

Monday, December 19, 2011

Jumbo Frames, Support, LACP and Performance : Picking Nits

Denton Gentry wrote a great article in which he explained why jumbo frames are not the panacea that so many people expect them to be.

It was a timely article for me to find, because I happened to have been going around in circles with a few different user groups who were sure that jumbo frames are the solution to all of their problems, but who are unwilling to do the administrative work required for implementation, or the analytic work to see whether there's anything to be gained.

The gist of Denton's argument is that jumbo frames are just one way of reducing the amount of work required for a server to send a large volume of data.  Modern NICs and drivers have brought us easier to support ways of accomplishing the same result.  The new techniques work even when the system we're talking to doesn't support jumbos, and they even work across intermediate links with a small MTU.

Jumbo frames reduce the server workload because larger frames mean fewer per-frame operations need to be performed.  TCP offload tricks reduce workload by eliminating per-frame operations altogether.  The only remaining advantage for jumbo frames is the minuscule amount of bandwidth saved by sending fewer headers in an end-to-end jumbo-enabled environment.
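The "minuscule" claim is easy to put numbers on. Assuming standard Ethernet framing (preamble, header, FCS, and inter-frame gap add up to 38 bytes on the wire) plus 40 bytes of IP and TCP headers per frame:

```python
# Bandwidth efficiency of a full-sized TCP segment: payload bytes
# divided by total wire bytes per frame.
WIRE_OVERHEAD = 8 + 14 + 4 + 12   # preamble + Ethernet header + FCS + IFG
IP_TCP_HEADERS = 20 + 20          # no options

def efficiency(ip_mtu):
    payload = ip_mtu - IP_TCP_HEADERS
    return payload / (ip_mtu + WIRE_OVERHEAD)

print(round(efficiency(1500), 4))  # 0.9493
print(round(efficiency(9000), 4))  # 0.9914
```

Roughly four percentage points of bandwidth separate standard and jumbo frames; everything else jumbos used to buy is now covered by offload.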

There's a small facet to this discussion that's been niggling at me, but I've been hesitant to bring it up because I'm not sure how significant it is.

Why not just use Jumbos?
I'm always hesitant to enable jumbo frames for a customer because it tends to be a difficult configuration to support.  Sure, typing in the configuration is easy, but that configuration takes us down a non-standard rabbit hole where too few people understand the rules.

Every customer I've worked with has made mistakes in this regard.  It's a support nightmare that leads to lots of trouble tickets because somebody always forgets to enable jumbos when they deploy a new server/router/switch/coffeepot.

The Rules
  1. All IP hosts sharing an IP subnet need to agree on the IP MTU in use on that subnet.
  2. All L2 gear supporting that subnet must be able to handle the largest frame one of those hosts might generate.
Rule 1 means that if you're going to enable jumbo frames on a subnet, you need to configure all systems at the same time.  All servers, desktops, appliances, routers on the segment need to agree.  This point is not negotiable.  PMTUD won't fix things if they don't agree.  Nor will TCP's MSS negotiation mechanism.  Just make them match.

Rule 2 means that all switches and bridges have to support at least the largest frame.  Larger is okay, smaller is not.  The maximum frame size value will not be the same as the IP MTU, because it needs to take into account the L2 header.

For extra amusement, different products (even within a single vendor's product lineup) don't agree about how the MTU configuration directives are supposed to be interpreted, making the rules tough to follow.
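Rule 2's arithmetic, using the usual Ethernet header sizes (how a given platform's MTU knob counts these bytes varies, which is exactly the amusement mentioned above, so check your documentation):

```python
# Minimum switch max-frame-size needed to carry a given IP MTU.
ETH_HEADER = 14   # dst MAC + src MAC + ethertype
DOT1Q_TAG = 4     # only present on tagged (trunk) links
FCS = 4

def min_switch_frame(ip_mtu, tagged=True):
    return ip_mtu + ETH_HEADER + (DOT1Q_TAG if tagged else 0) + FCS

print(min_switch_frame(9000))         # 9022 on a trunk port
print(min_switch_frame(9000, False))  # 9018 on an access port
```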

So, what's been niggling at me?
In a modern (Nexus) Cisco data center, we push servers towards using LACP instead of active/standby redundancy.  There are various reasons for this preference relating to orphan ports, optimal switching path, the relative high cost of an East/West trip across the vPC peer-link, being confident that the "standby" NIC and switchport are configured correctly, etc...  LACP to the server is good news for all these reasons.

But it's bad news for another reason.  While aggregate links on a switch are free because they're implemented in hardware, aggregation at the server is another story.  Generally speaking, servers instantiate a virtual NIC, and apply their IP configuration to it.  The virtual NIC is a software wedge between the upper stack layers and the hardware.  It's not free, and it is required to process every frame/PDU/write/whatever handed down from software to hardware, and vice versa.

So, when we turn on LACP on the server, we add per-PDU software processing that wasn't there before, re-kindling the notion that larger PDUs are better.  The various TCP offload features can probably be retained, and the performance of aggregate links is generally good.  YMMV; check with your server OS vendor.

I'm not sure that we're forcing the server folks to take a step backwards in terms of performance, but I'm afraid that we're supplying a foothold for the pro-jumbo argument which should have ended years ago.

Tuesday, December 6, 2011

Thinking about sizing the broadcast domain

Ivan has recently written a couple of posts that have inspired me to put on paper some thoughts about broadcast domain sizing.

We all intuitively know that a "too big" broadcast domain is a problem.  But how big is too big, and what are the relevant metrics?

There was a time when servers did lots of irresponsible broadcasting, but in my experience, stuff that's installed in today's virtual data centers is much better behaved than the stuff of a decade ago.

...and just what is a "broadcast storm" anyway?  Most descriptions I've read are describing something that can be much better categorized as a "bridging loop".  If dozens or hundreds of servers are producing broadcast frames, I benevolently assume that it's because the servers expect us to deliver the frames.  Either that, or the servers are broken, and should be fixed.

I have a background in supporting trading systems that regularly produce aggregate broadcast/multicast rates well in excess of 1Gb/s, and that background probably informs my thinking on this point.  The use of broadcast and multicast suppression mechanisms seems heavy handed.  What is this traffic exactly?  Why is the server sending it?  Why don't I want to deliver it?  QoS is a much more responsible way to handle this problem if/when there really is one.

Whew, I'm well off track already!

The central point here is that I believe we talk about the wrong things when discussing the size and scope of our L2 networks.  Here's why.

Subnet size is irrelevant.
I used to run a network that included a /16 server access LAN.  The subnet was shockingly full, but didn't really need a full /16.  A /18 would have worked, but /19 would have been too small.  The L2 topology consisted of Catalyst 2900XL switches (this was a while ago).  Was it "too big?"

No.  It worked perfectly.

There were only about 100 nodes on this network, and no server virtualization.  Each node had around 100 IP addresses configured on its single NIC.

The scheme here was that each server had the potential to run each of 100 different services.  I used the third octet to identify the service, and the fourth octet to identify the server.  So service 25 on server 27 could be found at an address ending in .25.27, for example.
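The mapping is mechanical enough to sketch in a few lines.  Note that the 10.1.0.0/16 prefix here is a placeholder of my choosing; the actual network isn't part of the story:

```python
# Sketch of the service/server addressing scheme described above.
# The 10.1.0.0/16 prefix is a hypothetical stand-in -- the real
# network from the post isn't given.

import ipaddress

NETWORK = ipaddress.ip_network("10.1.0.0/16")  # assumed prefix

def service_address(service_id, server_id):
    """Third octet identifies the service, fourth octet the server."""
    if not (0 < service_id < 256 and 0 < server_id < 256):
        raise ValueError("octet values must be 1-255")
    addr = ipaddress.ip_address(
        int(NETWORK.network_address) + (service_id << 8) + server_id
    )
    assert addr in NETWORK
    return addr

# Service 25 on server 27:
print(service_address(25, 27))  # 10.1.25.27
```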

Broadcast frames in this environment were limited to the occasional ARP query.

Host count is irrelevant.
I expect to make a less convincing case here, but my point is that when talking about a virtualized data center, I don't believe we should care about how many (virtual) hosts (IP stacks?) share a VLAN.

The previous example used lots of IP addresses, but only had a small number of hosts present.  Now I want to flip the proportions around.  Let's imagine instead that we have 10,000 virtual machines in a VLAN, but were somehow able to virtualize them into a single pair of impossibly-large servers.

Is there any problem now?  Our STP topology is ridiculously small at just two ports.  If the ESX hosts and vSwitches are able to handle the traffic pushed into them, why would we be inclined to say that this broadcast domain is "too big?"

Broadcast domain sizing overlooks the impact on shared resources.
Almost all discussions of broadcast domain sizing overlook the fact that we need to consider what one VLAN will do to another when they share common resources.

Obvious points of contention are 802.1Q trunks which share bandwidth between all VLANs.  Less obvious points of contention are shared ASICs and buffers within a switch.  If you've ever noticed how a 100Mb/s server can hurt seven of his neighbors on a shared-ASIC switching architecture, you know what I'm getting at.

Splitting clients into different VLANs doesn't help if the switching capacity isn't there to back it up, but discussions of subnet sizing usually overlook this detail.

How are the edge ports configured?
Let's imagine that we've got a VMware vSwitch with 8 pNICs connected to 8 switch ports.

If those 8 switch ports are configured with a static aggregation (on the switch end) and IP hash balancing (on the VMware end), then we've got a single 8Gb/s port from spanning tree's perspective.  If the environment has background "noise" consisting of 100Mb/s of garbage broadcast traffic, then the ESX host gets 100Mb/s of garbage representing 1.25% of its incoming bandwidth capacity.  Not great, but not the end of the world.

If those 8 ports are configured with the default "host pinning" mechanism, then the switches have eight 1Gb/s ports from spanning tree's perspective.  The 100Mb/s of garbage is multiplied eight times.  The server gets 800Mb/s of garbage representing 10% of its incoming bandwidth capacity.  Yuck.
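The arithmetic from the two scenarios above is simple enough to capture in a sketch.  The key variable is how many copies of the flooded traffic the server receives, which is one per spanning tree edge port:

```python
# Aggregation-vs-pinning arithmetic from the two scenarios above.
# With pinning, every uplink is a distinct STP edge port, so each
# one receives its own copy of the flooded traffic.

def garbage_fraction(noise_bps, nic_bps, nic_count, pinned):
    """Fraction of server ingress capacity consumed by flooded noise."""
    total_capacity = nic_bps * nic_count
    copies = nic_count if pinned else 1  # one copy per STP edge port
    return (noise_bps * copies) / total_capacity

NOISE = 100e6  # 100 Mb/s of background broadcast "noise"
NIC = 1e9      # 1 Gb/s pNICs
print(garbage_fraction(NOISE, NIC, 8, pinned=False))  # 0.0125 -> 1.25%
print(garbage_fraction(NOISE, NIC, 8, pinned=True))   # 0.1    -> 10%
```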

This distinction is important, and completely overlooked by most discussions of broadcast domain size.

So, what should we be looking at?
Sheesh, you want answers?  All I said was "we're doing it wrong", not "I have answers!"

First, I think we should be looking at the number of spanning tree edge ports.  This metric represents both the physical size of the broadcast domain and the impact on our ESX hosts.

Second, I think we should be talking about density of VLANs on trunks.  Where possible, it might be worth splitting up an ESX domain so that only certain VLANs are available on certain servers.  If the environment consists of LOTS of very sparsely populated VLANs, then an automagic VLAN pruning scheme might be worth deploying.

Third, I think we need to look at what the servers are doing.  Lots of broadcast traffic?  Maybe we should have disabled NetBEUI?  Maybe we shouldn't mingle the trading floor support system with the general population?

I don't have a clear strategy about how to handle discussions about broadcast domain sizing, but I'm absolutely convinced that discussions of host count and subnet size miss the point.  There's risk here, but slicing the virtual server population into lots of little VLANs obviously doesn't fix anything that a network engineer cares about.

Do you have an idea about how to better measure these problems so that we can have more useful discussions?  Please share!