Monday, June 18, 2012

Nexus 7004 - WTF?

Where's The Fabric?

Nexus 7004 photo by Ethan Banks
On display at Cisco Live in San Diego was a new Nexus 7000 chassis, the 4-slot Nexus 7004.

The first thing that I noticed about this chassis is that it doesn't take the same power supplies as its 9-, 10-, and 18-slot siblings. According to the guy at the Cisco booth, the 7004 uses 3kW power supplies that will live in the four small slots at the bottom of the chassis.

The next thing I looked for were the fabric modules. Imagine my surprise to learn that the 7004 doesn't have fabric modules!

No fabric modules? This is pretty interesting, but before we get into the specifics, a quick recap of how the larger Nexus 7K chassis work is probably in order...

The Nexus 7000 uses a 3-stage fabric where frames/packets moving from one line card to another pass through three distinct fabric layers:
  1. Fabric on the ingress line card - The fabric here connects the various replication engine or switch-on-chip (SoC) ASICs together. Traffic that ingresses and egresses on the same line card doesn't need to go any further than this fabric layer. Traffic destined for a different card or the supervisor progresses to:
  2. Chassis fabric - There are up to five "fabric modules" installed in the chassis. These modules each provide either 46 Gb/s or 110 Gb/s per slot to the line cards, depending on which card we're talking about.
  3. Fabric on the egress line card - This is the same story as the ingress card's fabric, but our data is hitting this fabric from the chassis fabric, rather than from a front panel port.
Proxy-routed packets in an F1/M1 chassis take an even more complicated path, because they have the privilege of hitting the chassis fabric twice as they traverse up to 3 different line cards: 5 fabric hops in total.
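
To keep the hop counting straight, here's a quick sketch that tallies those stages in code. It's nothing more than my mental model of the paths described above, not anything official:

```python
# A quick tally of fabric stages per traffic pattern, as described above.
# This is just my mental model of the path, not anything official.

def fabric_hops(same_card, proxy_routed=False):
    """Return the list of fabric stages a frame touches."""
    if same_card:
        # Local switching: only the ingress card's own fabric is involved.
        return ["ingress line card fabric"]
    if not proxy_routed:
        # Normal inter-card traffic: the classic 3-stage path.
        return ["ingress line card fabric",
                "chassis fabric (fabric modules)",
                "egress line card fabric"]
    # Proxy-routed F1 traffic gets punted to an M1 card for the L3 lookup,
    # so it crosses the chassis fabric twice and up to 3 line card fabrics.
    return ["ingress (F1) line card fabric",
            "chassis fabric (fabric modules)",
            "proxy (M1) line card fabric",
            "chassis fabric (fabric modules)",
            "egress line card fabric"]

print(len(fabric_hops(same_card=True)))                      # 1 hop
print(len(fabric_hops(same_card=False)))                     # 3 hops
print(len(fabric_hops(same_card=False, proxy_routed=True)))  # 5 hops
```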

The interconnections between line cards and fabric modules look something like the following:
Nexus 7009/7010/7018 with fabric modules

Note that I'm not completely sure about the speed of the interconnection between supervisor and fabric (black lines), but it doesn't really matter because control plane stuff is pretty low bandwidth and aggressively throttled (CoPP) anyway.

The important takeaways from the drawing above are:

  1. Fabric 2 modules can run at M1/F1 speeds (46 Gb/s per slot) and at F2 speeds (110 Gb/s per slot) simultaneously. They provide the right amount of bandwidth to each slot.
  2. While there are 5 fabric modules per chassis, each card (at least the M1/F1/F2 - not sure about supervisors) has 10 fabric connections, with 2 connections to each fabric module. The per-channel arithmetic is sketched below.
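
Here's that per-slot and per-channel arithmetic spelled out as a quick sketch. The only inputs are the figures from the drawing: five fabric modules, two channels from each line card to each module, and 46 or 110 Gb/s per slot per module:

```python
# Back-of-the-napkin fabric math for the 7009/7010/7018, using the per-slot
# figures from the drawing above.

FABRIC_MODULES = 5        # Fabric 2 modules per chassis
CHANNELS_PER_MODULE = 2   # each line card has 2 connections to each module

per_slot_per_module = {"M1/F1": 46, "F2": 110}   # Gb/s, per slot, per module

for card, gbps in per_slot_per_module.items():
    total_per_slot = gbps * FABRIC_MODULES
    per_channel = gbps // CHANNELS_PER_MODULE
    channels = FABRIC_MODULES * CHANNELS_PER_MODULE
    print(f"{card}: {total_per_slot} Gb/s per slot "
          f"({channels} channels of {per_channel} Gb/s each)")

# M1/F1: 230 Gb/s per slot (10 channels of 23 Gb/s each)
# F2:    550 Gb/s per slot (10 channels of 55 Gb/s each)
```
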
It's been commonly explained that the 7004 chassis does away with the fabric modules by interconnecting its pair of payload slots (the other two slots are dedicated to supervisors) back-to-back, eliminating the chassis fabric stage of the 3-stage switching fabric.

...Okay... But what about control plane traffic? On the earlier chassis, the control plane traffic takes the following path:
  1. Ingress port
  2. Ingress line card fabric
  3. Chassis fabric
  4. Supervisor
With no chassis fabric, it appears that there's no way for control plane traffic to get from a line card to the supervisor. Well, it turns out that the 7004 doesn't dedicate all of the line card's fabric to a back-to-back connection.

I think that the following diagram explains how it works, but I haven't seen anything official: 8 of the fabric channels on each line card connect to the peer line card, and one channel is dedicated to each supervisor. Something like this:

Nexus 7004 - no fabric modules
Cool, now we have a card <-> supervisor path, but we don't have a full line-rate fabric connection between the two line cards in the 7004. Only 8 fabric channels are available to the data plane because two channels have been dedicated for communication with the supervisors.

F2 cards clearly become oversubscribed: they've got 480 Gb/s of front panel bandwidth and adequate forwarding horsepower, but only 440 Gb/s of fabric (eight 55 Gb/s fabric channels).

I believe that F1 cards would have the same problem, with only 184 Gb/s (eight 23 Gb/s fabric channels), but now we're talking about an F1-only chassis with no L3 capability. I'm not sure whether that is even a supported configuration.

M1 cards would not have a problem, because their relatively modest forwarding capability would not be compromised by an 8-channel fabric.

Having said all that, the oversubscription on the F2 module probably doesn't matter: hitting 440 Gb/s on the back-to-back connection would require 44 front panel ports on a single module to be forwarding only to the other module. Just a little bit of card-local switching traffic is enough to ensure that the back-to-back fabric is not oversubscribed.
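
Putting numbers on all of that, here's the same sort of napkin math for the 7004, assuming the channel split I guessed at above (10 channels per line card: 8 to the peer card, 1 to each supervisor):

```python
# Napkin math for the 7004's back-to-back fabric, assuming the split guessed
# at above: 10 channels per line card, 8 to the peer card, 1 to each of the
# two supervisors.

CHANNELS_PER_CARD = 10
CHANNELS_TO_SUPS = 2
channels_to_peer = CHANNELS_PER_CARD - CHANNELS_TO_SUPS    # 8

per_channel_gbps = {"F1": 23, "F2": 55}    # Gb/s per fabric channel

for card, gbps in per_channel_gbps.items():
    print(f"{card}: {channels_to_peer * gbps} Gb/s of back-to-back fabric")
# F1: 184 Gb/s of back-to-back fabric
# F2: 440 Gb/s of back-to-back fabric

# The F2 card has 480 Gb/s of front panel bandwidth, so it's nominally
# oversubscribed, but hitting 440 Gb/s card-to-card would need this many
# 10 Gb/s front panel ports forwarding exclusively to the peer card:
print(440 // 10, "ports")    # 44
```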

Brocade Fabric Symposium

Network Fabric?
Photo and weave by Travis Meinolf
The first network vendor I met on my recent trip to San Jose (disclaimer about who paid for what here) was Brocade. Their presentations are what finally got the wheels turning in my head: MLAG is cool, but it's not going to be enough. Ethernet fabrics will be relevant to my non-technical enterprise customers.

The Cliffs Notes version of the Brocade presentation is that they make data center network switches with Ethernet ports, and because of Brocade's storied history with SAN fabrics, doing multipath bridged Ethernet fabrics is second nature for them.

Three things have stood out about the Brocade presentations that I've seen:

  1. Brocade is the only vendor I've seen who makes a point of talking about power consumed per unit of bandwidth. I presume that the numbers must be compelling, or else they wouldn't bring it up, but I have not done a comparison on this point.
  2. Per-packet load balancing on aggregate links. This is really cool; see below.
  3. MLAG attachment of non-fabric switches to arbitrary fabric nodes. Also really cool, maybe I'll get around to covering this one day.
Per-packet Load Balancing on Aggregate Links
We all know that the Link Selection Algorithms (LSA) used by aggregate links (LACP, EtherChannel, bonded interfaces... Some vendors even call them trunks) choose the egress link by hashing the frame/packet header.

LSAs work this way in order to maintain ordered delivery of frames: putting every frame belonging to a particular flow onto the same egress interface ensures that the frames won't get mixed up on their way to their destination. Ordered delivery is critical, but strict flow -> link mapping means that loads don't get balanced evenly most of the time (a toy sketch of the hashing follows the list below). It also means that:
  • You may have to play funny games with the number of links you're aggregating.
  • Each flow is limited to the maximum bandwidth of a single link member.
  • Fragments of too-big IP packets might get mis-ordered if your LSA uses protocol header data.
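
To make the flow-to-link pinning concrete, here's a toy sketch of what hash-based link selection boils down to. The CRC-over-5-tuple hash is purely illustrative, not any particular vendor's algorithm:

```python
# Toy illustration of hash-based link selection on an aggregate link:
# every frame of a given flow hashes to the same member link.

import zlib

LINKS = ["link-0", "link-1"]    # a two-member aggregate

def select_link(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the flow's 5-tuple and map it to one member link."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return LINKS[zlib.crc32(key) % len(LINKS)]

# Two flows, several frames each: every frame of a flow lands on the same
# link, no matter how busy that link is or how idle the other one is.
for _ in range(3):
    print("green flow ->", select_link("10.0.0.1", "10.0.0.2", 6, 33333, 80))
    print("blue flow  ->", select_link("10.0.0.3", "10.0.0.2", 6, 44444, 80))
```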

Brocade's got a completely different take on this problem, and it kind of blew my mind: They do per-packet load balancing!

The following animation illustrates why per-packet load balancing is helpful:


Pay particular attention to the two frames belonging to the green flow. Don't pay attention to the aggregation's oval icon moving around and alternating transparency. I'm a better network admin than I am an animator :)

When the first green flow frame arrives at the left switch, only the lower link is free, so that's where the frame is forwarded.

When the 2nd green frame arrives at the left switch, both transmit interfaces are busy, so it sits in a buffer for a short time, until the upper link finishes transmitting the blue frame. As soon as the upper link becomes available, the 2nd green frame makes use of it, even though the earlier green frame used the lower link.

This is way cool stuff. It allows for better utilization of resources, and lower congestion-induced jitter than you'd get with other vendors' implementations.

In the example above, there's little possibility that frames belonging to any given flow will get mis-ordered because the packets are queued in order, and the various links in the aggregation have equal latency.

But what if the latency on the links isn't equal?


Now the two links in our aggregation are of different lengths, so they exhibit different latencies.

Just like in the previous example, the first green frame uses the lower link (now with extra latency) and the second green frame is queued due to congestion. But when the blue frame clears the upper interface, the green frame doesn't follow directly on its heels. Instead, the green frame sits in queue a bit longer (note the long gap between blue and green on the upper link) to ensure that it doesn't arrive at the right switch until after the earlier green frame has completely arrived.

Neato.

Now, does the Brocade implementation really work this way? I have no idea :) Heck, I don't even know if these are cut-through switches, but that's how I've drawn them. Even if this isn't exactly how it works, the per-packet load balancing scheme and the extra delay to compensate for mismatched latency are real, and they're really cool.
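
For what it's worth, here's a minimal sketch of the scheduling behavior as I've described and drawn it: a frame goes out whichever member link frees up first, and gets held in the queue just long enough that it can't finish arriving before an earlier frame of the same flow. This is purely my interpretation of the animations, not Brocade's implementation.

```python
# Minimal sketch of per-packet load balancing with latency compensation,
# based only on my reading of the animations above (not Brocade's code).

LINKS = [
    {"name": "upper", "latency": 1.0, "free_at": 0.0},   # short link
    {"name": "lower", "latency": 5.0, "free_at": 0.0},   # long link, more latency
]

last_arrival = {}    # flow -> time its most recent frame finished arriving

def send(frame, flow, arrives, tx_time):
    """Pick the member link that frees up first, then delay the frame just
    enough that it can't overtake an earlier frame of the same flow."""
    link = min(LINKS, key=lambda lk: max(lk["free_at"], arrives))
    start = max(link["free_at"], arrives)
    finish = start + tx_time + link["latency"]
    if finish < last_arrival.get(flow, 0.0):
        # Hold the frame in the queue a bit longer so it arrives only after
        # the flow's previous frame has completely arrived.
        start += last_arrival[flow] - finish
        finish = last_arrival[flow]
    link["free_at"] = start + tx_time
    last_arrival[flow] = finish
    print(f"{frame}: {link['name']} link, tx starts t={start:.1f}, "
          f"fully arrived t={finish:.1f}")

send("blue-1",  "blue",  0.0, 2.0)  # grabs the short upper link
send("green-1", "green", 0.0, 2.0)  # upper is busy, so it takes the long lower link
send("green-2", "green", 0.5, 2.0)  # goes out the upper link, but is held back
                                    # so it can't beat green-1 to the far end
```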

The gotcha about this stuff? All member links need to terminate on a single ASIC within each switch. You're not going to be spreading these aggregate links across line cards, so this sort of aggregation is strictly for scaling bandwidth, and not useful for guarding against failure scenarios where a whole ASIC is at risk.