Monday, April 9, 2012

Ethernet Fabric - The Bulb Glows Dimly

People talking about "Ethernet Fabrics" are usually describing a scheme in which many switches are interconnected and all links are available for forwarding Ethernet frames.

Rather than allowing STP to block links until it forms a loop-free topology, Fabrics include an L2 multipath scheme which forwards frames along the "best" path between any two endpoints.

Brandon Carroll outlined the basics of an Ethernet fabric here, and his description leaves me with the same question that I've had since I first heard about this technology: What problem can I solve with it?

The lightbulb over my head began to glow during one of Brocade's presentations (pop quiz: what switch is the STP root in figure 1 of the linked document?) at the Gestalt IT Fabric Symposium a couple of weeks ago. In that session, Chip Copper suggested that a traditional data center topology with many blocked links and sub-optimal paths like this one:

Three-tier architecture riddled with downsides

might be rearranged to look like this:
Flat topology. All links are available to forward traffic. It's all fabricy and stuff.

The advantages of the "Fabric" topology are obvious:
  • Better path selection: It's only a single hop between any two Access switches, where the previous design required as many as four hops.
  • Fewer devices: We're down from 11 network devices to 6
  • Fewer links: We're down from 19 infrastructure links to 15
  • More bandwidth: Aggregate bandwidth available between access devices is up from 120 Gb/s to 300 Gb/s, assuming 10 Gb/s links (a quick back-of-the-envelope check follows below)
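Here's a rough way to reproduce those numbers in Python. The per-switch uplink counts are my assumptions (two 10 Gb/s uplinks per access switch in the three-tier picture, a full mesh of six access switches in the fabric picture), not something the diagrams spell out:

```python
# Back-of-the-envelope arithmetic for the two diagrams above. Assumptions
# (mine, not the diagrams'): 6 access switches, 10 Gb/s links, two uplinks
# per access switch in the three-tier design, full mesh in the fabric design.
ACCESS_SWITCHES = 6
LINK_GBPS = 10

def full_mesh_links(n):
    """A full mesh of n switches needs n*(n-1)/2 links."""
    return n * (n - 1) // 2

fabric_links = full_mesh_links(ACCESS_SWITCHES)        # 15 links
fabric_uplinks_per_switch = ACCESS_SWITCHES - 1        # 5 fabric links each
three_tier_uplinks_per_switch = 2                      # assumed dual uplinks

# Counting "aggregate bandwidth between access devices" as the sum of each
# access switch's uplink capacity:
bw_three_tier = ACCESS_SWITCHES * three_tier_uplinks_per_switch * LINK_GBPS  # 120 Gb/s
bw_fabric = ACCESS_SWITCHES * fabric_uplinks_per_switch * LINK_GBPS          # 300 Gb/s

print(f"fabric links between access switches:  {fabric_links}")
print(f"three-tier aggregate access bandwidth: {bw_three_tier} Gb/s")
print(f"fabric aggregate access bandwidth:     {bw_fabric} Gb/s")
```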
If I were building a network to support a specialized, self-contained compute cluster, then this sort of design would be an obvious choice.

But that's not what my customers are building. The networks in my customers' data centers need to support modular scaling (full mesh designs like the one I've pictured here don't scale at all, let alone modularly), and they need any-VLAN-anywhere support from the physical network.

So how does a fabric help a typical enterprise?
The scale of the 3-tier diagram I presented earlier is way off, and that's why fully meshing the Top of Rack (ToR) devices looks like a viable option. A more realistic topology in a large enterprise data center might have 10-20 pairs of aggregation devices and hundreds of Top of Rack devices living in the server cabinets.
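To put rough numbers on that (the switch counts below are illustrative assumptions, not taken from any particular data center): a full mesh of n switches needs n(n-1)/2 links and burns n-1 fabric ports on every switch, which gets out of hand very quickly.

```python
# Why a full mesh is hopeless at the ToR layer but reasonable at the
# aggregation layer. Switch counts are illustrative assumptions.
def full_mesh(n):
    links = n * (n - 1) // 2    # total links in a full mesh of n switches
    ports = n - 1               # fabric ports consumed on every switch
    return links, ports

for label, n in [("ToR switches", 300), ("aggregation switches", 20)]:
    links, ports = full_mesh(n)
    print(f"{n:4d} {label:21s}: {links:6d} mesh links, {ports:3d} fabric ports per switch")
```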

Obviously, we can't fully mesh hundreds of ToR devices, but we can mesh the aggregation layer and eliminate the core! The small compute cluster fabric topology isn't very useful or interesting to me, but eliminating the core from a typical enterprise data center is really nifty. The following picture shows a full mesh of aggregation switches with fabric-enabled access switches connected around the perimeter:
Two-tier fabric design
Advantages of this design:
  • Access switches are never more than 3 hops from each other (a quick sanity check follows this list).
  • Hop count can be lowered further just by running a cable: a direct link between two busy access switches becomes a one-hop path.
  • No choke point at the network core.
  • Scaling: The most densely populated switch shown here only uses 13 links. This can grow big.
  • Scaling: Monitoring shows a link running hot? Turn up a parallel link.
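Here's the sanity check promised in the first bullet: a tiny breadth-first search over a made-up two-tier topology. The exact counts (four fully meshed aggregation switches, twelve access switches, each dual-homed to two aggregation switches) are my assumptions, not the diagram's:

```python
# Sanity check of the "never more than 3 hops" claim over a hypothetical
# two-tier fabric: 4 fully meshed aggregation switches, 12 access switches,
# each access switch dual-homed to two aggregation switches.
from collections import deque
from itertools import combinations

aggs = [f"agg{i}" for i in range(4)]
access = [f"acc{i}" for i in range(12)]
adj = {sw: set() for sw in aggs + access}

def link(a, b):
    adj[a].add(b)
    adj[b].add(a)

for a, b in combinations(aggs, 2):     # full mesh of the aggregation layer
    link(a, b)

for i, sw in enumerate(access):        # dual-home each access switch
    link(sw, aggs[i % 4])
    link(sw, aggs[(i + 1) % 4])

def hops(src, dst):
    """Number of links on the shortest path between two switches (BFS)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))

worst = max(hops(a, b) for a, b in combinations(access, 2))
print(f"worst-case access-to-access hop count: {worst}")   # 3 with these assumptions
```
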
Why didn't I see this before?
Honestly, I'm not sure why it took so long to pound this fabric use case through my skull. I think there are a number of factors:
  • Marketing materials for fabrics tend to focus on the simple full mesh case, and go out of their way to bash the three-tier design. A two-tier fabric design doesn't sound different enough.
  • Fabric folks also talk a lot about what Josh O'Brien calls "monkeymesh" - the idea that we can build links all willy-nilly and have things work. One vendor reportedly has a commercial with children cabling the network however they see fit, and everything works fine. This is not a useful philosophy. Structure is good!
  • The proposed topology represents a rip-and-replace of the network core. This probably hasn't been done too many times yet :-)

10 comments:

  1. I like the design and think we'll see these more in the future.

    But...where is your L3 gateway for each network?

    ReplyDelete
    Replies
    1. Re: L3 gateway. Good catch. This is the topic of a couple of upcoming posts. Short answer: it depends on the vendor :-)

      Delete
    2. Ha, I knew it. I was going to ask "based on your vendor of choice." These designs are great and really make you think about what the requirements REALLY are. Not designing networks a certain way just for the heck of it (even if we end back up where we started!) makes one think twice about WHY. I wonder if we'll see networks like this just attach to maybe 2 x [cisco] ASR routers for all L3 connectivity if east/west traffic truly dominates. 9006 routers have some amazing bundle pricing right now and they can provide some nice advanced WAN features as well. Otherwise "any" L3 switch will suffice as well, but then do you pick two aggs to connect to this "L3 core" or connect them all to minimize hops? Anyway, the nice thing is that these types of discussions are happening more these days.

      All depends on if customers really know the requirements or if over provisioning will stay the norm...just in case :)

      Delete
  2. Nice to see a post going off the beaten path!

    This looks like a collapsed core design, just without dedicated aggregation switches for each access module. I can definitely see the flexibility benefit of this, but I would be concerned about the scaling of the aggregation switches when you need to grow them. On the other hand, by the time you grow them to that point you are getting into a good-sized DC and might need to consider dropping in a core.

    Thanks for the post and getting my design gears turning this morning!!

    ReplyDelete
  3. Most vendors seem to be showing Clos-style designs for their fabrics, rather than a full mesh (or partial mesh topologies like Torus or Hypercube).

    Clos networks have a constant number of hops between any two hosts, and can scale to ridiculous proportions: more than 65536 host-facing ports with 64-port routers and 3 stages, with ZERO oversubscription. Of course, that comes at the cost of lots of long fiber links between layers as well as packaging and cabling issues.
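    (For what it's worth, here is one way to reproduce that port count, assuming the comment has the usual 3-tier fat-tree formula of k^3/4 host ports for k-port switches in mind:)

    ```python
    # Host-facing ports of a 3-tier fat-tree (folded Clos) built from k-port
    # switches: k**3 / 4. Assuming that's the design the comment describes.
    k = 64
    print(k ** 3 // 4)   # 65536 non-blocking host ports
    ```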

    Vendors also seem to be showing security and edge routers down at the leaves of the Clos network, which is fine since bisectional bandwidth is constant.

    Coming back to earth, HPC sites have used Torus/Hypercube topologies for years. They cost a *lot* less to build, and will probably work fine for the majority of datacenters, despite the slightly variable latency and number of hops. When each hop is less than 1μs, who cares if it takes 7 or 9 layer 2 hops to get across a datacenter? The bisection bandwidth is still there.

    ReplyDelete
  4. Wait a sec... 3 Aggs? When people say Agg, I think 7Ks. Is there a model for 3 7Ks?

    ReplyDelete
  5. @Will - It's a totally stupid design, but it's not mine. Like I said, it's Brocade's competitive example, and that's why I used it. You could do this, but it's ridiculous. It's a slightly cleaned-up version of the one that appears in the Brocade presentation I linked above. I'm disappointed that I haven't had any takers on the "pop quiz" I mentioned, because if you think this topology is ugly, you should see the presentation :-)

    ReplyDelete
    Replies
    1. Haha - wow that is an interesting topology. The new STP - it has no root. It just floats. I love the middle aggregation switch and the right core. They are going to have fun communicating with each other.

      Delete
  6. My first choice for root would be the first agg switch from the left, but even the Core on the right is blocking a link to it.

    ReplyDelete
  7. I'm going with one of the Cores being the root.

    ReplyDelete