Monday, April 9, 2012

Ethernet Fabric - The Bulb Glows Dimly

People talking about "Ethernet Fabrics" are usually describing a scheme in which many switches are interconnected and all links are available for forwarding Ethernet frames.

Rather than allowing STP to block links until it forms a loop-free topology, fabrics include an L2 multipath scheme that forwards frames along the "best" path between any two endpoints.

Brandon Carroll outlined the basics of an Ethernet fabric here, and his description leaves me with the same question that I've had since I first heard about this technology: What problem can I solve with it?

The lightbulb over my head began to glow during one of Brocade's presentations (pop quiz: what switch is the STP root in figure 1 of the linked document?) at the Gestalt IT Fabric Symposium a couple of weeks ago. In that session, Chip Copper suggested that a traditional data center topology with many blocked links and sub-optimal paths like this one:

Three-tier architecture riddled with downsides

might be rearranged to look like this:
Flat topology. All links are available to forward traffic. It's all fabricy and stuff.

The advantages of the "Fabric" topology are obvious:
  • Better path selection: It's only a single hop between any two access switches, where the previous design required as many as four hops.
  • Fewer devices: We're down from 11 network devices to 6.
  • Fewer links: We're down from 19 infrastructure links to 15.
  • More bandwidth: Aggregate bandwidth available between access devices is up from 120 Gb/s to 300 Gb/s (assuming 10 Gb/s links).
If I were building a network to support a specialized, self-contained compute cluster, this sort of design would be an obvious choice.

But that's not what my customers are building. The networks in my customers' data centers need to support modular scaling (full mesh designs like I've pictured here don't scale at all, let alone modularly) and they need any-vlan-anywhere support from the physical network.
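The scaling objection is easy to make concrete with a back-of-the-envelope calculation (the helper name below is my own, not from any vendor tool): a full mesh of n switches needs n·(n−1)/2 links, which grows quadratically.

```python
def full_mesh_links(n: int) -> int:
    """Links required to fully mesh n switches: n choose 2."""
    return n * (n - 1) // 2

# The 6-switch fabric above needs 15 links; hundreds of ToR
# switches would need tens of thousands of links (and ports).
for n in (6, 20, 200):
    print(n, full_mesh_links(n))  # 6 -> 15, 20 -> 190, 200 -> 19900
```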

So how does a fabric help a typical enterprise?
The scale of the 3-tier diagram I presented earlier is way off, and that's why fully meshing the Top of Rack (ToR) devices looks like a viable option. A more realistic topology in a large enterprise data center might have 10-20 pairs of aggregation devices and hundreds of Top of Rack devices living in the server cabinets.

Obviously, we can't fully mesh hundreds of ToR devices, but we can mesh the aggregation layer and eliminate the core! The small compute cluster fabric topology isn't very useful or interesting to me, but eliminating the core from a typical enterprise data center is really nifty. The following picture shows a full mesh of aggregation switches with fabric-enabled access switches connected around the perimeter:
Two-tier fabric design
Advantages of this design:
  • Access switches are never more than 3 hops from each other.
  • Hop count can be lowered where needed by running a direct cable between a busy pair of switches.
  • No choke point at the network core.
  • Scaling: The most densely populated switch shown here only uses 13 links, leaving plenty of ports for growth.
  • Scaling: Monitoring shows a link running hot? Turn up a parallel link.
Why didn't I see this before?
Honestly, I'm not sure why it took so long to pound this fabric use case through my skull. I think there are a number of factors:
  • Marketing materials for fabrics tend to focus on the simple full mesh case, and go out of their way to bash the three-tier design. A two-tier fabric design doesn't sound different enough.
  • Fabric folks also talk a lot about what Josh O'Brien calls "monkeymesh" - the idea that we can build links all willy-nilly and have things work. One vendor reportedly has a commercial with children cabling the network however they see fit, and everything works fine. This is not a useful philosophy. Structure is good!
  • The proposed topology represents a rip-and-replace of the network core. This probably hasn't been done too many times yet :-)


  1. I like the design and think we'll see these more in the future.

    But...where is your L3 gateway for each network?

    1. Re: L3 gateway. Good catch. This is the topic of a couple of upcoming posts. Short answer: it depends on the vendor :-)

    2. Ha. I knew it was going to be "based on your vendor of choice." These designs are great and really make you think about what the requirements REALLY are. Not designing networks a certain way just for the heck of it (even if we end back up where we started!) makes one think twice about WHY. I wonder if we'll see networks like this just attach to maybe 2 x [cisco] ASR routers for all L3 connectivity if east/west traffic truly dominates. 9006 routers have some amazing bundle pricing right now, and they can provide some nice advanced WAN features as well. Otherwise "any" L3 switch will suffice, but then do you pick two aggs to connect to this "L3 core," or connect them all to minimize hops? Anyway, the nice thing is that these types of discussions are happening more these days.

      All depends on if customers really know the requirements or if over provisioning will stay the norm...just in case :)

  2. Nice to see a post going off the beaten path!

    This looks like a collapsed core design, just without dedicated aggregation switches for each access module. I can definitely see the flexibility benefit of this, but I would be concerned about scaling the aggregation switches when you need to grow them. On the other hand, by the time you've grown to that point you're running a good-sized DC and might need to consider dropping in a core.

    Thanks for the post and getting my design gears turning this morning!!

  3. Most vendors seem to be showing Clos-style designs for their fabrics, rather than a full mesh (or partial mesh topologies like Torus or Hypercube).

    Clos networks have a constant number of hops between any two hosts, and can scale to ridiculous proportions: more than 65536 host-facing ports with 64-port routers and 3 stages, with ZERO oversubscription. Of course, that comes at the cost of lots of long fiber links between layers as well as packaging and cabling issues.

    Vendors also seem to be showing security and edge routers down at the leaves of the Clos network, which is fine since bisectional bandwidth is constant.

    Coming back to earth, HPC sites have used Torus/Hypercube topologies for years. They cost a *lot* less to build, and will probably work fine for the majority of datacenters, despite the slightly variable latency and number of hops. When each hop is less than 1μs, who cares if it takes 7 or 9 layer 2 hops to get across a datacenter? The bisection bandwidth is still there.
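(Checking the scaling arithmetic in the comment above, using the standard fat-tree construction of a folded 3-stage Clos, in which k-port switches support k³/4 non-blocking host-facing ports.)

```python
def fat_tree_hosts(k: int) -> int:
    """Host-facing ports in a 3-stage folded Clos (fat-tree) of k-port switches."""
    return k ** 3 // 4

print(fat_tree_hosts(64))  # 65536, matching the figure quoted above
```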

  4. Wait a sec....3 Aggs? When people say Agg I think 7Ks. Is there a model for 3 7Ks?

  5. @Will - It's a totally stupid design, but it's not mine. It's a slightly cleaned-up version of Brocade's competitive example from the presentation I linked above, and that's why I used it. You could build this, but it's ridiculous. I'm disappointed that I haven't had any takers on the "pop quiz" I mentioned, because if you think this topology is ugly, you should see the presentation :-)

    1. Haha - wow that is an interesting topology. The new STP - it has no root. It just floats. I love the middle aggregation switch and the right core. They are going to have fun communicating with each other.

  6. My first choice for root would be the first agg switch from the left, but even the Core on the right is blocking a link to it.

  7. I'm going with one of the Cores being the root.