Monday, September 6, 2010

Link Aggregation, Load Balancing and Redundancy

Link aggregation is a pretty easy technology to grasp: take many links and bundle them into a single logical link. No mystery there. When a frame is destined to traverse the aggregated link, the switch needs to make a decision: which member link of the aggregation should be used?

While it might seem convenient, the switch must not round-robin the frames across the members of the aggregation, because that could lead to frames belonging to a particular flow arriving at their destination in a different order than they were sent by the originating station. Out-of-order delivery can be a big problem for some flows, so that sort of behavior is explicitly forbidden. Ordered frame delivery is one of the hard invariants of LAN bridging behavior, and is codified by ISO/IEC Standard 15802-1. Instead, the switch employs a deterministic algorithm to ensure that every frame belonging to a given flow crosses the same link. This way, frames that need to stay in order aren't racing each other down parallel paths. Note that there is no requirement for the same link to be used for the flow's return traffic, nor is there a requirement to use the same link selection algorithm on both ends of the aggregation. In fact, there are cases where choosing the same selection algorithm is just flat wrong. Consider the following example:


Four servers are each pushing 30Mb/s across a 4-way aggregation of 100Mb/s links to a gigabit-attached router. Let's assume Switch A selects the link for each frame with a modulus operation: src_mac % link_count.

Server A: 0x0C % 4 = 0
Server B: 0x1B % 4 = 3
Server C: 0x2A % 4 = 2
Server D: 0x39 % 4 = 1

Great! The load will balance perfectly! Every server's traffic will traverse a different link. What about the other direction, if Switch B uses the same algorithm? The source MAC address on every frame will be 0000.0000.005D (the router). Link 1 (0x5D % 4 = 1) will always be selected, no matter which server the router is talking to.
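To make the arithmetic concrete, here is a minimal Python sketch of the toy selection rule described above. The function and variable names are made up for illustration, and the MAC values are just the numbers used in this example, not a real switch's hash input.

```python
# Toy model of the example: pick a member link as src_mac % link_count.
# The MAC values are the ones used in the post; all names are illustrative.
SERVERS = {"Server A": 0x0C, "Server B": 0x1B, "Server C": 0x2A, "Server D": 0x39}
ROUTER = 0x5D  # 0000.0000.005D


def select_link(src_mac: int, link_count: int) -> int:
    """Return the index of the member link chosen for a frame."""
    return src_mac % link_count


# Server-to-router direction: each server has its own source MAC.
to_router = {name: select_link(mac, 4) for name, mac in SERVERS.items()}

# Router-to-server direction: the source MAC is always the router's.
to_servers = {name: select_link(ROUTER, 4) for name in SERVERS}

print(to_router)   # one distinct link per server
print(to_servers)  # every flow lands on link 1
```

Because the function depends only on the source MAC, any flow from a given station always maps to the same link, which is what keeps frames in order.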

So, for this to load balance nicely in both directions, we want Switch A to balance according to source MAC addresses, and we want Switch B to balance according to destination MAC addresses.
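A rough sketch of that per-switch choice, assuming the same toy modulus rule as the example. The `Frame` type and the "src-mac" / "dst-mac" labels are hypothetical names chosen for this illustration, not any vendor's API.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    src_mac: int
    dst_mac: int


def select_link(frame: Frame, link_count: int, key: str) -> int:
    """Pick a member link from whichever address field the switch hashes on."""
    if key == "src-mac":
        value = frame.src_mac
    elif key == "dst-mac":
        value = frame.dst_mac
    else:
        raise ValueError(f"unknown key: {key}")
    return value % link_count


SERVER_MACS = [0x0C, 0x1B, 0x2A, 0x39]
ROUTER_MAC = 0x5D

# Switch A forwards server -> router frames and hashes on the source MAC.
switch_a = [select_link(Frame(s, ROUTER_MAC), 4, "src-mac") for s in SERVER_MACS]

# Switch B forwards router -> server frames and hashes on the destination MAC.
switch_b = [select_link(Frame(ROUTER_MAC, s), 4, "dst-mac") for s in SERVER_MACS]

# Switch B configured like Switch A: everything piles onto one link.
switch_b_wrong = [select_link(Frame(ROUTER_MAC, s), 4, "src-mac") for s in SERVER_MACS]

print(switch_a, switch_b, switch_b_wrong)
```

In both well-chosen cases the field that varies between flows is the server's address, so that's the field worth hashing on.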

Failure Scenario
Back to looking at server-generated traffic.  We're still doing link selection by modulus of the source MAC address.  If one of our links fails, the switch changes the number used in the modulus operation.  Instead of taking the modulus by 4, it takes the modulus by 3 (the new link count).

Server A: 0x0C % 3 = 0
Server B: 0x1B % 3 = 0
Server C: 0x2A % 3 = 0
Server D: 0x39 % 3 = 0

Traffic distribution with 3 links

Now we're trying to push the aggregate of all four servers (120Mb/s) across a single 100Mb/s link.  Bummer.
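The same failure can be reproduced with a few lines of Python. This is only a sketch of the simplified modulus model used in this post, with the 30Mb/s per-server rate and 100Mb/s link capacity taken from the example; the names are illustrative.

```python
# Per-link load under the toy rule src_mac % link_count.
RATE_MBPS = 30             # each server's offered load (from the example)
LINK_CAPACITY_MBPS = 100   # capacity of one member link (from the example)
SERVER_MACS = {"Server A": 0x0C, "Server B": 0x1B, "Server C": 0x2A, "Server D": 0x39}


def load_per_link(link_count: int) -> dict:
    """Sum the offered load that lands on each member link."""
    load = {link: 0 for link in range(link_count)}
    for mac in SERVER_MACS.values():
        load[mac % link_count] += RATE_MBPS
    return load


for link_count in (4, 3):
    load = load_per_link(link_count)
    overloaded = [link for link, mbps in load.items() if mbps > LINK_CAPACITY_MBPS]
    print(link_count, "links:", load, "overloaded:", overloaded)
```

With 4 links every member carries 30Mb/s; with 3 links, link 0 is offered 120Mb/s while the other two sit idle.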

Obviously we don't want this to happen. We can add a fifth link so that a single failure won't drop us into the unlucky modulus-by-3 situation. Let's see how things look with 5 links:

Server A: 0x0C % 5 = 2
Server B: 0x1B % 5 = 2
Server C: 0x2A % 5 = 2
Server D: 0x39 % 5 = 2

Traffic distribution with 5 links

Nuts. This is not an improvement. 4 links balances beautifully, but the balance is completely upset if we use 3 or 5 links. Starting with only 2 links isn't safe either, because a single failure would leave just one link carrying all 120Mb/s. What to do? Allow me to introduce the 'lacp max-bundle' command. With 'lacp max-bundle 4', the switch will never bring more than 4 links into the aggregation. The fifth link in the example above will be in a standby mode. Should one of the 4 active links fail, the standby link will be brought into the aggregation so that you'll always have 4 (not 5!) links, even after a failure.
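Here is a toy Python model of that active/standby behavior. It is a sketch under loud assumptions: the function names are invented, and "lowest-numbered healthy links become active" is a simplification for illustration (a real switch applies its own rules, such as LACP port priority, to decide which links stand by).

```python
SERVER_MACS = {"Server A": 0x0C, "Server B": 0x1B, "Server C": 0x2A, "Server D": 0x39}


def split_bundle(healthy_links, max_bundle):
    """Return (active, standby): at most max_bundle links are active.

    Assumption: the lowest-numbered healthy links are chosen as active.
    """
    ordered = sorted(healthy_links)
    return ordered[:max_bundle], ordered[max_bundle:]


def distribute(active):
    """Map each server to a physical link using src_mac % active link count."""
    return {name: active[mac % len(active)] for name, mac in SERVER_MACS.items()}


# Five cabled links, but only four allowed in the bundle.
active, standby = split_bundle({0, 1, 2, 3, 4}, max_bundle=4)
before = distribute(active)

# Link 1 fails: the standby link is promoted, so the modulus stays at 4.
active_after, standby_after = split_bundle({0, 2, 3, 4}, max_bundle=4)
after = distribute(active_after)

print(active, standby, before)
print(active_after, standby_after, after)
```

The point of the model is only that the divisor stays at 4 before and after the failure, so each server still ends up on its own link.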

Reality
I made these examples simple, and I stacked the deck to make a point. I concede that the hashing algorithm doesn't work exactly as I've described. In reality, the frames are hashed into a fixed number of buckets, which are themselves distributed among the links. Depending on how the chips fall, this may make the situation better or worse. But it doesn't make it simpler :-) I'm also aware that the selection algorithms can make use of more than just the MAC addresses. In fact, that ability is also codified in the ISO standard I cited; the earlier standard would not have allowed information above layer 2 to be used for this purpose.
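For a feel of the two-stage approach, here is a hedged Python sketch. The bucket count of 8, the `mac % BUCKETS` hash, and the round-robin assignment of buckets to links are all assumptions picked for illustration; actual hardware uses its own hash and bucket layout.

```python
BUCKETS = 8  # assumed bucket count, for illustration only
SERVER_MACS = {"Server A": 0x0C, "Server B": 0x1B, "Server C": 0x2A, "Server D": 0x39}


def bucket_table(link_count: int) -> list:
    """Spread the fixed set of buckets across the links, round-robin."""
    return [bucket % link_count for bucket in range(BUCKETS)]


def select_link(mac: int, table: list) -> int:
    """Stage 1: hash the frame into a bucket. Stage 2: look up the bucket's link."""
    bucket = mac % BUCKETS
    return table[bucket]


for link_count in (3, 4, 5):
    table = bucket_table(link_count)
    chosen = {name: select_link(mac, table) for name, mac in SERVER_MACS.items()}
    print(link_count, "links:", chosen)
```

With these particular addresses and this particular bucket scheme, the 3-link case puts two servers on one link instead of all four, which is better than the plain modulus. A different set of addresses or a different bucket layout could just as easily come out worse.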

The general point is to consider your traffic carefully, and not use a one-size-fits-all approach to configurations.  Then test.  The problem outlined here (removing one link forces too much traffic onto a single member of the aggregation) is one I've encountered in a customer's network.
