Monday, July 20, 2015

Link Aggregation on HP Moonshot - A Neat Trick

The Broadcom switching OS running on HP's Moonshot 45G and 180G switches can do a neat trick[1] that I haven't seen on other platforms.

Background: LACP-Individual
The trick revolves around interfaces that are sometimes aggregated, and sometimes run as individuals. Lots of platforms don't support this behavior. On those platforms, if an interface is configured to attempt aggregation but doesn't receive LACP PDUs, the interface won't forward traffic at all. Less broken platforms make this behavior configurable or have some goofy in-between mode which allows one member of the aggregation to forward traffic.

If the Moonshot were saddled with one of these broken[2] switching OSes, we'd be in a real pickle: Moonshot cartridges (my m300s, anyway) require PXE in order to become operational, and PXE runs in the option ROM of an individual network interface. Even if that interface could form a one-member aggregation, it wouldn't be able to coordinate its operation with the other interface, and neither of their LACP speaker IDs would match the one chosen by the operating system that eventually gets loaded.

I suppose we could change the switch configuration, adding and removing individual interfaces from aggregations depending on the mode the server requires from one moment to the next, but that's pretty clunky, and the standard anticipated this requirement, so why bother?

It's been suggested to me that running a static (non-negotiated) aggregation could be a solution to the problem, but it introduces ECMP hashing toward the server. If we hash PXE traffic (DHCP, TFTP, etc.) so that it's delivered to the wrong NIC, the server won't boot. With the way ECMP decisions get made in the Broadcom switch stack, this suggestion can work, but only if we eliminate all but one link in and out of the chassis. Why live with spanning tree when we've got an expensive L3 switch and lots of physical uplinks here?
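To make the hashing problem concrete, here is a toy sketch of a static two-link aggregation choosing a member link per flow. The hash (XOR of source and destination MAC, modulo the link count) and all of the MAC addresses are assumptions made up for illustration; this is not the algorithm the Broadcom stack actually uses.

```python
# Toy model of a switch hashing frames across a static aggregation.
# ASSUMPTION: the hash below is illustrative only, not Broadcom's algorithm.

def pick_member(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Return the index of the member link chosen for a src/dst MAC pair."""
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % n_links

# Hypothetical addresses: the NIC running PXE sits on member link 0.
PXE_NIC = "02:00:00:00:00:10"
DHCP_SERVER = "02:00:00:00:00:01"
TFTP_SERVER = "02:00:00:00:00:02"

for name, server in (("DHCP", DHCP_SERVER), ("TFTP", TFTP_SERVER)):
    member = pick_member(server, PXE_NIC, n_links=2)
    verdict = "reaches the PXE NIC" if member == 0 else "lands on the wrong NIC"
    print(f"{name} reply hashed to member {member}: {verdict}")
```

With these made-up addresses the TFTP reply happens to reach the booting NIC and the DHCP reply does not, which is exactly the coin toss a PXE boot can't tolerate.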

So, what's the trick?
On most platforms, the configuration applied to an interface (VLANs, mode, spanning tree stuff, etc.) is required to match the configuration of the aggregation. If the individual interface doesn't match the aggregate, it gets suspended.

Broadcom, for some reason, didn't see a need to implement this check. In their network OS, the physical interfaces can be configured completely differently from the aggregation. Like this:

 interface 1/0/1,2/0/1  
  vlan pvid 7  
  vlan participation include 7  
  addport 0/3/1  
 interface lag 1  
  no port-channel static  
  vlan participation include 51,61  
  vlan tagging 51,61  

This configuration makes ports 1/0/1 and 2/0/1 (connected to eth0 and eth1 of the first compute node) access ports in VLAN 7, and makes them want to join aggregation lag 1, which is a trunk carrying VLANs 51 and 61.

If the server sends LACP PDUs, those interfaces aggregate and trunk VLANs 51 and 61. If the server doesn't send PDUs, the interfaces are access ports in VLAN 7.
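The per-port behavior boils down to a two-way switch on whether LACP PDUs are arriving. A minimal sketch, with the VLAN numbers taken from the configuration above and every name (`PortState`, `port_state`) invented for illustration; it ignores whatever the aggregation does with untagged traffic:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# VLAN numbers come from the switch configuration shown earlier.
ACCESS_VLAN = 7
TRUNK_VLANS = (51, 61)

@dataclass(frozen=True)
class PortState:
    mode: str                      # "individual" or "aggregated"
    access_vlan: Optional[int]     # untagged VLAN while running as an individual
    tagged_vlans: Tuple[int, ...]  # VLANs trunked while part of the aggregation

def port_state(lacp_pdus_received: bool) -> PortState:
    """Model which configuration is in effect on one server-facing port."""
    if lacp_pdus_received:
        # The port joins lag 1 and takes on the aggregation's VLAN config.
        return PortState("aggregated", None, TRUNK_VLANS)
    # No LACP partner: the port's own access configuration applies.
    return PortState("individual", ACCESS_VLAN, ())

print(port_state(False))  # PXE option ROM: no LACP, so access port in VLAN 7
print(port_state(True))   # booted OS speaking LACP: trunk for VLANs 51 and 61
```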

Now we can do cool stuff like:

  • Run the cartridges completely diskless. They boot up via PXE in VLAN 7. After loading a kernel and filesystem into memory, the NICs aggregate into a trunk carrying different VLANs.
  • Run servers on boot-up through some PXE-controlled self-tests and patching in VLAN 7, then chain-boot into the real (on disk, perhaps?) OS which will aggregate the interfaces.

In both cases, VLAN 7 is gone from the switch ports once the LACP messages roll in. The only way the server would find VLAN 7 again is to de-aggregate the interfaces.

I briefly considered adding port-channel min-links 2 to the aggregation config. That would ensure the server could see either VLAN 7 or VLANs 51 and 61 at any given moment, but never all three at the same time. It would also kill redundancy, because losing either link would take the whole aggregation down, so that plan is out.
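The min-links trade-off can be sketched as a small model of what the server sees across both NICs. Treat this as a guess at the behavior rather than a description of the switch: the link state names are invented, and it assumes that a link negotiating LACP carries nothing while the aggregation is below its min-links threshold.

```python
ACCESS_VLAN = 7
TRUNK_VLANS = {51, 61}

def visible_vlans(link_states, min_links=1):
    """Return the set of VLANs a server can reach across its links.

    link_states: one entry per link, each "down", "individual" (link up,
    no LACP) or "lacp" (link up and negotiating LACP). Names are made up.
    ASSUMPTION: LACP links pass no traffic while the lag is below min_links.
    """
    lacp_links = sum(1 for state in link_states if state == "lacp")
    vlans = set()
    if any(state == "individual" for state in link_states):
        vlans.add(ACCESS_VLAN)
    if lacp_links > 0 and lacp_links >= min_links:
        vlans |= TRUNK_VLANS
    return vlans

# Default behavior: one NIC aggregated, one not, so all three VLANs are visible.
print(visible_vlans(["lacp", "individual"], min_links=1))
# min-links 2 keeps the VLAN sets separate...
print(visible_vlans(["lacp", "individual"], min_links=2))
# ...but a single link failure now takes the trunk away entirely.
print(visible_vlans(["lacp", "down"], min_links=2))
```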

It's not really a security mechanism, but it does make the access VLAN pretty much inaccessible, and reduces the footprint of that broadcast domain much more than merely making it untagged would do.




[1] Beginning with version 2.0.3.0.
[2] Check out IEEE 802.1AX-2008, section 5.3.9.