Monday, January 23, 2012

Nexus vPC Orphan Ports

"Orphan Port" is an important concept when working with a Cisco Nexus vPC configuration. Misunderstanding this aspect of vPC operation can lead to unnecessary downtime because of some of the funny behavior associated with orphan ports.

Before we can define an orphan port, it's important to cover a few vPC concepts. We'll use the following topology.


Here we have a couple of Nexus 5xxx switches with four servers attached. The switches are interconnected by a vPC peer link so that they can offer vPC (multi-chassis link aggregation) connections to servers. The switches also exchange vPC peer-keepalive traffic over an out-of-band connection.

Let's consider the traffic path between some of these servers:
A->B
A->C
This traffic takes a single hop from "A" to its destination via S1.
B->A
C->A
The path of this traffic depends on which link the server's hashing algorithm chooses. Traffic might go only through S1, or it might take the suboptimal path through S2 and S1 (over the peer link).
B->C
C->B
The path of this traffic is unpredictable, but always optimal. These servers might talk to each other through S1 or through S2, but their traffic will never hit the peer link under normal circumstances.
A->D
D->A
This traffic always crosses the peer link because A and D are active on different switches.
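
The path rules above can be sketched in a few lines of Python. This is only a toy model of the example topology, not anything the switches actually run: the `ACTIVE_LINKS` table and both function names are made up for illustration, and it assumes D's active NIC is the one attached to S2.

```python
# Hypothetical model of the example topology; names are illustrative only.
# For each host: the set of switches on which it has an *active* link.
ACTIVE_LINKS = {
    "A": {"S1"},         # single-attached to S1
    "B": {"S1", "S2"},   # vPC (LACP) to both switches
    "C": {"S1", "S2"},   # vPC (LACP) to both switches
    "D": {"S2"},         # active/standby team, active NIC assumed on S2
}


def possible_paths(src, dst):
    """Return every switch path a frame from src to dst might take."""
    paths = set()
    # The sender's hash may pick any of its active uplinks.
    for ingress in sorted(ACTIVE_LINKS[src]):
        if ingress in ACTIVE_LINKS[dst]:
            # Destination is locally attached: deliver without the peer link.
            paths.add((ingress,))
        else:
            # Destination is active only on the other switch: cross the peer link.
            (egress,) = ACTIVE_LINKS[dst]
            paths.add((ingress, egress))
    return paths


def crosses_peer_link(src, dst):
    """Classify peer-link use as 'never', 'sometimes' or 'always'."""
    hops = [len(path) > 1 for path in possible_paths(src, dst)]
    if all(hops):
        return "always"
    return "sometimes" if any(hops) else "never"


for pair in [("A", "B"), ("B", "A"), ("B", "C"), ("A", "D"), ("D", "A")]:
    print(pair, crosses_peer_link(*pair))
```

Running it reproduces the list above: A->B and B->C never touch the peer link, B->A sometimes does, and anything between A and D always does.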

Definitions:
vPC Primary / Secondary - In a vPC topology (technically a vPC domain), one switch is elected primary and the other secondary, according to configurable priorities and MAC address-based entropy. The priority and role are important in cases where the topology is brought up and down, because they control how each switch will behave in these exceptional circumstances.

vPC peer link - This link is a special interconnection between two Nexus switches which allows them to collaborate in the offering of multi-chassis EtherChannel connections to downstream devices. The switches use the peer link to "get their stories straight" and unify their presentation of the LACP and STP topologies.

The switches also use the peer link to synchronize the tables they use for filtering/forwarding unicast and multicast frames.

The peer link is the centerpiece of the most important thing to know about traffic forwarding in a vPC environment: A packet which ingresses via the peer link is not allowed to egress a vPC interface under normal circumstances.

This means that a broadcast frame from server A will be flooded to B, C and S2 by S1. When the frame gets to S2, it will only be forwarded to D. S2 will not flood the frame to B and C.
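
Here is a rough Python sketch of that flooding rule, using the same example topology. The `PORTS` and `PEER` tables and the `flood` function are hypothetical names invented for this illustration; real switches enforce the rule in hardware, not like this.

```python
# Hypothetical port tables for the example topology (illustrative names).
# Each switch maps a port name to (attached device, is_vpc_member).
PORTS = {
    "S1": {"to_A": ("A", False), "po_B": ("B", True), "po_C": ("C", True)},
    "S2": {"po_B": ("B", True), "po_C": ("C", True), "to_D": ("D", False)},
}
PEER = {"S1": "S2", "S2": "S1"}


def flood(switch, source, via_peer_link=False):
    """Return (switch, device) deliveries for a broadcast frame from `source`."""
    deliveries = []
    for device, is_vpc in PORTS[switch].values():
        if device == source:
            continue  # skip ports facing the sender
        if via_peer_link and is_vpc:
            continue  # vPC rule: peer-link ingress must not egress a vPC port
        deliveries.append((switch, device))
    if not via_peer_link:
        # Send one copy across the peer link; the peer applies the rule above.
        deliveries += flood(PEER[switch], source, via_peer_link=True)
    return deliveries


print(flood("S1", "A"))
```

For a broadcast from A, S1 delivers to B and C, and S2 delivers only to D. Without the `via_peer_link` check, B and C would each receive the frame twice.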

vPC peer keepalive - This is an IP traffic flow configured between the two switches. It must not ride over the peer link. It may be a direct connection between the two switches, or it can traverse some other network infrastructure. The peer keepalive traffic is used to resolve the dual-active scenario that might arise from loss of the peer link.

vPC VLAN - Any VLAN which is allowed onto the vPC peer link is a vPC VLAN.

Orphan Port - Any port not configured as a vPC, but which carries a vPC VLAN. The link to "A" and both links to "D" are orphan ports.
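
The last two definitions boil down to a simple test, sketched below in Python. The `Port` class, its field names, and the VLAN numbers are all assumptions made up for the example; none of this is NX-OS output or an NX-OS API.

```python
# Hypothetical data model; field names are illustrative, not NX-OS output.
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    vlans: set = field(default_factory=set)
    is_vpc_member: bool = False
    is_peer_link: bool = False


def vpc_vlans(ports):
    """vPC VLANs are the VLANs allowed onto the peer link."""
    allowed = set()
    for port in ports:
        if port.is_peer_link:
            allowed |= port.vlans
    return allowed


def orphan_ports(ports):
    """Non-vPC, non-peer-link ports that carry at least one vPC VLAN."""
    shared = vpc_vlans(ports)
    return [
        p.name
        for p in ports
        if not p.is_vpc_member and not p.is_peer_link and p.vlans & shared
    ]


s2_ports = [
    Port("peer-link", {10, 20}, is_peer_link=True),
    Port("po-B", {10}, is_vpc_member=True),
    Port("po-C", {20}, is_vpc_member=True),
    Port("to-D", {10}),
    Port("mgmt-only", {99}),  # VLAN 99 is not on the peer link: not an orphan
]
print(orphan_ports(s2_ports))  # the link to D is the only orphan port here
```

On a real switch you wouldn't compute this by hand: NX-OS has a `show vpc orphan-ports` command that lists them.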

So why do orphan ports matter?
Latency: Traffic destined for orphan ports has a 50/50 chance of winding up on the wrong switch, so it will have to traverse the peer link to get to its destination. Sure, it's only a single extra L2 hop, but it's ugly.

Bandwidth: The vPC peer link ordinarily does not need to handle any unicast user traffic. It's not part of the switching fabric, and it's commonly configured as a 20Gb/s link even if the environment has much higher uplinks and downlinks. Frames crossing the peer link carry an extra header (this is how S2 knows not to flood the broadcast to B and C in the previous example) and can possibly overwhelm the link. I've only ever seen this happen in a test environment, but it was ugly.

Shutdown: This is the big one. If the peer link is lost, bad things happen. The vPC secondary switch (probably the switch that rebooted last, not necessarily the one you intend) will disable all of his vPC interfaces, including the link up to the distribution or core layers. In this case, server D will be left high-and-dry, unable to talk to anybody. Will server D flip over to his alternate NIC? Probably not. Most teaming schemes decide to fail over based on loss of link, and D's primary link will not go down: it is an orphan port, not a vPC interface, so the secondary switch leaves it up.

If the switches are layer-3 capable, the SVIs for vPC VLANs will go down too, leaving orphan ports unable to route their way out of the VLAN as well.
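
To make the failure case concrete, here is a toy Python model of what one switch does when the peer link goes away. It is a sketch under stated assumptions: the peer-keepalive is still working (so the secondary knows the primary is alive), S2 holds the secondary role, the uplink is a vPC, and every name in it (`peer_link_failure`, `po-uplink`, VLAN 10) is hypothetical.

```python
# Hypothetical model of the failure case; all names are illustrative.
def peer_link_failure(role, ports, svis):
    """Return (port_state, svi_state) on one switch after the peer link is lost.

    Assumes the peer-keepalive is still up, so the secondary knows the
    primary is alive and shuts down its vPC side.
    `ports` maps port name -> is_vpc_member; `svis` maps VLAN -> is_vpc_vlan.
    """
    secondary = role == "secondary"
    port_state = {
        name: "down" if (secondary and is_vpc) else "up"
        for name, is_vpc in ports.items()
    }
    svi_state = {
        vlan: "down" if (secondary and is_vpc_vlan) else "up"
        for vlan, is_vpc_vlan in svis.items()
    }
    return port_state, svi_state


# S2 in the example topology, assumed here to hold the secondary role.
s2_ports = {"po-B": True, "po-C": True, "po-uplink": True, "to-D": False}
port_state, svi_state = peer_link_failure("secondary", s2_ports, {10: True})
print(port_state)
print(svi_state)
```

The interesting entry is `to-D`: it stays "up" while the uplink, the vPCs and the SVI all go "down", which is exactly why a link-state-based teaming driver on D never fails over.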

No Excuse
There are configuration levers that allow us to work around these failure scenarios, but I find it easier to just avoid the problem in the first place by deploying everything dual-attached with LACP. Just don't create orphan ports.

We're talking about the latest and greatest in Cisco data center switching. It's expensive stuff, even on a per-port basis. Almost everything in the data center can run LACP these days (Solaris 8 and VMware being notable exceptions), so why not build LACP links?

12 comments:

  1. Hi Chris,

    Great article - really clarified my understanding of vPC peer-link and loop-prevention. I don't see this changing much with server vPC connected through FEXes, but how does FEX vPC change this behaviour?

    Cheers, and thanks again!

    Jon. (@xanthein)

  2. Excellent article. However, for the active/passive server (D), several vendors provide a tool that pings an address to determine whether the network is still reachable via the NIC, even though the link status is up. Very useful for blade servers.
    An example: for Broadcom, it is LiveLink.

  3. @Jon Still, You're welcome, thanks for taking the time to let me know the article was helpful.

    FEX doesn't change things much
    - FEX w/ single upstream 5K - we can just think of the FEX ports as 5K ports, nothing changes.
    - FEX w/ dual upstream 5K - think of the FEX (and attached servers) as working just like "B" and "C" in the example.
    - FEX w/ dual-Layer vPC - I haven't seen the new software yet, but we're supposed to be able to do vPC FEX with vPC host connections soon. It sounds like the best of both worlds!

  4. @Sebastien,

    Yeah, Solaris with IPMP (pings the router) and ESX "beacon probe" (requires >2 NICs to be effective) are the schemes I had in mind when I said "most" teaming schemes rely on link state.

    Any idea how LiveLink works? A cursory googling hasn't turned up clues about the criteria it uses to detect network health. Whether or not it would be able to make a good decision in this case remains an open question in my mind.

    Either way, these are (probably) slow and (definitely) clunky mechanisms compared to LACP, which takes me back to my last point: Why *wouldn't* you use LACP?

  5. Thanks for the posts, hope to see more on DC stuff ;)

  6. Thanks for the post. Perhaps we can also add XenServer to the list of hosts that don't do LACP. It is possible to enable LACP, but it is not a setup officially supported by Citrix. Please correct me if this has changed.

    As a side note, there is a bug in the Linux kernel bonding driver (vanilla kernels up to 3.2) in its LACP handling: even suspended and unidirectional links (where the switch can't hear LACPDUs from the server) are enabled for forwarding. Details at http://marc.info/?l=linux-netdev&m=131852651422444&w=2.

  7. Just curious.

    What if it is two 5Ks connected to two 7Ks that are running vPC, with each FEX single-homed to one 5K?

    Will we still face this problem?

  8. The 5Ks are dual homed btw. Thanks

  9. As an update, VMware vSphere 5.1 will support LACP on its Virtual Distributed Switch (vDS)! Hooray!

  10. Today I found that on a Nexus 7000 with NX-OS 5.2(4), if you create an orphan port on the secondary vPC peer, the switch will not learn MAC addresses on that orphan port.

  11. Interesting. I haven't noticed this behavior, but tend to stick to the "don't create orphans" philosophy, so it wouldn't have come up.

    Is this a bug, or does the behavior make sense for some reason?

  12. The Emulex card in the DL380 doesn't support iSCSI with LACP. Is there another workaround?
