Thursday, January 26, 2012

Building Nexus vPC Keepalive Links

There's some contradictory and unhelpful information out there on vPC peer keepalive configuration. This post is a bit of a how-to, loaded with my opinions about what makes sense.

What Is It?
While often referred to as a link, vPC peer keepalive is really an application data flow between two switches. It's the mechanism by which the switches keep track of each other and coordinate their actions in a failure scenario.

Configuration can be as simple as a one-liner in vpc domain context:
vpc domain <domain-id>
  peer-keepalive destination <peer-ip-addr>
Cisco's documentation recommends that you use a separate VRF for peer keepalive flows, but this isn't strictly necessary. What's important is that the keepalive traffic neither traverses the vPC peer-link nor uses any vPC VLANs.
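The full syntax also accepts a source address and a VRF name. A sketch, assuming a dedicated VRF named 'keepalive' (the VRF name is arbitrary, and the VRF must already exist):
vpc domain <domain-id>
  peer-keepalive destination <peer-ip-addr> source <local-ip-addr> vrf keepalive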

The keepalive path can be a simple L2 interconnect directly between the switches, or it can traverse a large routed infrastructure. The only requirement is that the switches have IP connectivity to one another via non-vPC infrastructure. There may also be a latency requirement - vPC keepalive traffic maintains a pretty tightly wound schedule. Because the switches in a vPC pair are generally quite near to one another, I've never encountered any concerns in this regard.

What If It Fails?
This isn't a huge deal. A vPC switch pair will continue to operate correctly if the vPC keepalive traffic is interrupted. You'll want to get it fixed because an interruption to the vPC peer-link without vPC keepalive data would be a split-brain disaster.
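When testing (or troubleshooting), the state of the flow is visible on either peer:
show vpc peer-keepalive
show vpc brief
The first command reports the keepalive status, the source and destination addresses, the VRF in use, and the timers; the second summarizes overall vPC health, including the keepalive state.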

Bringing a vPC domain up without the keepalive flow is complicated. This is the main reason I worry about redundancy in the keepalive traffic path. Early software releases wouldn't bring the domain up at all. In later releases, configuration levers were added (and renamed!?) to control the behavior. See Matt's comments here.

The best bet is to minimize the probability of an interruption by planning carefully, thinking about the impact of a power outage, and testing the solution. Running the vPC keepalive over gear that takes 10 minutes to boot up might not be the best idea. Try booting up the environment with the keepalive path down. Then try booting up just half of the environment.

vPC Keepalive on L2 Nexus 5xxx
The L2 Nexus 5000 and 5500 series boxes don't give you much flexibility. Basically, there are two options:
  1. Use the single mgmt0 interface in the 'management' VRF. If you use a crossover cable between chassis, then you'll never have true out-of-band IP access to the device, because all other IP interfaces exist only in the default VRF, and you've just burned up the only 'management' interface. Conversely, if you run the mgmt0 interface to a management switch, you need to weigh the failure scenarios and boot-up times of your management network. Either way, the keepalive traffic has a single point of failure, because you've only got one mgmt0 interface to work with.
  2. Use an SVI and VLAN. If I've got 10Gb/s interfaces to burn, this is my preferred configuration: Run two twinax cables between the switches (parallel to the vPC peer-link), EtherChannel them, and allow only non-vPC VLANs onto this link. Then configure an SVI for keepalive traffic in one of those VLANs.
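For reference, option 1 comes out looking something like this (addresses invented for illustration; mgmt0 lands in the 'management' VRF automatically):
interface mgmt0
  ip address 192.168.100.11/24
vpc domain 25
  peer-keepalive destination 192.168.100.12 source 192.168.100.11 vrf management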

vPC Keepalive on L3 Nexus 55xx
A Nexus 5500 with the L3 card allows more flexibility: VRFs can be created and interfaces assigned to them, allowing you to put keepalive traffic on a redundant point-to-point link while keeping it in a dedicated VRF, as Cisco recommends.
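A sketch of that arrangement, with the VRF name, port-channel number, and addressing all invented for illustration:
vrf context keepalive
interface port-channel10
  no switchport
  vrf member keepalive
  ip address 169.254.25.1/16
vpc domain 25
  peer-keepalive destination 169.254.25.2 source 169.254.25.1 vrf keepalive
Two physical interfaces would join port-channel 10 via 'channel-group 10 mode active' to keep the link redundant.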

vPC Keepalive on Nexus 7000
The N7K allows the greatest flexibility: use management or transit interfaces, create VRFs, etc... The key thing to know about the N7K is that if you choose to use the mgmt0 interfaces, you must connect them through an L2 switch. This is because there's an mgmt0 interface on each supervisor, but only one of them is active at any moment. The only way to ensure that both mgmt0 interfaces on switch "A" can talk to both mgmt0 interfaces on switch "B" is to connect them all to an L2 topology.

The two mgmt0 interfaces don't back each other up. It's not a "teaming" scheme. Rather, the active interface is the one on the active supervisor.

IP Addressing
Lots of options here, and it probably doesn't matter what you do. I like to configure my vPC keepalive interfaces at 169.254.<domain-id>.1 and 169.254.<domain-id>.2 with a 16-bit netmask.

My rationale here is:
  • The vPC keepalive traffic is between two systems only, and I configure them to share a subnet. Nothing else in the network needs to know how to reach these interfaces, so why use a slice of routable address space?
  • 169.254.0.0/16 is defined by RFC 3330 as the "link local" block, and that's how I'm using it. By definition, this block is not routable, and may be re-used on many broadcast domains. You've probably seen these numbers when there was a problem reaching a DHCP server. The switches won't be using RFC 3927-style autoconfiguration, but that's fine.
  • vPC domain-IDs are required to be unique, so by embedding the domain ID in the keepalive interface address, I ensure that any mistakes (cabling, etc...) won't cause unrelated switches to mistakenly identify each other as vPC peers, have overlapping IP addresses, etc...
The result looks something like this:
vpc domain 25
  peer-keepalive destination 169.254.25.2 source 169.254.25.1 vrf default
vlan 2
  name vPC_peer_keepalive_169.254.25.0/16
interface Vlan2
  description vPC Peer Keepalive to 5548-25-B
  no shutdown
  ip address 169.254.25.1/16
interface port-channel1
  description vPC Peer Link to 5548-25-B
  switchport mode trunk
  switchport trunk allowed vlan except 1-2
  vpc peer-link
  spanning-tree port type network
  spanning-tree guard loop
interface port-channel2
  description vPC keepalive link to 5548-25-B
  switchport mode trunk
  switchport trunk allowed vlan 2
  spanning-tree port type network
  spanning-tree guard loop
interface Ethernet1/2
  description 5548-25-B:1/2
  switchport mode trunk
  switchport trunk allowed vlan 2
  channel-group 2 mode active
interface Ethernet1/10
  description 5548-25-B:1/10
  switchport mode trunk
  switchport trunk allowed vlan 2
  channel-group 2 mode active
The configuration here is for switch "A" in the 25th pair of Nexus 5548s. Port-channel 1 on all switch pairs is the vPC peer link, and port-channel 2 (shown here) carries the peer keepalive traffic on VLAN 2.

29 comments:

  1. This comment has been removed by the author.

  2. Great post!

    Have you thought about using "dual-active exclude interface-vlan 2"?

  3. Hey!

    I'm quite new to this Nexus world and your nice posts are exactly what I need!
    Thanks a lot!

    Best Regards,
    Johan

  4. @Johan - You're welcome, thanks for letting me know that the post was helpful!

    @Markku - Vlan 2 isn't a "vPC VLAN" because it's not allowed onto Po1. I expect it to be immune from the sort of shutdown that 'dual-active exclude' would protect it from (says chris without testing).

  5. Oops, that's right, I missed the peer link "except" configuration.

  6. Probably a dumb question (get used to it from me), but why are you running Loopguard? Doesn't Bridge Assurance take its place (and do everything "better")?

  7. Hey Colby, that's a really *good* question. ...And not your first one, I might add :-)

    I've been pondering it for a couple of days, and I think you're probably right.

    I'd been thinking of Bridge Assurance mainly as a mechanism for upstream switches to monitor downstream switches, because that's the new functionality introduced by BA. But as you point out, it works in both directions. Upstream and downstream switches monitor *each other* with BA.

    Loop Guard OTOH only specifies the response of the downstream switch in the face of loss of the upstream.

    I'm thinking you're probably right. Loopguard doesn't seem to be adding anything to this configuration.

    Thanks for pointing that out!

  8. if you are to use rfc3330 for PeerKeepAlive, then you just found your reason why you want it in a specific VRF other than default: just to make sure your routing table in default will never have these routes.

    In the future, if you wish to start another protocol in the default, such as LDP or an IGP, it'll be cleaner.


    dan

  9. "if you are to use rfc3330 for PeerKeepAlive, then you just found your reason why you want it in a specific VRF"

    On Nexus 7K there's no reason not to use a "keepalive" VRF. ...Though I don't see much downside to having these special addresses in the default VRF, I definitely wouldn't allow them to propagate via a routing protocol.

    On Nexus 5K, it's more of a problem. Whatever address you use, it'll land in the default VRF (unless you use the mgmt0 interface), and it won't be in the IGP (unless you're using an L3 5K -- a box for which I have yet to see a need).

  10. The problem with using an SVI for keepalive is the inability to upgrade using ISSU. As I understand it, the VLAN is local to the 5Ks, and one of them will be root and will have a non-edge designated port. This will cause ISSU to be disruptive.

    1. I know it's old, but you can set the interface that the PK VLAN is running over to spanning-tree port type edge and enable BPDU filtering to allow for an ISSU upgrade with the SVI PK method.
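
      Against the article's example config, that tweak would look something like this (untested):
      interface port-channel2
        spanning-tree port type edge trunk
        spanning-tree bpdufilter enable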

  11. I'm definitely new to the 7Ks and had a question regarding VDCs and how the keepalive link works. In short, do I need to use a separate physical link for each VDC instance for a keepalive link? I know that each VDC requires physical vPC peer-links (e.g. 2 physical links for VDC 1 and 2 physical links for VDC 2), but can I use one keepalive link to manage both VDCs' vPCs? Hope that question makes sense. Thanks for the help

  12. VDCs own physical ports, so if you're using line card ports, then yes: physical port/cable per VDC.

    OTOH, if you run keepalive traffic over the mgmt0 links (through an L2 infrastructure!) then the VDCs can all share one (or two) cables.

  13. I ran into a weird bug when configuring our Nexus 5548's referenced here:
    https://supportforums.cisco.com/thread/2151975

    The reason why I bring it up is because I wasn't aware of the caveat of not having management traffic traverse the VPC Peer-Link and your VPC information saved me. You're right, your write-up is the only one that is actually accurate, there's a TON of incorrect information out there. Just wanted to say thanks!

  14. Hey Ethan,

    Thank you for taking the time to let me know you appreciate this post!

    /chris

  15. fyi: a dedicated vrf can be created and used on the vpc keepalive link svi on a layer 2 only nexus 5k

  16. "a dedicated vrf can be created and used on the vpc keepalive link svi on a layer 2 only nexus 5k"

    Something new then? This definitely wasn't possible back when I was working on these boxes.

  17. Chris, I was just listening to the Packet Pushers podcast episode you guested on for the Nexus deep-dive. I really enjoyed your input. In the episode you recommended using the management interfaces for building the peer keepalive; however, in this article you recommend using a 10-gig port cross-connect with a keepalive VLAN and SVI. Also, if I have a pair of Nexus 5648s: while I can now create a keepalive-specific VRF, should I waste a 40-gig port on each or just use the management ports?

    1. I changed my thinking on this point between this post and the PP recording. The big problem with a VLAN for keep-alive is the headache it creates for ISSU.

      Given the safety mechanisms now available at vPC start-up time, the big risks associated with split-brained-ness after a DC blackout can be mitigated without resorting to back-to-back for vPC keepalive traffic.

      These days, I'd just push the keepalive traffic through your L2 management infrastructure and not worry too much about it. Two tips:
      1) test bringing up both peers with the management ports offline
      2) try to land the management cables on the same management switch/card/asic

    2. Thanks Chris, I really appreciate your response and input to my question. Have you since done any other podcasts where you discuss the Nexus platform? If not could you steer me to any such podcast discussions? I'm trying to design a 5648 w/ 2248 Fex distribution layer. We don't have any 7K's so it's been interesting to say the least.

    3. I think you've found all of my Nexus stuff. Big decision for you is going to be whether you dual-home the FEXen. This decision will probably hinge on the equipment this network will be supporting, and its redundancy needs.

  18. Sorry, one more question: in the PP deep dive, you mention that another port-channel between the two Nexus chassis is required for non-vPC VLANs in an active/standby fail-over situation. Assuming then that aside from the peer-link, will I need to also make a cross-connect between the 5Ks and configure on it a port-channel that will accommodate the non-vPC VLANs (i.e. those VLANs not allowed over the peer-link)?

    1. *IF* you want to have non-vPC VLANs, then they'll need their own cross connect.

      Having non-vPC VLANs isn't a foregone conclusion, however. It depends on your requirements.

  19. Hi everyone

    can someone explain why we place the peer-keepalive in the management VRF, but not the peer-link?

    1. Peer keepalive is an application flow. It has source and destination IP addresses. IP addresses/interfaces/tables are things that make sense in the context of VRF.

      The peer link, on the other hand, is an L2-only construct. It knows nothing about IP addresses. Heck, the peer link and its associated vPCs might be transporting something other than IP traffic. VRF isn't a concept that really applies to a physical link (which might be doing transport for several VRFs.)

  20. Great article,
    Maybe someone can solve a problem for me, involving vPCs and private VLANs.
    At my company we use vPCs between Nexus 5K (vPC) and ASR9K (MLACP). The vPC port-channels are configured as trunks with VLANs allowed (LACP).
    Now we want to change it to "switchport mode private-vlan trunk promiscuous".
    I have done some testing; when removing the configuration and pasting the private-vlan config, there is an outage.

    All the techniques, like "Graceful Consistency Check" and "lacp suspend-individual", are in place.

    Is there a way to change this configuration without an outage?
    Is the solution in option 2, "Use an SVI and VLAN", of the above article?

    I really appreciate a reply.

  21. I just wanted to add a bit of clarification regarding your use of link-local addressing on the vPC keepalive link, since, while it works, I was not sure whether it is RFC-compliant. As you mentioned, you're not using RFC3927 auto-configuration. RFC3927 states the following:

    1.6. Alternate Use Prohibition

    Note that addresses in the 169.254/16 prefix SHOULD NOT be configured
    manually or by a DHCP server.
    ...
    Administrators wishing to configure their own local addresses (using
    manual configuration, a DHCP server, or any other mechanism not
    described in this document) should use one of the existing private
    address prefixes [RFC1918], not the 169.254/16 prefix.

    The rationale stated in the RFC is that this could cause the host not to follow special rules regarding duplicate detection and auto-configuration. However, this isn't relevant to our particular use case, because this is a direct link with no other devices on the same L2 segment. No other devices should ever exist on this segment, and therefore duplicate detection should not be required.

    RFC2119 states the following about the wording “SHOULD NOT”:

    SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
    there may exist valid reasons in particular circumstances when the
    particular behavior is acceptable or even useful, but the full
    implications should be understood and the case carefully weighed
    before implementing any behavior described with this label.

    Therefore, since we understand the full implications of the link local address space, I believe using link local address space on vPC keepalive links (or other similar links) is a valid use of the address space as per the RFC.

  22. Hi Matt,

    Thanks for your thoughts on the matter. I did similar research, came to the same conclusion.

    Much of my decision in this regard hinges on the fact that 169.254/16 was designated for "link local" use long before RFC3927 came around. The document (dated 2005) mentions the fact that Win98 was already using this address block. RFC3927 didn't invent the block; it's merely trying to improve interoperability between devices auto-configuring themselves within the same broadcast domain.

    One other detail: I don't think the network devices in question actually know that this /16 is special. Of course, that may change :)

    1. All great points. Also, thanks for the great post! This post definitely came in handy on a Nexus project I was working on.
