Friday, October 1, 2010

vPC failure scenario

Cisco Nexus vPC operation is well documented all over the 'net, so I won't be covering the basics here.  Instead, I want to focus on a particular failure scenario, in which the vPC safety mechanisms can indefinitely prolong downtime when other failures have occurred.

Consider the following topology:

Nothing fancy is going on here.  Nexus 5000s have 3 downstream (south-facing?) vPC links and a redundant vPC peer-link.  The management interface on each Nexus is doing peer-keepalive duty through a management switch.

My previous builds have been somewhat paranoid about redundancy of the peer-keepalive traffic, but I no longer believe that's helpful, and I'll be doing keepalive over the non-redundant mgmt0 interface going forward.
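
A minimal sketch of that arrangement looks something like this (the addressing is hypothetical; mgmt0 lives in the management VRF by default):

  interface mgmt 0
    ip address 192.168.100.1/24

  vpc domain 1
    peer-keepalive destination 192.168.100.2 source 192.168.100.1 vrf management

The peer gets the same configuration with the source and destination addresses swapped.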

Each Nexus knows to bring up its vPC member links because the peer-link is up, so the activity can be coordinated between chassis.  If the peer-link fails, the Nexus pair can still coordinate vPC forwarding behavior by tracking each other's state over the peer-keepalive management network.

If a management link (or the whole management switch) were to fail, then no problem.  It's the state-only backup to the peer-link, and not required to forward traffic.

If a whole Nexus switch fails, the surviving peer will detect the complete failure of his neighbor, and continue forwarding traffic for the vPC normally.

When the failed Nexus comes back up, he waits until he's established communication with the survivor before bringing up the vPC member links, because it wouldn't be safe to bring up aggregate link members without knowing what the peer chassis is doing.

...And that brings us to the interesting failure scenario:  Imagine that a power outage strikes both Nexus 5Ks, but only one of them comes back up.  The lone chassis can't reach his peer over the peer link or the peer-keepalive link.  He's got no way of knowing whether it's safe to bring up the vPC member links, so they stay down.
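
If you're staring at that lone chassis wondering why nothing is forwarding, the usual show commands (output omitted here) tell the story: show vpc reports the peer and keepalive status, and show port-channel summary shows the vPC member port-channels sitting down.

  show vpc
  show vpc peer-keepalive
  show port-channel summary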

If this happened to me in production, I'd probably do two things to bring it back online:

  1. Take steps to ensure the failed box can't come back to life.  How do you kill a zombie switch, anyway?
  2. Remove the vpc statement from each vPC port-channel interface, something like the sketch below.
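
A sketch of that second step, assuming hypothetical port-channel/vPC numbers 101-103, one per vPC in the diagram:

  interface port-channel 101
    no vpc 101
  interface port-channel 102
    no vpc 102
  interface port-channel 103
    no vpc 103

With the vpc statements removed, those port-channels should come up as ordinary local aggregates on the surviving switch. They'll want their vpc statements back once the dead peer is replaced.
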
Nortel had a similar problem with their RSMLT mechanism, but that deadlock centered around keeping track of who owns the first-hop gateway address (not HSRP/VRRP/GLBP).  They solved it by recording responsibility for the gateway address into NVRAM (flash? spinning rust?  wherever it is that Nortel records such things).

12 comments:

  1. err-disable recovery might help you. It's in the N7K. I guess it'll come to the N5K one day.

    http://www.cisco.com/en/US/partner/docs/switches/datacenter/sw/4_2/nx-os/interfaces/command/reference/if_commands.html#wp1359960

  2. Hi, Dan.

    I haven't tested this, but I'm not confident that this will help. The scenario described here doesn't smell like an err-disabled port to me.

  3. I'm trying to design my first N7K + N5K + N2K deployment and my first use of vPC, and after reading all the design guides and playing in the lab I came up with the same conclusion about vPC not coming up if only one switch returns after a power outage. This is rather terrifying. During a power outage, staff are typically overloaded already. The last thing we want is more things we have to manually mess with during the outage.

    I was also pondering redundancy for the vPC keepalives. At first I came to the same conclusion as you, but then I got to thinking about the power outage scenario. What if there's a power outage and both vPC peers return, but the (usually low-end, often single-power-supply) management switch does not return? Now we're stuck with the vPCs down. This makes me think it is probably better to put the vPC keepalive on a direct link between the switches, or even a redundant 2x bundle directly between peers. Yes, I feel silly using 2x 10gig links for keepalives, but it sounds like this is the way to go for maximum reliability/survivability.

  4. Hi Matt,

    The failure scenario is a scary one, but will be addressed in an upcoming release, according to rumors I've heard. I have no details on that.

    My experience with the failure scenario was actually my first N5K production deployment: One of the switches was DOA (twice!), and I couldn't bring up the environment on schedule.

    For my customers, this failure mode isn't too big of a concern: If both members of a vPC go down, big DR/BCP wheels start turning, and processing is migrated to a DR facility. These customers wouldn't roll back into the primary site with half of a switching tier down anyway, so the inability to bring a vPC back up isn't a show-stopper.

    The vPC keepalive link is already a redundant piece. I'm pretty sure that the vPC will come up even with the keepalive link down (so long as the peer link is up).

  5. By default, vPC will not come up without keepalives. All I have to do is admin shutdown my keepalive links and reboot both switches. The vPC will never be restored and any client on a "redundant" vPC link will just stay down forever.
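
    For anyone who wants to reproduce that in the lab (assuming the keepalive rides mgmt0), the test amounts to something like this on both peers:

    configure terminal
    interface mgmt 0
      shutdown
    end
    reload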

    I spent last week in San Jose, including 2 full days at Cisco. During that time, I expressed my concern about the issue of vPC restoration after a power event in the DC. They said that the power issue was resolved recently in code that is now out. I did some searching, and I believe "auto-recovery" is the command. I have not yet had a chance to test this in the lab.

    This would also appear to resolve my concerns about making the vPC keepalive link redundant. I was going to dedicate 2 10g links between vPC pairs for keepalive redundancy, because I feared that the management switch (typically single power sourced) might not come back after a power event, leaving all my vPCs down. If they can auto-restore vPCs using this command even when the management/keepalive network is down, then my fear of using the management port goes away.

    I'll try to get this tested next week and report back.

    Many of my customers have DR/GSLB, but of course that doesn't stop them from demanding SLA credits for every minute that any location is down. Having both switches reboot due to a power blip is painful. Having service not immediately come back (due to one-side hardware/power failure) could easily increase my SLA credits by hundreds of thousands of dollars.

  6. Hi Matt,

    vPC won't come up if the keepalive link is down? Huh, I wonder where I collected the wrong impression about that.

    vPC will continue to operate with a failed keepalive link though, right?

    So, we have two power restoration failure vectors:
    - failure of one switch
    - failure of the keepalive path (assuming use of the mgmt0 ports)

    I've just perused the documentation for 'auto-recovery', which indicates it will handle the "dead peer" power restoration scenario.

    Will it handle the "dead keepalive path" scenario? Please report back. I don't have any Nexus available to me right now.

    Thanks!

  7. After some testing of my own and some meetings with Cisco, it is confirmed.

    After a power loss, if the peer is completely dead (no keepalive connection, and link down on the peer-link), then auto-recovery will (eventually) bring the vPCs up on the single surviving switch.

    After a power loss, if both Nexus switches are fine, the peer-link is up, but keepalive is down (for example, if you put it on mgmt0 and your management switch never returned), vPC will never recover.

    Because of this, it seems to me that the only safe thing to do is to use a direct cable between vPC pairs. I'm assuming that it is unlikely that you'll have a power event AND the failure of this direct cable between switches at the same time. If you're worried about that, use 2 ports in an EtherChannel.
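
    A sketch of what that direct keepalive link could look like, with the keepalive in its own VRF so it never rides the peer-link (interface numbers, VRF name, and addressing are all made up, and this assumes a box that can put an IP address on front-panel ports):

    vrf context vpc-keepalive
    interface ethernet 1/31-32
      no switchport
      channel-group 100 mode active
    interface port-channel 100
      no switchport
      vrf member vpc-keepalive
      ip address 10.255.255.1/30
    vpc domain 1
      peer-keepalive destination 10.255.255.2 source 10.255.255.1 vrf vpc-keepalive

    The peer gets the mirror-image config with the two addresses swapped.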

    If the peer keepalive link fails, and nothing else does, you're fine for now, but you NEED to make it a priority to get that link back ASAP. While it is down, your switches will not survive a power loss, a simultaneous crash, or any other scenario where they both reboot.

    I expressed my unhappiness with this to everyone at Cisco. I suggested that perhaps a "peer-keepalive ... secondary" command would be one way to resolve this. That way, I could put the normal peer-keepalive on the management network, and then put a secondary peer-keepalive on the default vrf riding across the VLANs on the peer-link. Yes, I know putting the peer-keepalive on the peer-link is a bad idea, but putting it there as a backup path sure beats the entire switch pair being down.

  8. What you are looking for is a feature that was added in the NX-OS 5 code: auto-recovery.
    If, after a default of 4 minutes, the primary vPC Nexus does not see a secondary, it brings its vPCs up anyway.
    I had a pair of 5548s for testing and this command works like a charm.

    Here it is in my config.

    vpc domain 1
      peer-keepalive destination A.B.C.D
      auto-recovery

  9. Robert: auto-recovery will only kick in if the keepalive doesn't establish AND the peer link is down.

    Imagine you put your keepalive on mgmt0 to a small 2960 or something similar for your management network. Now imagine there is a power outage. After the power outage, both Nexus switches boot back up, but your management switch does not.

    In this situation, the Nexus switches will not be able to establish keepalives, but they will see their peer-link as up. Because the peer-link is up, auto-recovery will not kick in. The VPCs will stay down forever.

    For this reason, I see putting keepalives on the management interface as a terrible idea. I'm doing 4x10gig links between every VPC pair I deploy. Two of those links are for the VPC peer link, and two are for the VPC keepalive link. It sure seems wasteful, but having the switches both not recover from a power outage just because one piece of hardware failed is unacceptable.

    Datacenter power outages are typically followed by a mad scramble because some percentage of the hardware (that has worked fine for years) never comes back. That situation is already a disaster. I don't want it to be worse because the one management switch's death caused every single Nexus in my DC to not come back, requiring manual configuration changes to every switch to restore service... I need the network to come back as best it can as quickly as it can, leaving us to focus on the hardware that actually failed.

    On a somewhat unrelated note, I'm also terribly annoyed that they appear to have implemented auto-recovery as two different commands on the 5k and 7k platforms.

    On 7k:

    vpc domain 1
      reload restore delay 240

    On 5k:

    vpc domain 2
      auto-recovery reload-delay 240

    As far as I know, they both do the exact same thing. It was just implemented as different commands on different platforms for some reason.

  10. Using the mgmt1 port for a backup keepalive that is connected back-to-back seems like a good idea. Wonder why they haven't implemented that.

  11. The mgmt1 port doesn't exist. There are some extra ports physically present on the chassis, but they don't do anything.

  12. Matt, you probably know this already, but the 7K now has the auto-recovery command, which replaces reload restore. Effective in NX-OS 5.2(1). On the 5K, auto-recovery became available in version 5.0(2)N2(1).

    Anyone know the effect of a down peer-link after a reboot? I assume the vPCs will stay down on both switches, since no vPC consistency check can be made. Also, I was wondering what happens when both the peer-link and the keepalive die during normal operations (with no reboot)? I assume all vPC ports will suspend on both switches, since neither switch knows whether the other is alive.
