Comments on Fragmentation Needed: vPC failure scenario
12 comments, newest first. Feed last updated 2024-03-24T23:19:30.504+00:00.

Anonymous, 2012-05-22T05:05:07.395+01:00

Matt, you probably know this already, but the 7K now has the auto-recovery command, which replaces reload restore, effective in NXOS 5.2(1). On the 5K, auto-recovery became active in version 5.0(2)N2(1).

Does anyone know the effect of a down peer-link after a reboot? I assume the vPCs will stay down on both switches, since no vPC consistency check can be made. Also, I was wondering what happens when both the peer-link and the keepalive die during normal operations (with no reboot). I assume all vPC ports will suspend on both switches, since one switch does not know if the other is alive.

chris marget, 2012-02-13T17:12:42.501+00:00

The mgmt1 port doesn't exist. There are some extra ports physically present on the chassis, but they don't do anything.

Anonymous, 2012-02-13T16:54:50.385+00:00

Using the mgmt1 port for a backup keepalive that is connected back-to-back seems like a good idea. Wonder why they haven't implemented that.

Matt, 2011-07-08T20:29:31.619+01:00

Robert: auto-recovery will only kick in if the keepalive doesn't establish AND the peer link is down.

Imagine you put your keepalive on mgmt0 to a small 2960 or something similar for your management network. Now imagine there is a power outage. After the power outage, both Nexus switches boot back up, but your management switch does not boot up.

In this situation, the Nexus switches will not be able to establish keepalives, but they will see their peer-link as up. Because the peer-link is up, auto-recovery will not kick in. The vPCs will stay down forever.

For this reason, I see putting keepalives on the management interface as a terrible idea. I'm doing 4x10gig links between every vPC pair I deploy. Two of those links are for the vPC peer link, and two are for the vPC keepalive link. It sure seems wasteful, but having the switches both not recover from a power outage just because one piece of hardware failed is unacceptable.

Datacenter power outages are typically followed by a mad scramble because some percentage of the hardware (that has worked fine for years) never comes back. That situation is already a disaster. I don't want it to be worse because the one management switch's death caused every single Nexus in my DC to not come back, requiring manual configuration changes to every switch to restore service... I need the network to come back as best it can as quickly as it can, leaving us to focus on the hardware that actually failed.

On a somewhat unrelated note, I'm also terribly annoyed that they appear to have implemented auto-recovery as two different commands on the 5k and 7k platforms.

On 7k:

    vpc domain 1
     reload restore delay 240

On 5k:

    vpc domain 2
     auto-recovery reload-delay 240

As far as I know, they both do the exact same thing. It was just implemented as different commands on different platforms for some reason.

robert rowland, 2011-03-20T14:18:23.879+00:00

What you are looking for is a feature that was added to NXOS 5 code: auto-recovery.
If, after a default delay of 4 minutes, the primary vPC Nexus does not see a secondary, it will carry on as though it did.
I had a pair of 5548s for testing and this command works like a charm.

Here it is in my config.

    vpc domain 1
     peer-keepalive destination A.B.C.D
     auto-recovery

Unknown, 2011-02-12T00:36:24.711+00:00

After some testing of my own and some meetings with Cisco, it is confirmed.

After a power loss, if the peer is completely dead (no keepalive connection, and link is down on the peer-link), then auto-recovery will (eventually) bring the single remaining switch online.

After a power loss, if both Nexus switches are fine and the peer-link is up, but the keepalive is down (for example, if you put it on mgmt0 and your management switch never returned), vPC will never recover.

Because of this, it seems to me that the only safe thing to do is to use a direct cable between vPC pairs. I'm assuming that it is unlikely that you'll have a power event AND the failure of this direct cable between switches at the same time. If you're worried about that, use 2 ports in an Etherchannel.

If the peer keepalive link fails, and nothing else does, you're fine for now, but you NEED to make it a priority to get that link back ASAP. Once it is down, your switches will not survive a power loss, both crashing, or any other scenario where they both reboot.

I expressed my unhappiness with this to everyone at Cisco. I suggested that perhaps a "peer-keepalive ... secondary" command would be one way to resolve this. That way, I could put the normal peer-keepalive on the management network, and then put a secondary peer-keepalive on the default vrf riding across the VLANs on the peer-link. Yes, I know putting the peer-keepalive on the peer-link is a bad idea, but putting it there as a backup path sure beats the entire switch pair being down.

chris marget, 2011-01-17T15:21:32.179+00:00

Hi Matt,

vPC won't come up if the keepalive link is down? Huh, I wonder where I collected the wrong impression about that.

vPC will continue to operate with a failed keepalive link though, right?

So, we have two power restoration failure vectors:
- failure of one switch
- failure of the keepalive path (assuming use of the mgmt0 ports)

I've just perused the documentation for 'auto-recovery', which indicates it will handle the "dead peer" power restoration scenario.

Will it handle the "dead keepalive path" scenario? Please report back. I don't have any Nexus available to me right now.

Thanks!

Unknown, 2011-01-15T21:57:08.606+00:00

By default, vPC will not come up without keepalives. All I have to do is admin shutdown my keepalive links and reboot both switches. The vPC will never be restored and any client on a "redundant" vPC link will just stay down forever.

I spent last week in San Jose, including 2 full days at Cisco. During that time, I expressed my concern about the issue of vPC restoration after a power event in the DC. They said that the power issue was resolved recently in code that is now out. I did some searching, and I believe "auto-recovery" is the command. I have not yet had a chance to test this in the lab.

This would also appear to resolve my concerns about making the vPC keepalive link redundant. I was going to dedicate 2 10g links between vPC pairs for keepalive redundancy, because I feared that the management switch (typically single power sourced) might not come back after a power event, leaving all my vPCs down. If they can auto-restore vPCs using this command even when the management/keepalive network is down, then my fear of using the management port goes away.

I'll try to get this tested next week and report back.

Many of my customers have DR/GSLB, but of course that doesn't stop them from demanding SLA credits for every minute that any location is down. Having both switches reboot due to a power blip is painful. Having service not immediately come back (due to one-side hardware/power failure) could easily increase my SLA credits by hundreds of thousands of dollars.

chris marget, 2011-01-09T17:11:29.028+00:00

Hi Matt,

The failure scenario is a scary one, but will be addressed in an upcoming release, according to rumors I've heard. I have no details on that.

My experience with the failure scenario was actually my first N5K production deployment: One of the switches was DOA (twice!), and I couldn't bring up the environment on schedule.

For my customers, this failure mode isn't too big of a concern: If both members of a vPC go down, big DR/BCP wheels start turning, and processing is migrated to a DR facility. These customers wouldn't roll back into the primary site with half of a switching tier down anyway, so the inability to bring a vPC back up isn't a show-stopper.

The vPC keepalive link is already a redundant piece. I'm pretty sure that the vPC will come up even with the keepalive link down (so long as the peer link is up).

Unknown, 2011-01-09T13:35:44.302+00:00

I'm trying to design my first n7k + n5k + n2k deployment and my first use of vPC, and after reading all the design guides and playing in the lab I came to the same conclusion about vPC not coming up if only one switch returns after a power outage. This is rather terrifying. During a power outage, staff are typically overloaded. The last thing we want is more things that we have to manually mess with during the outage.

I was also pondering redundancy for the vPC keepalives. At first I came to the same conclusion as you, but then I got to thinking about the power outage scenario. What if there's a power outage and both vPC peers return, but the (usually low end and even single power source) management switch does not return? Now we're stuck with the vPC down. This makes me think it is probably better to put the vPC keepalive on a direct link between the switches or even a redundant 2x bundle directly between peers. Yes, I feel silly using 2x 10gig links for keepalives, but it sounds like this is the way to go for maximum reliability/survivability.

chris marget, 2010-10-10T14:55:22.170+01:00

Hi, Dan.

I haven't tested this, but I'm not confident that this will help. The scenario described here doesn't smell like an err-disabled port to me.

Dan (http://dans-net.com), 2010-10-09T17:20:13.399+01:00

err-disable recovery might help you. It's in Nx7K. I guess it will come one day to Nx5k.

http://www.cisco.com/en/US/partner/docs/switches/datacenter/sw/4_2/nx-os/interfaces/command/reference/if_commands.html#wp1359960
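
The reload outcomes reported in the comments on this post can be collected into a small decision table. The Python sketch below is only a summary of what the commenters describe after both vPC peers reboot, not Cisco's implementation: the names `Outcome` and `vpc_after_reload` are hypothetical, the "everything healthy" case is an assumption implied by the thread, and the peer-link-down-with-keepalive-up case is left open because nobody in the thread answers it.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "vPCs come up normally"
    AUTO_RECOVERED = "surviving switch brings vPCs up after the auto-recovery delay"
    STAYS_DOWN = "vPCs stay down until someone intervenes"
    NOT_COVERED = "not settled in the comment thread"


def vpc_after_reload(peer_link_up: bool, keepalive_up: bool,
                     auto_recovery: bool) -> Outcome:
    """Outcome seen by one switch after both vPC peers have rebooted."""
    if peer_link_up and keepalive_up:
        # Assumption: the healthy case, implied but never spelled out.
        return Outcome.NORMAL
    if peer_link_up:
        # Keepalive path is dead (e.g. the management switch never returned).
        # Commenters report auto-recovery does not trigger while the
        # peer-link is up, so the vPCs never recover on their own.
        return Outcome.STAYS_DOWN
    if not keepalive_up:
        # Peer looks completely dead: no keepalive and no peer-link.
        return Outcome.AUTO_RECOVERED if auto_recovery else Outcome.STAYS_DOWN
    # Peer-link down but keepalive up: asked in the thread, never answered.
    return Outcome.NOT_COVERED


if __name__ == "__main__":
    # Print the full table for a switch with auto-recovery configured.
    for peer_link in (True, False):
        for keepalive in (True, False):
            result = vpc_after_reload(peer_link, keepalive, auto_recovery=True)
            print(f"peer-link up={peer_link}, keepalive up={keepalive}: {result.value}")
```

Read this way, the argument for a direct keepalive cable is that it keeps a management-switch failure from putting a rebooted pair in the second branch, the one auto-recovery does not cover.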