Friday, November 26, 2010

HSRP and STP uplinkfast - What does one have to do with the other?

Are HSRP and STP uplinkfast related?  Do they interoperate in some way?  I don't think so, but I can't be sure.  Frustratingly, Cisco's TAC doesn't seem to be sure either.

While pondering Ethan's post about ARP timeouts, I noticed something funny:  Routers running HSRP send traffic to the STP uplinkfast reserved hardware address 0100.0ccd.cdcd.

This was completely unexpected.

We all know and love HSRP.  It multicasts HSRP messages to peer routers on 224.0.0.2 / 0100.5e00.0002, and it responds to ARP queries for the virtual IP with one of 3 options:

  • The HSRP standard MAC in the range 0000.0c07.ac??
  • A manually configured address with standby mac-address xxxx.xxxx.xxxx
  • The burned-in address with standby use-bia
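
To make the addressing concrete, here is a minimal Python sketch of how those well-known addresses are derived. It assumes HSRP version 1, where the standard virtual MAC is 0000.0c07.acXX with XX being the group number, and the usual IPv4 multicast mapping (01-00-5e plus the low 23 bits of the group address). The function names are made up for illustration.

```python
import ipaddress


def cisco_mac(raw: bytes) -> str:
    """Format 6 raw bytes in Cisco dotted notation, e.g. 0100.5e00.0002."""
    hexstr = raw.hex()
    return ".".join(hexstr[i:i + 4] for i in range(0, 12, 4))


def hsrp_v1_virtual_mac(group: int) -> str:
    """Standard HSRPv1 virtual MAC: 0000.0c07.acXX, where XX is the group."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group numbers are 0-255")
    return cisco_mac(bytes([0x00, 0x00, 0x0C, 0x07, 0xAC, group]))


def ipv4_multicast_mac(group_ip: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC."""
    addr = ipaddress.IPv4Address(group_ip)
    if not addr.is_multicast:
        raise ValueError(f"{group_ip} is not a multicast address")
    low23 = int(addr) & 0x7FFFFF  # only the low 23 bits are carried in the MAC
    return cisco_mac(bytes([0x01, 0x00, 0x5E]) + low23.to_bytes(3, "big"))


print(ipv4_multicast_mac("224.0.0.2"))  # the HSRP hello destination
print(hsrp_v1_virtual_mac(10))          # virtual MAC for standby group 10
```

Neither of these calculations produces anything resembling 0100.0ccd.cdcd, which is part of why that address stood out.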

Other than HSRP coordination traffic (hellos and whatnot) and ARP replies (unicast and gratuitous broadcast on HSRP takeover), I wasn't expecting anything else to come from an HSRP router.  But it sends frames to the peculiar STP uplinkfast address too!  Could L2 switches running uplinkfast be listening to HSRP routers?  Are they doing something with the information in those frames?

Uplinkfast?
Uplinkfast is a Cisco proprietary enhancement to 802.1D (slow) spanning tree.  Switches configured for uplinkfast will identify an alternate root port:  One that's currently in blocking mode, and which isn't self-looped.  If the root port fails, the backup port is put directly into forwarding mode.  It skips the time consuming listening and learning phases.

Any MAC addresses learned on the old root port are moved to the new root port in the forwarding table.  They don't need to be re-learned.

Additionally, the switch sends bogus Ethernet frames out the new root port.  These frames are stamped with spoofed source addresses belonging to client systems on the switch's designated (downstream) ports.  The purpose of these frames is to update the forwarding table on the upstream switches, so they'll forward traffic correctly for our switch's downstream clients, which are now attached to a different spot in the L2 topology.

What's in these spoofed frames?  Consider:
  • Our switch is an L2 device.  It doesn't know anything about its clients' IP addresses, and might not even have an IP address of its own.  IP packets are out.
  • That's okay, because the goal is to update L2 forwarding (mac-address) tables.
  • He wants to update every bridge in the spanning tree that's reachable through his root port, so unicast frames are out.
  • Broadcast frames will be delivered to end stations, who might try to do something with them.  Some IP implementations are built like a house of cards, so it would be good to send frames unlikely to be processed by an IP stack.  Broadcast frames are out.
When an uplinkfast transition occurs, the switch spoofs frames from each downstream client.  The frames are sent to the uplinkfast multicast address 0100.0ccd.cdcd, and flooded throughout the upstream portion of the spanning tree.  The upstream switches don't need to (and probably shouldn't!) be running uplinkfast, and don't even need to be Cisco switches.  The regular MAC learning mechanisms implemented on any learning bridge will update the L2 forwarding table appropriately.
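
Here is a toy Python model of why plain MAC learning is enough. It is only a sketch: the LearningBridge class, port numbers, and client MACs are invented for illustration, and real switches obviously do a great deal more. The point is that learning keys on the source address alone, so the odd destination address doesn't matter to the upstream bridge.

```python
UPLINKFAST_DST = "0100.0ccd.cdcd"  # uplinkfast multicast destination


class LearningBridge:
    """Toy learning bridge: remembers which port each source MAC was seen on."""

    def __init__(self):
        self.table = {}  # source MAC -> port number

    def receive(self, port, src, dst):
        # Learning looks only at the source MAC; dst plays no part in it.
        self.table[src] = port


def uplinkfast_frames(client_macs):
    """One spoofed (src, dst) pair per downstream client."""
    return [(mac, UPLINKFAST_DST) for mac in client_macs]


# Hypothetical clients hanging off the access switch.
clients = ["0011.2233.4401", "0011.2233.4402"]

upstream = LearningBridge()

# Before the failure: the upstream bridge learned the clients on port 1.
for mac in clients:
    upstream.receive(port=1, src=mac, dst="ffff.ffff.ffff")

# After the uplinkfast transition: spoofed frames arrive on port 2 instead.
for src, dst in uplinkfast_frames(clients):
    upstream.receive(port=2, src=src, dst=dst)

print(upstream.table)
```

After the second loop, every client entry points at port 2, with no uplinkfast awareness required on the upstream side.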

It doesn't really matter what's in a spoofed uplinkfast packet, but the Catalysts in my lab send two frames for each client:
  • The first is an 802.3/SNAP encapsulated frame with mostly unrecognizable contents
  • The second is an Ethernet II encapsulated frame carrying a typecode of ARP, formatted like an ARP frame, but with nonsense contents.
I don't think that Cisco switches apply any special handling to these uplinkfast frames, but I can't be sure about it.  On the surface, there doesn't seem to be any requirement for anything beyond the standard learning mechanisms.

So, what does HSRP have to do with this?
When an HSRP router transitions to the active state, it can face a few challenges, mostly in the areas of L2 topology and ARP tables:
  • Downstream switches might be forwarding the HSRP MAC address towards a dead router
  • Clients might have outstanding ARP queries (if we're the first router to come up)
  • Clients might have bad data in their ARP cache (manual HSRP configuration or use-bia)
A single gratuitous ARP packet will solve all of these problems, provided that we don't have a paranoid client that only processes solicited ARP replies.  So that's what the router does.  He sends a gratuitous ARP, announcing the HSRP IP/MAC mapping to the all-ones (broadcast) Ethernet address.  In fact, he sends 3 of them, with a 3-second pause after each one.

But that's not all.  Immediately after each broadcast ARP, the newly minted active HSRP router sends another ARP to the STP uplinkfast reserved multicast MAC address.  These frames are properly formatted, and except for the different destination MAC (uplinkfast) and ARP target MAC (also uplinkfast) fields, they're identical to the broadcast ARP frames.
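
To show how small the difference between the two frames is, here is a Python sketch that builds both as raw bytes. Several details are my assumptions rather than anything captured above: the ARP opcode is set to reply (2), the Ethernet source is the virtual MAC, the broadcast frame carries ffff.ffff.ffff as its ARP target MAC, and the group number and virtual IP are invented. Nothing is transmitted; the code only constructs and compares the frames.

```python
import socket
import struct

BROADCAST = bytes.fromhex("ffffffffffff")
UPLINKFAST = bytes.fromhex("01000ccdcdcd")  # 0100.0ccd.cdcd


def gratuitous_arp(dst_mac: bytes, virtual_mac: bytes, virtual_ip: str) -> bytes:
    """Build an Ethernet II frame carrying a gratuitous ARP for virtual_ip.

    Assumption: dst_mac is used both as the Ethernet destination and as
    the ARP target hardware address, and the opcode is 2 (reply).
    """
    ip = socket.inet_aton(virtual_ip)
    eth = struct.pack("!6s6sH", dst_mac, virtual_mac, 0x0806)  # ethertype ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware / protocol address lengths
        2,            # opcode: reply (assumed)
        virtual_mac,  # sender hardware address
        ip,           # sender protocol address
        dst_mac,      # target hardware address
        ip,           # target protocol address (same IP: gratuitous)
    )
    return eth + arp


vmac = bytes.fromhex("00000c07ac0a")  # hypothetical: HSRPv1 group 10
bcast = gratuitous_arp(BROADCAST, vmac, "192.0.2.1")
uplnk = gratuitous_arp(UPLINKFAST, vmac, "192.0.2.1")

# Byte offsets where the two frames differ.
diff = [i for i, (a, b) in enumerate(zip(bcast, uplnk)) if a != b]
print(len(bcast), diff)

# Transmit plan as described: 3 rounds, broadcast first, 3 seconds apart.
plan = [(t, name) for t in (0, 3, 6) for name in ("broadcast", "uplinkfast")]
print(plan)
```

Under those assumptions, the frames differ only in the six destination bytes of the Ethernet header and the six target-hardware-address bytes of the ARP payload.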

I pondered this for a while...  An HSRP state transition could have been precipitated by an L2 topology change...  Is there some circumstance where it would be useful to have a port in listening mode learn the router MAC anyway?  Or maybe have a port in learning mode forward the new router's frame?

I can't figure it out, but there's got to be some reason that Cisco programmed this behavior into their routers, right?

The TAC case has been open for 6 weeks.  Explanations they've given include:
  • The router needs to update L2 forwarding tables.  ...Okay, but the broadcast frames do that job equally well, and the broadcast frames are sent first.  Why code in this crazy multicast address, which appears to be reserved for a different purpose altogether?
  • The L2 switches hear the frames, and adapt the shape of the resulting spanning tree to better accommodate the gateway router.  ...What, what?  How does this work?
  • Oh, never mind.  The spanning tree doesn't change shape.  ...Of course it doesn't.
  • The L2 switches respond to the special MAC, and flush the source from their CAM tables.  Exactly the opposite of learning.   ...Not according to my tests they don't.  And if they did, this would lead to ridiculous learn/flush/learn/flush/learn/flush gyrations.
  • We don't know why this behavior was programmed in.  ...Okay, at least this is easy to believe.
If you have any clue why Cisco might have written the uplinkfast address into HSRP code, please share!

Tuesday, November 23, 2010

BGP Adjacency - Spot The Error

A couple of years ago I configured a topology for a business partner extranet much like the one sketched below.

No dynamic routing was allowed on the firewall.  Layer 9 didn't trust it to run an IGP, so the firewall was configured with static routes:
 - Known internal nets (registered and 1918 space) pointed in
 - Default route pointed out

Two eBGP sessions were configured to learn business partner prefixes (not shown) from the external switch, and redistribute them into the IGP.  It was a small number of prefixes, and they were thoroughly filtered and quantity-limited, making things safe for the IGP.

But it didn't work correctly:  Either BGP session could be brought up on its own, but never both at once.

The cause of the error took me more hours of head-scratching than I care to admit.  In my defense, the topology was actually quite a bit more complicated than depicted here.  Presented here is the bare minimum required to recreate the problem.

The problem was neither a firewall policy issue, nor a typo.  Any typos here are just typos.

Can you spot my mistake?  Which session comes up, and what's wrong with the other one?