Fragmentation Needed: November 2015

Tuesday, November 10, 2015

Anycast For DMVPN Hubs

Dynamic assignment of DMVPN spoke tunnel addresses isn't just a matter of convenience. It provided the foundation for a recent design which included the following fun requirements:

There are many hub sites.

Spokes will be network-near exactly one hub site.

Latency between hub sites is high.

Bandwidth between hub sites is low.

Spoke routers don't know where they are in the network.

Spoke routers must connect only to the nearest hub.

The underlay topology in this environment¹ made it safe for me to anycast the DMVPN hubs, so that's what I did. This made the "connect to the nearest hub" problem easy to solve, but introduced some new complexity.

Hub Anycast Interface
Each DMVPN hub router has a loopback interface with address 192.0.2.0/32 assigned to the front-door VRF. It's configured something like this:

 interface loopback 192020
  description DMVPN hub anycast target  
  ip vrf forwarding LTE_TRANSIT  
  ip address 192.0.2.0 255.255.255.255

The 192.0.2.0/32 prefix was redistributed into the IP backbone. If this device were to fail, the IGP would select the next-nearest instance of 192.0.2.0.
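
The failover behavior can be modeled in a few lines of Python. This is only a toy sketch: the hub names and path costs are invented, and a real IGP computes these costs itself rather than reading them from a dictionary.

```python
def nearest_hub(costs, failed=frozenset()):
    """Return the surviving hub with the lowest path cost, or None if none remain."""
    # Drop hubs that have stopped advertising the anycast prefix.
    alive = {hub: cost for hub, cost in costs.items() if hub not in failed}
    return min(alive, key=alive.get) if alive else None


# Hypothetical costs from one spoke's point of entry to each hub
# advertising 192.0.2.0/32.
costs = {"hub-a": 10, "hub-b": 40, "hub-c": 75}

print(nearest_hub(costs))                    # normal operation
print(nearest_hub(costs, failed={"hub-a"}))  # after the nearest hub fails
```

The spoke never learns which hub it landed on; it only ever sees 192.0.2.0.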

Spoke Configuration
Spokes look pretty much exactly like the ones in the DMVPN via DHCP post. They're all looking for a hub at 192.0.2.0. The only interesting bits have to do with BGP neighbors (also anycast). I'll get to those later.

Hub DMVPN Interface
Each hub ran a separate IP subnet on its DMVPN interface. This means that I needed several of the static interface routes for DHCP as described at the end of the previous post. One of them made DHCP work, while the rest were superfluous, but at least they were correct.

The hub's DMVPN interface sources the tunnel from the anycast loopback interface and uses LTE_TRANSIT for tunneled traffic. The hub in this example is 203.0.113.1/27:

 interface tunnel 192020  
  ip address 203.0.113.1 255.255.255.224
  tunnel source loopback 192020  
  tunnel vrf LTE_TRANSIT

Hub BGP Interface
The hub/spoke routing protocol in this environment is eBGP for various reasons. Ordinarily I'd have the spokes talk to the hub using the hub's address in the DMVPN subnet. That's not possible here because the spoke doesn't know the hub's address, because he doesn't know which hub he's using. Anycast to the rescue again!

The hubs each have a loopback at 198.51.100.0/32 in the global routing table (GRT is used for DMVPN - no IVRF here). Spokes are configured to look for a BGP neighbor at this address. There's a problem here: The spoke's BGP neighbor isn't directly connected and the spoke doesn't yet know how to reach it.

 interface loopback 198511000  
  description anycast target for spoke BGP sessions  
  ip address 198.51.100.0 255.255.255.255  
!
 router bgp 65000  
  bgp listen range 203.0.113.0/27 peer-group PG_SPOKES
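
Because every hub uses a different tunnel subnet, the BGP listen range (and the DHCP pool in the next section) has to track each hub's tunnel address. Here is a small sketch, using Python's standard ipaddress module, that derives those values from a single tunnel address. The function name and dictionary keys are made up for illustration.

```python
import ipaddress


def hub_parameters(tunnel_address):
    """Derive per-hub values from a tunnel address given in CIDR form."""
    iface = ipaddress.ip_interface(tunnel_address)
    net = iface.network
    return {
        # Address and mask as they appear under the tunnel interface.
        "tunnel_ip": f"{iface.ip} {net.netmask}",
        # Prefix for "bgp listen range".
        "bgp_listen_range": str(net),
        # Network and mask for the DHCP pool's "network" statement.
        "dhcp_network": f"{net.network_address} {net.netmask}",
    }


for key, value in hub_parameters("203.0.113.1/27").items():
    print(f"{key}: {value}")
```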

DHCP Service
Each hub has a DHCP pool configured to represent the local DMVPN interface. For example, this router is 203.0.113.1/27, so it uses the following pool configuration:

 ip dhcp pool DMVPN_POOL  
  network 203.0.113.0 255.255.255.224  
  option 121 hex 20c6.3364.00cb.0071.01

Option 121 specifies a classless static route to be used by DHCP clients. This is the mechanism by which the spokes find their BGP neighbor at 198.51.100.0. Breaking down the hex, we have:

0x20 = 32 -- This is the prefix length of the route. A host route.
0xc6336400 = 198.51.100.0 -- This is the target prefix of the route, as bounded by the length above. Note that this field is not always 4 bytes long. The RFC authors did this cute/maddening thing where the length of this field depends on the value of the prefix length byte. Ugh.
0xcb007101 = 203.0.113.1 -- The next-hop IP for the route. Hey, that's this hub's tunnel address!
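
Hand-assembling that hex gets old quickly when every hub needs a different next hop. Below is a minimal Python sketch of the RFC 3442 encoding described above, using only the standard library. The function names are made up, and the dotted output format simply mimics the pool configuration shown earlier.

```python
import ipaddress


def encode_option_121(routes):
    """Encode (destination prefix, next hop) pairs as option 121 payload bytes."""
    out = bytearray()
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        # Only the significant octets of the destination are sent:
        # ceil(prefix length / 8) of them.
        significant = (net.prefixlen + 7) // 8
        out.append(net.prefixlen)
        out += net.network_address.packed[:significant]
        out += ipaddress.ip_address(next_hop).packed
    return bytes(out)


def ios_hex(data):
    """Format bytes as dot-separated groups of four hex digits."""
    digits = data.hex()
    return ".".join(digits[i:i + 4] for i in range(0, len(digits), 4))


print(ios_hex(encode_option_121([("198.51.100.0/32", "203.0.113.1")])))
# prints 20c6.3364.00cb.0071.01
```

A /24 destination would carry only three destination octets, which is the variable-length behavior grumbled about above.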

Now, no matter which hub a spoke attaches to, the spoke will find an off-net BGP neighbor at 198.51.100.0, and will have a static route (assigned by DHCP) to ensure reachability to that neighbor.

Spoke BGP
The spoke routers use the 'disable-connected-check' feature to talk to the BGP anycast interface on the hub while still using TTL=1:

 router bgp 65001  
  neighbor 198.51.100.0 remote-as 65000  
  neighbor 198.51.100.0 disable-connected-check  
  neighbor 198.51.100.0 update-source Tunnel0

Remaining challenge
The spokes are behind LTE-enabled NAT routers because there's no Cisco hardware available with the correct LTE bands.

Ordinarily, the LTE-assigned IP address won't change with mobility, but it does change if the EPC which owns the client's address is shut down. In those cases, I found the spokes re-established connections with the now-nearest DMVPN hub right away, but the spoke's tunnel interface held onto the old DHCP address.

The if-state nhrp command might have taken care of this², but I've had some bad experiences and don't entirely trust it. I used EEM instead:

 event manager applet BGP_NEIGHBOR_198.51.100.0_DOWN  
  event snmp oid 1.3.6.1.2.1.15.3.1.2.198.51.100.0 get-type exact entry-op ne entry-val "6" exit-op eq exit-val "6" poll-interval 10  
  action 1.0  syslog priority errors msg "BGP with 198.51.100.0 down, bouncing Tunnel 0..."  
  action 2.0  cli command "enable"  
  action 2.1  cli command "configure terminal"  
  action 3.0  cli command "interface tunnel 0"  
  action 3.1  cli command "shutdown"  
  action 4.0  cli command "do clear crypto isakmp sa"  
  action 4.1  cli command "do clear crypto ipsec"  
  action 5.0  cli command "interface tunnel 0"  
  action 5.1  cli command "no shutdown"
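
The polled OID is bgpPeerState from the BGP4-MIB, indexed by the peer's address, and a value of 6 means established. The sketch below models the applet's trigger in Python, assuming the usual EEM entry/exit semantics: the event fires when a poll returns anything other than 6, and it does not re-arm until a poll returns 6 again. The function names and the sample sequence are invented for illustration, and the code does not talk SNMP.

```python
BGP_PEER_STATE_OID = "1.3.6.1.2.1.15.3.1.2"  # bgpPeerState column, BGP4-MIB
ESTABLISHED = 6


def peer_state_oid(peer):
    """Build the per-peer OID polled by the applet."""
    return f"{BGP_PEER_STATE_OID}.{peer}"


def bounce_polls(samples):
    """Return the indexes of the polls at which the applet would fire."""
    armed = True
    fired = []
    for index, state in enumerate(samples):
        if armed and state != ESTABLISHED:
            fired.append(index)   # entry criterion met: bounce the tunnel
            armed = False         # stay quiet until the exit criterion is met
        elif not armed and state == ESTABLISHED:
            armed = True          # exit criterion met: re-arm
    return fired


print(peer_state_oid("198.51.100.0"))
print(bounce_polls([6, 6, 3, 1, 1, 6, 6, 2]))  # prints [2, 7]
```

Under this model the tunnel is bounced once per outage, not once per poll, no matter how long the session stays down.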

1 Spoke routers were wireless devices on a private band LTE network. Hub routers were physically located with LTE equipment, very close to where the LTE Evolved Packet Core hands off to the IP network. There's no opportunity for the "nearest" DMVPN hub to change from one site to another without the spoke losing its IP address on the LTE network.

2 Actually, I'm completely unsure what if-state nhrp will do with a dynamically assigned tunnel address. DHCP can't happen until the interface comes up, and the interface can't come up without NHRP registration, which requires an address...

Monday, November 9, 2015

OpenSwitch: Exciting Stuff

It was about a month ago that HP (along with several partners) announced OpenSwitch, a new network OS for white box switching hardware.

This week, HPE brought OpenSwitch Chief Architect Michael Zayats to present to TFDx delegates at the ONUG conference. I was fortunate to be one of these delegates and the usual disclaimers apply.

What is OpenSwitch?
It's an open source network OS for whitebox switching platforms. The code is open, and so is the development process. They're actively encouraging people to get involved. Coordination is done over IRC, bug tracking is open, documentation is available for edit, etc... Open. Open. Open.

Who is behind OpenSwitch?
Well, first there's the vendor consortium. To a large degree, it's that new company with the boxy logo: HPE. They employ the chief architect and a handful of developers. There are some other vendors, notably Broadcom (without whom this couldn't happen because of their NDA policies around silicon drivers), switch manufacturers (ODMs), etc...

Also of critical importance are the users: There are already some large end-user companies playing with, using, and contributing to OpenSwitch.

Wait, how many OSes is HPE shipping/supporting now?
Yeah... Awkward! That's a couple of versions of Comware, Provision, FASTPATH, plus whatever's going on inside their VirtualConnect / Flex-10 gear. It seems like a lot.

Look, it's not Cisco-style lots of OSes, but each of Cisco's OSes has an easy to understand origin story that begins with either an acquisition or with solving a product problem. Choosing to produce a new OS just because, and then giving it away, is something new.

So... Why did HP get this bandwagon rolling?
<speculation>Well, anything that hurts Cisco is good for HP, right?</speculation>

Tell me more about OpenSwitch?
Following are some of the things that stood out from Michael's presentation:

It's a box-stock Linux kernel. There are no OpenSwitch add-ons to the kernel.

OpenSwitch can do "SDN" (whatever that means to you), but it doesn't have to. It can also look and feel like a traditional network OS with a familiar-looking CLI, a one-stop configuration file, no ambiguity about expected behavior when elements get removed from the configuration, etc... This distinguishes it from Cumulus Linux, which I really like, but which isn't well suited to hands-on configuration by network engineers expecting a legacy OS experience.

In addition to operating like a traditional CLI-based device, OpenSwitch has standardized interfaces at every layer for programmatic control by almost anything:

JSON configuration interface behind the CLI
RFC7047 (OVSDB) interface between all modules, internal and external
OpenFlow module (which speaks to OVSDB)

And because those interfaces are standardized, if the crazy interface you require isn't offered, you can add it.

OpenSwitch is kind of like a hardware-accelerated version of OvS: It has a kernel dedicated to running only OvS, it runs in a sheetmetal box, and it sports physical network interfaces connected to a dedicated forwarding ASIC. Pretty nifty.

Unlike Cumulus Linux, all of the OpenSwitch physical interfaces are assigned to a dedicated kernel namespace.

Every software module in OpenSwitch talks to OVSDB, including the routing protocols (bgpd, ospfd, ripd, etc...). Rather than use the traditional interprocess mechanism, in which the routing protocols talk to Quagga, OpenSwitch moved things around so that the routing protocols publish into OVSDB. Quagga hears from the routing protocols, makes its selections, and publishes the resulting "quagga fib" back into OVSDB.
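
To make the data flow concrete, here is a toy Python model of the publish, select, publish-back pattern. Nothing in it is OpenSwitch or OVSDB code: the class, table names, and row layout are invented, and picking the winner by lowest administrative distance is an assumption about how the selection works, not something Michael's presentation spelled out.

```python
from collections import defaultdict


class ToyDatabase:
    """Stand-in for the shared database; tables are just lists of rows."""

    def __init__(self):
        self.tables = defaultdict(list)

    def publish(self, table, row):
        self.tables[table].append(row)

    def read(self, table):
        return list(self.tables[table])


def select_routes(db):
    """Read every candidate route, keep the best per prefix, publish the result."""
    best = {}
    for row in db.read("candidate_routes"):
        current = best.get(row["prefix"])
        # Assumed tie-breaker: lowest administrative distance wins.
        if current is None or row["distance"] < current["distance"]:
            best[row["prefix"]] = row
    for prefix in sorted(best):
        db.publish("fib", best[prefix])


db = ToyDatabase()
# Two protocol daemons publish competing routes for the same prefix.
db.publish("candidate_routes", {"prefix": "10.0.0.0/8", "source": "ospfd", "distance": 110})
db.publish("candidate_routes", {"prefix": "10.0.0.0/8", "source": "bgpd", "distance": 20})
select_routes(db)
print(db.read("fib"))
```

The point is that the daemons never talk to each other directly. Each one only reads from and writes to the database.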

Frankly, this "everything talks to the database" business looks a lot like Arista's architecture. I think that OVSDB only keeps current info, rather than Arista's re-playable journal/ledger scheme. Still, this is pretty cool, and represents a huge improvement over monolithic architectures.

The biggest current bummer

There's still a bit of a problem with Broadcom and NDA stuff. As I understand it, below OVSDB is an ASIC-independent driver layer (open source), an ASIC-dependent driver layer (also open source), and a binary-only ASIC-dependent layer (binary blob produced by a Broadcom NDA-holder).

No big surprises there: nobody expected Broadcom to open everything. The problem is that something inside the binary blob is responsible for configuring the ASIC to run in the particular platform. Stuff like the mapping of SERDES lanes onto switch ports lives here. This means that you can't throw together an OpenSwitch distribution for any old platform. Only a Broadcom partner can do that. This is a bummer. Broadcom needs to pull this bit out from under the NDA blanket so that people can run OpenSwitch on any platform, not just the ones a Broadcom partner agrees to compile for.