Thursday, August 31, 2017

Using FQDN for DMVPN hubs

I've done some testing with specifying DMVPN hubs (NHRP servers, really) using their DNS name, rather than IP address.

This matters to me because of some goofy environments where spoke routers can't predict what network they'll be on (possibly something other than the internet), and where I can't leverage multiple hubs per tunnel due to a control plane scaling issue.

The DNS-based configuration includes the following:

 interface Tunnel1  
  ip nhrp nhs dynamic nbma dmvpn-pool.fragmentationneeded.net  

There's no longer a requirement for any ip nhrp map or ip nhrp nhs x.x.x.x configuration when using this new capability.

My testing included some tunnels with very short ISAKMP and IPsec re-key intervals. I found that the routers performed the DNS resolution just once. They didn't go back to DNS again for as long as the hub was reachable.

Spoke routers which failed to establish a secure connection for whatever reason would re-resolve the hub address each time the TTL on the DNS response expired. But once they succeeded in connecting, I observed no further DNS traffic for as long as the tunnel survived.

The name I published (dmvpn-pool.fragmentationneeded.net above) has multiple A records. The DNS server randomizes the record order in its responses, and spoke routers always connected to the first address on the list.

The random-ordered DNS response makes for a kind of nifty load balancing and failover capability:

  1. The spokes will naturally balance across the population of hubs, depending on the whim of the DNS server.
  2. I don't strictly need a smart (GSLB style) DNS server to effect failover, because spokes will eventually find their way to a working hub, even with bad records in the list.
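
The balancing effect can be illustrated with a small simulation. This is only a sketch of the behavior described above, assuming the DNS server shuffles the records uniformly at random and every spoke takes the first address. The three hub addresses are hypothetical placeholders from the documentation range, not the real records.

```python
import random
from collections import Counter

# Hypothetical hub addresses (documentation range), standing in for the A records.
HUB_ADDRESSES = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def resolve(rng: random.Random) -> list[str]:
    """Model a DNS server that returns the A records in random order."""
    answer = HUB_ADDRESSES.copy()
    rng.shuffle(answer)
    return answer


def pick_hub(rng: random.Random) -> str:
    """Model a spoke that always uses the first address in the answer."""
    return resolve(rng)[0]


# Assumption: 3000 spokes, each resolving independently.
rng = random.Random(2017)
hub_load = Counter(pick_hub(rng) for _ in range(3000))

for address, spokes in sorted(hub_load.items()):
    print(f"{address}: {spokes} spokes")
```

Under those assumptions each hub ends up with roughly a third of the spokes, with no coordination beyond the shuffled answer.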


With 3 hub routers, the following happens when one fails:

  • At T=0, 67% of the routers remain connected.
  • At T=<keepalive>s, 89% of routers are connected (2/3 of the orphans are back online; the others are trying the dead hub again).
  • At T=TTLx1, 96% of routers are connected (1/3 of the orphans from the previous interval tried the dead hub a second time).
  • At T=TTLx2, 99% of routers are back online

Things recover fairly quickly with short TTL intervals, even without a GSLB, because the spokes keep trying and only need to find a working record once. This DMVPN tunnel isn't the only path in my environment, so a couple of minutes of outage is acceptable.
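
The recovery percentages can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: spokes start out evenly spread across the hubs, and each re-resolution independently lands on any hub (dead or alive) with equal probability. The function name is just for illustration.

```python
from fractions import Fraction


def connected_fraction(hubs: int, failed: int, attempts: int) -> Fraction:
    """Fraction of spokes connected after `attempts` reconnect attempts.

    A spoke is still offline only if it sat on a dead hub to begin with
    and then picked a dead hub again on every subsequent attempt.
    """
    p_dead = Fraction(failed, hubs)
    return 1 - p_dead ** (attempts + 1)


# 3 hubs, 1 failure: T=0, T=<keepalive>, T=TTLx1, T=TTLx2
for attempts in range(4):
    share = connected_fraction(hubs=3, failed=1, attempts=attempts)
    print(f"after {attempts} reconnect attempts: {float(share):.1%}")
```

Each additional attempt shrinks the offline population by the same factor (here 1/3), which is why a short TTL matters more than a smart DNS server in this setup.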


A 60 second TTL will result in roughly 43K queries/month (one per minute) for each spoke that can't connect (problems with firewall, overload NAT, credentials, etc.), so watch out for that if you're using a service that causes you to pay per query :)
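
The query volume is easy to estimate for other TTL values. This sketch assumes a stuck spoke sends exactly one query each time the TTL expires and that a month is 30 days; the function name is hypothetical.

```python
def queries_per_month(ttl_seconds: int, days: int = 30) -> int:
    """Queries sent by one spoke that re-resolves once per TTL expiry."""
    seconds_per_month = days * 24 * 60 * 60
    return seconds_per_month // ttl_seconds


for ttl in (30, 60, 300):
    print(f"TTL {ttl:>3}s: {queries_per_month(ttl):,} queries/month per stuck spoke")
```

Multiply by the number of spokes that are likely to be stuck at any one time to get a feel for the bill.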