Monday, August 17, 2015

Path MTU Discovery with DMVPN Tunnels

Ivan Pepelnjak's excellent article on IP fragmentation from 2008 is very thorough, but it doesn't cover the functionality of Cisco's tunnel path-mtu-discovery feature when applied to mGRE (DMVPN) interfaces.

I played with it a bit, and was delighted to discover that the dynamic tunnel MTU mechanism operates on a per-NBMA-neighbor basis, much like ip pim nbma-mode does on the same interface type. Both features do all the right things, just like you'd hope they would.
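For anyone who hasn't run into that comparison before: ip pim nbma-mode is the other interface-level knob that keeps per-NBMA-peer state on an mGRE interface, so PIM joins and multicast replication are tracked per spoke rather than treating the multipoint interface like a LAN. A minimal sketch of how it would sit on a hub tunnel (not part of the lab config below, just an illustration):
 interface Tunnel0  
  ip pim sparse-mode  
  ip pim nbma-mode  
 end  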

Here's the topology I'm using:
[Topology diagram: constrained MTU in the path between R1 and R4]


The DMVPN tunnel interface on R1 is configured with a 1400-byte IP MTU. Once the GRE headers are added, it will generate packets that can't reach R4. It's also configured with tunnel path MTU discovery:
 interface Tunnel0  
  ip address 192.168.1.1 255.255.255.0  
  no ip redirects  
  ip mtu 1400  
  ip pim sparse-mode  
  ip nhrp map multicast dynamic  
  ip nhrp network-id 1  
  tunnel source FastEthernet0/0  
  tunnel mode gre multipoint  
  tunnel path-mtu-discovery  
  tunnel vrf TRANSIT  
 end  
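For reference, here's where the oversized packets come from. With no GRE options configured, the encapsulation adds a 20-byte outer IPv4 header and a 4-byte GRE header on top of the tunnel's 1400-byte IP MTU:
  1400 bytes  inner IP packet (tunnel ip mtu)  
 +  20 bytes  outer IPv4 header  
 +   4 bytes  GRE header  
  ----------  
  1424 bytes  on the transport (NBMA) network  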

The two spokes are online, with NBMA interfaces (the tunnel sources) using 10.x addressing. Both spokes have their NBMA interfaces configured with a 1500-byte MTU and their tunnel MTU set at 1400 bytes (a representative spoke config is sketched after the output):
 R1#show dmvpn  
 Legend: Attrb --> S - Static, D - Dynamic, I - Incomplete  
      N - NATed, L - Local, X - No Socket  
      # Ent --> Number of NHRP entries with same NBMA peer  
      NHS Status: E --> Expecting Replies, R --> Responding  
      UpDn Time --> Up or Down Time for a Tunnel  
 ==========================================================================  
 Interface: Tunnel0, IPv4 NHRP Details   
 Type:Hub, NHRP Peers:2,   
  # Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb  
  ----- --------------- --------------- ----- -------- -----  
    1    10.0.23.3   192.168.1.3  UP 00:18:16   D  
    1    10.0.45.4   192.168.1.4  UP 00:13:23   D  
 R1#  
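The spoke configs aren't shown above; an R4 tunnel interface along these lines would produce the registration in that output. The tunnel and NBMA addresses come from the show dmvpn output (and 10.0.12.1 is R1's NBMA address, as seen in the debugs later), but the rest — sparse-mode, the multicast mapping, no tunnel vrf on the spoke side, and whatever routing carries the loopbacks over the tunnel — is my assumption rather than the actual lab config:
 interface Tunnel0  
  ip address 192.168.1.4 255.255.255.0  
  ip mtu 1400  
  ip pim sparse-mode  
  ip nhrp map 192.168.1.1 10.0.12.1  
  ip nhrp map multicast 10.0.12.1  
  ip nhrp network-id 1  
  ip nhrp nhs 192.168.1.1  
  tunnel source FastEthernet0/0  
  tunnel mode gre multipoint  
 end  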

As Ivan noted, tunnel MTU discovery doesn't happen if the Don't Fragment bit isn't set on the encapsulated packet. If, on the other hand, the DF bit is set, then the DF bit gets copied to the GRE packet's (outer) header. Here we don't set the DF bit, and the ping gets through just fine:
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400  
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 92/92/92 ms  
 R6#  

That ping created a 1424-byte GRE packet that doesn't fit on the link between R2 and R5. Debugs on the target (R4) indicate that it was indeed fragmented in transit:
 *Aug 15 16:33:27.059: IP: s=10.0.12.1 (FastEthernet0/0), d=10.0.45.4 (FastEthernet0/0), len 52, rcvd 3  
 *Aug 15 16:33:27.059: IP: recv fragment from 10.0.12.1 offset 0 bytes  
 *Aug 15 16:33:27.071: IP: s=10.0.12.1 (FastEthernet0/0), d=10.0.45.4 (FastEthernet0/0), len 1392, rcvd 3  
 *Aug 15 16:33:27.071: IP: recv fragment from 10.0.12.1 offset 32 bytes  

52 bytes + 1392 bytes = 1444 bytes. Drop the extra 20-byte IP header from one of those fragments, and we're right at our expected 1424-byte packet size.
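Working it from the other direction, the fragments' payloads line up exactly with the original GRE packet:
   52-byte fragment = 20-byte IP header +   32 bytes of GRE payload (offset 0)  
 1392-byte fragment = 20-byte IP header + 1372 bytes of GRE payload (offset 32)  
 32 + 1372 = 1404 bytes = 4-byte GRE header + the original 1400-byte ping  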

So far, no large packets with the DF bit set have been sent, so no tunnel MTU discovery has happened. The hub reports a dynamic MTU of "0" for the NBMA addresses of both spokes, which I guess means "use the MTU applied to the whole tunnel", which is 1400 bytes in this case:
 R1#show interfaces tunnel 0 | include Path  
  Path MTU Discovery, ager 10 mins, min MTU 92  
  Path destination 10.0.23.3: MTU 0, expires never  
  Path destination 10.0.45.4: MTU 0, expires never  
 R1#  

R6 can ping R3 with an un-fragmentable 1400-byte packet without any problem:
 R6#ping 3.3.3.3 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 3.3.3.3, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 44/44/44 ms  

But when we try this over the constrained path to R4, the ping fails silently. ICMP debugs are on, but no errors rolled in:
 R6#debug ip icmp  
 ICMP packet debugging is on  
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 .  
 Success rate is 0 percent (0/1)  
 R6#

It was R2 that couldn't forward the 1424-byte GRE packet onto the constrained link to R5, so it sent a "packet too big" message not to the originator of the ping (R6), but to the originator of the GRE packet (R1):
 R2#  
 *Aug 15 16:42:18.558: ICMP: dst (10.0.45.4) frag. needed and DF set unreachable sent to 10.0.12.1  

R1 reacted by reducing the tunnel MTU only for R4's NBMA address (10.0.45.4): R2's ICMP message advertised a next-hop MTU of 1400, and R1 subtracted its 24 bytes of GRE overhead to land on a 1376-byte soft-state MTU for that one peer. Pretty nifty.
 R1#  
 *Aug 15 16:42:18.582: ICMP: dst (10.0.12.1) frag. needed and DF set unreachable rcv from 10.0.12.2  
 *Aug 15 16:42:18.582: Tunnel0: dest 10.0.45.4, received frag needed (mtu 1400), adjusting soft state MTU from 0 to 1376  
 *Aug 15 16:42:18.586: Tunnel0: tunnel endpoint for transport dest 10.0.45.4, change MTU from 0 to 1376  
 R1#show interfaces tunnel 0 | include Path  
  Path MTU Discovery, ager 10 mins, min MTU 92  
  Path destination 10.0.23.3: MTU 0, expires never  
  Path destination 10.0.45.4: MTU 1376, expires 00:04:05  
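The "ager 10 mins, min MTU 92" in that output reflects the feature's defaults. If you want a learned per-peer MTU to stick around longer (or never age out), the command takes optional keywords; I didn't test these in this lab, but the syntax should be along these lines:
 interface Tunnel0  
  tunnel path-mtu-discovery age-timer infinite  
 end  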

Because R1 only reduced its per-peer MTU, but didn't alert R6 about the problem, a second ping from R6 is required to provoke the 'frag needed' message from R1, this time based on its newly learned 1376-byte MTU toward R4's NBMA address:
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 M  
 Success rate is 0 percent (0/1)  
 R6#  
 *Aug 15 16:50:38.999: ICMP: dst (6.6.6.6) frag. needed and DF set unreachable rcv from 192.168.6.1  
 R6#  

We can still send un-fragmentable 1400-byte packets from R6 to R3:
 R6#ping 3.3.3.3 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 3.3.3.3, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms  

Now, I would like to be using this feature to discover end-to-end tunnel MTU for some IP multicast traffic on DMVPN, but for some reason my DMVPN interface doesn't generate unreachables in response to multicast packets. They're just dropped silently. Feels like a bug. Not sure what I'm missing.

Update: What I was missing is that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.