Wednesday, October 26, 2011

Null routes: Float or sink?

I've seen a configuration on customer networks that I just don't understand.

Let's say that some routing tier has been allocated the network block. Because only some of that space is actually in use, a null route has been added so that traffic for unused portions of the space doesn't wander the network aimlessly:
Dist-A#show ip route
Routing entry for
Known via "static", distance 1, metric 0 (connected)
Redistributing via ospf 1
Advertised by ospf 1 subnets
Routing Descriptor Blocks:
* directly connected, via Null0
Route metric is 0, traffic share count is 1

So far so good, but look what's on Dist-B:

Dist-B#show ip route
Routing entry for
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 1
Last update from on FastEthernet0/0, 00:24:56 ago
Routing Descriptor Blocks:
*, from, 00:24:56 ago, via FastEthernet0/0
Route metric is 20, traffic share count is 1

Uh-oh. Dist-B isn't trashing traffic for these unused destinations. It's learned the route from Dist-A via an IGP. Maybe they forgot to null route that block on Dist-B?

Dist-B#show run | include ip route
ip route Null0 254
Okay, I get it. Dist-A null-routes the unused portions of the block. If Dist-A fails, then Dist-B's floating (AD of 254) route gets installed so that Dist-B can null-route the unused space.
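To make the floating behavior concrete, here's a minimal sketch of how a router picks among competing sources for the same prefix: lowest administrative distance wins. The dictionaries and the best_route helper are made up for illustration. Only the distances (110 for OSPF, 254 for the floating static) come from the output above.

```python
def best_route(candidates):
    """Return the candidate with the lowest administrative distance, or None."""
    return min(candidates, key=lambda route: route["ad"], default=None)

# Two candidate sources for the same prefix on Dist-B.
ospf_external = {"source": "ospf 1", "ad": 110, "next_hop": "Dist-A"}
floating_null = {"source": "static", "ad": 254, "next_hop": "Null0"}

# Normal operation: Dist-B hears the block from Dist-A, so the OSPF route wins.
normal = best_route([ospf_external, floating_null])

# Dist-A fails: the OSPF route is withdrawn and the floating static is installed.
after_failure = best_route([floating_null])

print(normal["next_hop"], after_failure["next_hop"])
```

With a normal (AD of 1) static on Dist-B, the static would win in both cases, which is the alternative I'm asking about below.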

I'm sure it works fine, but I don't understand why the floatiness is useful. Someone had to design this and type it in. I would like to understand the motivation behind it. Why not just add a normal (AD of 1) null route to both distribution routers?

Is there an advantage to doing things this way? What is it?

Wednesday, October 12, 2011

Redistribution: When adding routes removes routes

Yesterday's comparison of IOS and NX-OS redistribution got me thinking about an IOS-specific redistribution gotcha.  It's one that makes the NX-OS logic look a lot more sensible than the IOS logic that we're accustomed to.

Let's say we're using the same network as yesterday, and redistributing from EIGRP to OSPF:
Redistributing EIGRP -> OSPF
Because R2 is an IOS router, it is sending EIGRP routes (Subnet A) and its connected interfaces running EIGRP (Subnet B and S) into OSPF.  R3 has a route to every segment except R2's loopback interface, which isn't running any routing protocol.
So, let's add redistribution for Subnet L on R2:

route-map Connected->OSPF permit 10
 match interface Loopback 0
route-map Connected->OSPF deny 20

router ospf 1
 redistribute connected subnets route-map Connected->OSPF

What happened?  R3 had three OSPF routes, now it only has two!

Subnet L was introduced to OSPF correctly, but the redistribute connected statement has overridden redistribution of Subnet S and Subnet B, so they've been withdrawn.

As "connected interfaces running EIGRP",  subnets S and B had been introduced to OSPF by the redistribution of EIGRP, but our new redistribution directive (connected interfaces, but only if they're Loopback 0) has trumped redistribute eigrp's claim on them.  They're now withdrawn from OSPF.  Bummer.
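Here's a toy model of the behavior described above, just to show the logic. It is not how IOS is implemented. The function name and its arguments are hypothetical, and the route-map is reduced to a simple predicate.

```python
def ios_ospf_externals(eigrp_learned, eigrp_interfaces, connected, connected_filter=None):
    """Toy model of which prefixes an IOS router redistributes into OSPF.

    eigrp_learned:    prefixes learned via EIGRP
    eigrp_interfaces: connected prefixes on interfaces running EIGRP
    connected:        every connected prefix on the router
    connected_filter: stand-in for the route-map on 'redistribute connected',
                      or None when that statement isn't configured
    """
    routes = set(eigrp_learned)
    if connected_filter is None:
        # 'redistribute eigrp' alone also carries the EIGRP-enabled interfaces.
        routes |= set(eigrp_interfaces)
    else:
        # 'redistribute connected' now owns every connected prefix.
        routes |= {p for p in connected if connected_filter(p)}
    return routes

learned, eigrp_ifs, all_connected = {"A"}, {"B", "S"}, {"B", "S", "L"}

before = ios_ospf_externals(learned, eigrp_ifs, all_connected)
after = ios_ospf_externals(learned, eigrp_ifs, all_connected, lambda p: p == "L")
widened = ios_ospf_externals(learned, eigrp_ifs, all_connected, lambda p: p in {"L", "B", "S"})

print(sorted(before), sorted(after), sorted(widened))
```

In this model, widening the filter so that it also permits Subnets B and S brings them back. On a real router that would mean matching those interfaces in the route-map as well. I haven't shown that configuration here, so treat it as an assumption.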

Monday, October 10, 2011

Redistribution works differently on NX-OS

There are (at least?) two key differences between redistributing routes in NX-OS and IOS.  A customer had an untimely (2:00 AM) encounter with one of them recently, so I thought I'd spread the word about these gotchas...

You Must Use a Route-Map
NX-OS won't redistribute routes uncontrolled in the same way that IOS will.  In fact, I don't think you can even complete the 'redistribute ' command without referencing a route-map.  You should be using a route-map anyway, so it shouldn't matter.  I've been working with the N7K for about 3 years, and didn't know about this requirement until I saw it in the documentation recently.  This is probably a good sign that I'm always using route-maps :-)

Redistribution Doesn't Include Interfaces
In IOS, when we redistribute from one protocol to another, we get two types of information in the destination protocol:
  1. Routes learned by that protocol, and used by the routing table.  Basically, it's the same set of data returned by show ip route <protocol>
  2. Prefixes associated with any "connected" interfaces running the routing protocol.
So, consider the following network:
Sample Network 1
Redistribution of EIGRP 100 into OSPF on an IOS router would result in three OSPF external routes being learned on R3:
  1. Subnet A would be learned according to rule "1" above.
  2. Subnet B and Subnet S would be learned according to rule "2" above.
NX-OS is different.  It follows only the first rule above: it doesn't include interfaces running the source protocol.

If R2 were an NX-OS router, the only external prefix that would be introduced to the OSPF domain is Subnet A.  In order to get all of the EIGRP subnets into OSPF, you'd need to use redistribute direct in addition to redistribute eigrp.  Note that in addition to a route-map being required, a route-map is desired to emulate the IOS behavior.  Wide-open redistribution on R2 will include Subnet L, which an IOS router would not have introduced to OSPF because neither EIGRP nor OSPF is running on the Loopback interface.
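The two rules, and the difference between the platforms, can be sketched in a few lines. This is a simplified model with hypothetical function names. The subnet letters are the ones from the diagram, and the route-map on redistribute direct is reduced to a predicate.

```python
def redistribute_eigrp(platform, learned, eigrp_interfaces):
    """Prefixes that 'redistribute eigrp' hands to OSPF on each platform."""
    if platform == "ios":
        return set(learned) | set(eigrp_interfaces)   # rule 1 plus rule 2
    if platform == "nxos":
        return set(learned)                           # rule 1 only
    raise ValueError(f"unknown platform: {platform}")

def redistribute_direct(connected, route_map=lambda prefix: True):
    """Prefixes that NX-OS 'redistribute direct' hands to OSPF, after filtering."""
    return {p for p in connected if route_map(p)}

learned, eigrp_ifs, connected = {"A"}, {"B", "S"}, {"B", "S", "L"}

ios = redistribute_eigrp("ios", learned, eigrp_ifs)
nxos = redistribute_eigrp("nxos", learned, eigrp_ifs)

# Wide-open 'redistribute direct' drags Subnet L along with B and S.
nxos_wide_open = nxos | redistribute_direct(connected)

# A route-map that permits only the EIGRP-facing subnets emulates IOS.
nxos_filtered = nxos | redistribute_direct(connected, lambda p: p in eigrp_ifs)

print(sorted(ios), sorted(nxos), sorted(nxos_wide_open), sorted(nxos_filtered))
```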

NX-OS's rejection of the 'interfaces running the source protocol' logic can have some surprising consequences.  Let's say we redistribute OSPF into EIGRP in the following network:

Sample Network 2

EIGRP will learn precisely 3 new routes:
The networks used for OSPF adjacency won't be introduced into EIGRP by NX-OS routers R1 and R2 because these prefixes are considered "direct" routes.

But what happens if an interface fails?
Sample Network 2 With Interface Failure
Suddenly, the prefix is introduced to the EIGRP process.  This happens because it is no longer a 'direct' route from R1's perspective.  R1 has learned the path to this network from R2 over their connection.  Because it is now an 'OSPF route' rather than a 'direct route via an interface running OSPF', it's eligible for redistribution into EIGRP.

If the route wasn't there before, you probably don't want it appearing when things start to break.  It's a good thing that (required) route-map is in place! :-)
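Here's a small sketch of why the prefix appears only after the failure. R1's routing table is modeled as a dictionary from prefix to route source. The prefix names are placeholders I made up, not the addresses from the diagram.

```python
def nxos_redistribute_ospf(rib, route_map=lambda prefix: True):
    """Prefixes NX-OS would hand from OSPF to EIGRP: OSPF-learned routes only."""
    return {p for p, source in rib.items() if source == "ospf" and route_map(p)}

# Before the failure, R1 is directly attached to the adjacency link.
rib_before = {"adjacency-link": "direct", "remote-net": "ospf"}

# After R1's interface fails, R1 relearns the same prefix from R2 via OSPF.
rib_after = {"adjacency-link": "ospf", "remote-net": "ospf"}

before = nxos_redistribute_ospf(rib_before)
after = nxos_redistribute_ospf(rib_after)

# A route-map that never permitted the adjacency link keeps it out either way.
guarded = nxos_redistribute_ospf(rib_after, lambda p: p != "adjacency-link")

print(sorted(before), sorted(after), sorted(guarded))
```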

Friday, October 7, 2011

VMware vSwitch Nuggets for Network Admins

I've written previously about the promiscuous behavior of pNICs on ESX servers.  It turns out that the behavior of the vNICs and vSwitches is just as interesting.  Here are a few details that might not be obvious to the network-minded.  The last one is the coolest detail, and it's what got me looking into vSwitches today.

No L2 Address Learning - The vSwitch is an L2 switch, but it doesn't learn MAC addresses the way normal L2 switches do.  Because ESX created the vNIC that's plugged (virtually) into each vSwitch port, ESX knows the hardware addresses used by each vNIC, and the vSwitch port it's "plugged" into. So there's no reason to learn MAC addresses.

L2 Address Updates - On many real NICs, the "burned in" MAC address is stored in ROM, but doesn't have to be used by the NIC driver.  It's more of a suggestion.  The driver can load any valid unicast MAC address into the NIC's unicast filtering registers.

Because the vSwitch knows the "burned in" (really it's a value stored in a configuration file) address, the vSwitch can stop the driver from changing its unicast MAC address.  ...Or at least, fail to deliver packets for the updated address.  This behavior is configurable.

L2 Address Spoofing - In addition to refusing to deliver mis-addressed frames to a guest, the vSwitch can refuse admittance to frames generated by the guest unless the frame headers display the expected source MAC address.  Address spoofing like this has numerous savory and unsavory purposes, so the behavior is configurable.
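The three behaviors above (no learning, address updates, address spoofing) can be pictured with a toy model of a single vSwitch port. This is only a sketch. The class, the flag names and the MAC addresses are made up for illustration, and the real checks may differ in detail.

```python
from dataclasses import dataclass

@dataclass
class VPort:
    configured_mac: str                   # address from the VM's configuration file
    effective_mac: str                    # address the guest driver is actually using
    allow_mac_changes: bool = False       # hypothetical flag: honor driver updates
    allow_forged_transmits: bool = False  # hypothetical flag: skip the source check

    def expected_mac(self):
        """The address the vSwitch associates with this port. Nothing is learned."""
        return self.effective_mac if self.allow_mac_changes else self.configured_mac

    def delivers(self, dst_mac):
        """Would a unicast frame for dst_mac be handed to this guest?"""
        return dst_mac == self.expected_mac()

    def admits(self, src_mac):
        """Would a frame sent by the guest with src_mac be accepted?"""
        return self.allow_forged_transmits or src_mac == self.expected_mac()

strict = VPort("00:50:56:00:00:01", effective_mac="02:00:00:00:00:99")
relaxed = VPort("00:50:56:00:00:01", effective_mac="02:00:00:00:00:99",
                allow_mac_changes=True, allow_forged_transmits=True)

print(strict.delivers("02:00:00:00:00:99"), relaxed.delivers("02:00:00:00:00:99"))
```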

IGMP Snooping Not Required - IGMP snooping on a physical L2 switch is a traffic suppression mechanism: only deliver IPv4 multicast frames to ports with subscribers attached.  The mechanism relies on L2 switches intercepting (and sometimes dropping) IGMP traffic between routers and hosts.  When a host subscribes to a multicast group, the host driver programs the NIC to pass the appropriate multicast frames.  Because a virtual machine's NIC is really an ESX software component, ESX (and thus the vSwitch) already knows which ports are subscribed to which multicast groups, so the hassle of IGMP snooping isn't necessary.

Automagic Port Monitor Mode - This feature is usually described as a mechanism to "Stop promiscuous mode" or somesuch, but I think that description glosses over how cool this is.  Yes, you can stop a server from sniffing traffic not destined for it, but real L2 switches can do that too.  The neat thing here is that when you want to sniff traffic, you don't have to enable a mirror feature on the switch. vSwitches don't have a port mirroring lever for the same reason they don't need to do IGMP snooping.  If you enable promiscuous mode on a VM's vNIC, ESX knows you've done it, and can convert the virtual port into a mirror port automatically.

pSwitch MAC Move Update - When a virtual machine migrates from one host to another (VMotion), all of the vSwitches know exactly what just happened.  They don't need to be told to update their L2 forwarding tables.  The upstream switches, on the other hand, don't know that a particular MAC address has moved, and they won't know until a frame sourced from that MAC address shows up on an unexpected port.  This isn't much of a problem for chatty servers, but what if it's a mostly quiet system like a syslog server?  The forwarding tables on upstream pSwitches could misdirect traffic for minutes.  VMotion works around this problem by sending a spoofed broadcast frame, apparently from the guest.  The frame floods throughout the broadcast domain, updating L2 forwarding tables on every bridge.  This behavior is nearly identical to how the same problem is handled by Cisco switches configured with the FlexLinks feature.
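The stale-table problem is easy to see in a minimal model of a learning bridge. This is a sketch, not any vendor's implementation. The class, the port numbers and the MAC address are made up.

```python
class LearningSwitch:
    """Bare-bones physical L2 switch: learn the source MAC on every frame."""

    def __init__(self):
        self.table = {}

    def receive(self, src_mac, in_port):
        # Learning happens on the source address, whatever the destination is.
        self.table[src_mac] = in_port

    def out_port(self, dst_mac):
        # None stands for "unknown unicast, flood it".
        return self.table.get(dst_mac)

GUEST = "00:50:56:00:00:01"
pswitch = LearningSwitch()

pswitch.receive(GUEST, in_port=1)      # guest last spoke from the old host
stale = pswitch.out_port(GUEST)        # guest moves and stays quiet: still port 1

pswitch.receive(GUEST, in_port=2)      # spoofed broadcast arrives via the new host
fresh = pswitch.out_port(GUEST)        # table now points at the new port

print(stale, fresh)
```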

pSwitch IGMP Snooping Update - If your VMotioned guest is subscribed to an IPv4 multicast stream, the L2 switches are going to have a second forwarding table problem:  They've got an entry mapping the multicast stream to the old ESX server port, but won't know to add the new ESX host's port to the list until an IGMP host report ingresses on the new port.  This is a stickier problem than the previous one, because of the 32:1 overlap of IPv4 multicast groups to group MAC addresses.  ESX knows the MAC address (or worse, the multicast hash bucket) that the client's NIC driver has unfiltered, but it can't know exactly which IPv4 multicast group the client wants to receive.  Without knowing the correct IPv4 multicast address, ESX can't spoof an IGMP host report from the client.  VMware solves this problem by tricking the client into sending his own host reports:  It spoofs an IGMP query from the router, destined to the guest.  On receipt of the spoofed query, the guest then waits for up to the specified query-max-response-time before sending a correctly formatted host report.
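The 32:1 overlap comes from the standard mapping of IPv4 multicast groups onto Ethernet addresses: only the low 23 bits of the group are copied into the 01:00:5e MAC prefix, so 5 of the 28 significant group bits are lost. A quick sketch (the function name and the sample groups are mine):

```python
import ipaddress

def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet group MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF             # keep only the low 23 bits
    mac = (0x01005E << 24) | low23           # prepend the 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

# Three different groups that collapse onto the same MAC address.
groups = ["224.1.1.1", "225.1.1.1", "224.129.1.1"]
macs = {multicast_mac(g) for g in groups}

# 28 significant group bits minus 23 mapped bits leaves 5 ambiguous bits.
groups_per_mac = 2 ** (28 - 23)

print(macs, groups_per_mac)
```

Going the other direction is the problem ESX faces: given only the MAC address the guest has unfiltered, there are 32 candidate groups and no way to pick the right one.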

I'm not in a position to collect an ESX-spoofed IGMP query, but I'd really like to see one.  In particular, I'm curious about the IGMP query-max-response-time, and the source IP address used in these spoofed queries.  If you can catch one of these spoofed queries, please share it with me!