Thursday, December 3, 2015

Protocol Spotlight: DLEP

Dynamic Link Exchange Protocol (DLEP) is a mechanism by which link layer devices (typically radio modems) communicate neighbor reachability information to the IP routers using those radios.

Radio interfaces are frequently variable sub-rate interfaces. Path selection is a huge challenge with this sort of handoff, because not only is the available bandwidth less than the speed of the handoff interface, it's also a moving target that shifts with RF conditions from moment to moment. DLEP provides a flexible framework for communicating link performance and other parameters to the router so that it can make good path selection decisions.

It's obviously handy for point-to-point links, but that's not where it gets really interesting.

Consider the following network topology:


We have four routers sharing a broadcast network (10.0.0.0/24), each with a satellite backup link. Simple stuff, right?

But what if that 10.0.0.0/24 network isn't an Ethernet segment at all, but an ad-hoc mesh of microwave radio modems, with the routers scattered among various vehicles, drones and robots?


The radios know the topology of the mesh in real time, but the routers plugged into those radios do not.

Wasting microwave bandwidth on BFD packets would be silly: BFD can't tell you about link quality, nor about the number of mesh hops between any two devices.

DLEP solves this problem by allowing each router to learn about the reachability of its neighbors (other routers) over the mesh in real time, from the mesh radios.

The Cisco implementation of DLEP in ESR routers works a lot like BFD. Just like BFD, DLEP is a service subscribed to by routing protocols. When DLEP informs the routing protocol about a reachability change, the routing protocol can reconverge immediately on a per neighbor basis. This way, when the lead truck in the convoy (containing R1) crests a hill, other trucks will switch to SATCOM for the 10.0.1.0/24 prefix without waiting for R1's dead timer to expire.

The pictures in this article came from Silvus Technologies because these are the only mesh modems I've played with. Last I talked to them, Silvus didn't support DLEP, but they appear to be working on it.

Tuesday, November 10, 2015

Anycast For DMVPN Hubs

Dynamic assignment of DMVPN spoke tunnel addresses isn't just a matter of convenience. It provided the foundation for a recent design which included the following fun requirements:
  • There are many hub sites.
  • Spokes will be network-near exactly one hub site.
  • Latency between hub sites is high.
  • Bandwidth between hub sites is low.
  • Spoke routers don't know where they are in the network.
  • Spoke routers must connect only to the nearest hub.
The underlay topology in this environment1 made it safe for me to anycast the DMVPN hubs, so that's what I did. This made the "connect to the nearest hub" problem easy to solve, but introduced some new complexity.

Hub Anycast Interface
Each DMVPN hub router has a loopback interface with address 192.0.2.0/32 assigned to the front-door VRF. It's configured something like this:

 interface loopback 192020
  description DMVPN hub anycast target  
  ip vrf forwarding LTE_TRANSIT  
  ip address 192.0.2.0 255.255.255.255  

The 192.0.2.0 /32 prefix was redistributed into the IP backbone. If this device were to fail, then the next-nearest instance of 192.0.2.0 would be selected by the IGP.

Spoke Configuration
Spokes look pretty much exactly like the ones in the DMVPN via DHCP post. They're all looking for a hub at 192.0.2.0. The only interesting bits have to do with BGP neighbors (also anycast.) I'll get to those later.

Hub DMVPN Interface
Each hub ran a separate IP subnet on its DMVPN interface. This means that I needed several of the static interface routes for DHCP as described at the end of the previous post. One of them made DHCP work, while the rest were superfluous, but at least they were correct.

The hub's DMVPN interface sources the tunnel from this loopback and uses the LTE_TRANSIT VRF for tunneled traffic. The hub in this example uses 203.0.113.1/27:

 interface tunnel 192020  
  ip address 203.0.113.1 255.255.255.224
  tunnel source loopback 192020  
  tunnel vrf LTE_TRANSIT  

Hub BGP Interface
The hub/spoke routing protocol in this environment is eBGP for various reasons. Ordinarily I'd have the spokes talk to the hub using the hub's address in the DMVPN subnet. That's not possible here because the spoke doesn't know the hub's address: it doesn't know which hub it's using. Anycast to the rescue again!

The hubs each have a loopback at 198.51.100.0/32 in the global routing table (GRT is used for DMVPN - no IVRF here). Spokes are configured to look for a BGP neighbor at this address. There's a problem here: the spoke's BGP neighbor isn't directly connected, and the spoke doesn't yet know how to reach it.

 interface loopback 198511000  
  description anycast target for spoke BGP sessions  
  ip address 198.51.100.0 255.255.255.255  
!
 router bgp 65000  
  bgp listen range 203.0.113.0/27 peer-group PG_SPOKES  

DHCP Service
Each hub has a DHCP pool configured to represent the local DMVPN interface. For example, this router is 203.0.113.1/27, so it uses the following pool configuration:

 ip dhcp pool DMVPN_POOL  
  network 203.0.113.0 255.255.255.224  
  option 121 hex 20c6.3364.00cb.0071.01  

Option 121 specifies a classless static route to be used by DHCP clients. This is the mechanism by which the spokes find their BGP neighbor at 198.51.100.0. Breaking down the hex, we have:
  • 0x20 = 32 -- This is the prefix length of the route. A host route.
  • 0xc6336400 = 198.51.100.0 -- This is the target prefix of the route, as bounded by the length above. Note that this field is not always 4 bytes long. The RFC authors did this cute/maddening thing where the length of this field depends on the value of the prefix length byte. Ugh.
  • 0xcb007101 = 203.0.113.1 -- The next-hop IP for the route. Hey, that's this hub's tunnel address!
Now, no matter which hub a spoke attaches to, the spoke will find an off-net BGP neighbor at 198.51.100.0, and will have a static route (assigned by DHCP via option 121) to ensure reachability to that neighbor.
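
If you'd rather not work out that hex by hand, here's a quick sketch (standard-library Python, my own helper, not anything Cisco provides) that builds the RFC 3442 route descriptor and formats it the way IOS wants it after option 121 hex:

 import ipaddress

 def encode_option_121(destination, next_hop):
     """Encode one RFC 3442 classless static route as the dotted hex
     string IOS expects after 'option 121 hex'."""
     net = ipaddress.ip_network(destination)
     hop = ipaddress.ip_address(next_hop)
     significant = (net.prefixlen + 7) // 8   # only the significant octets of the prefix are encoded
     raw = bytes([net.prefixlen]) + net.network_address.packed[:significant] + hop.packed
     hexstr = raw.hex()
     return '.'.join(hexstr[i:i + 4] for i in range(0, len(hexstr), 4))

 print(encode_option_121('198.51.100.0/32', '203.0.113.1'))   # 20c6.3364.00cb.0071.01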

Spoke BGP
The spoke routers use the 'disable-connected-check' feature to talk to the BGP anycast interface on the hub while still using TTL=1:
 router bgp 65001  
  neighbor 198.51.100.0 remote-as 65000  
  neighbor 198.51.100.0 disable-connected-check  
  neighbor 198.51.100.0 update-source Tunnel0  

Remaining challenge
The spokes are behind LTE-enabled NAT routers because there's no Cisco hardware available with the correct LTE bands.

Ordinarily, the LTE-assigned IP address won't change with mobility, but it does change if the EPC which owns the client's address is shut down. In those cases, I found the spokes re-established connections with the now-nearest DMVPN hub right away, but the spoke's tunnel interface held onto the old DHCP address.

The if-state nhrp command might have taken care of this2, but I've had some bad experiences and don't entirely trust it. I used EEM instead:

 event manager applet BGP_NEIGHBOR_198.51.100.0_DOWN  
  event snmp oid 1.3.6.1.2.1.15.3.1.2.198.51.100.0 get-type exact entry-op ne entry-val "6" exit-op eq exit-val "6" poll-interval 10  
  action 1.0  syslog priority errors msg "BGP with 198.51.100.0 down, bouncing Tunnel 0..."  
  action 2.0  cli command "enable"  
  action 2.1  cli command "configure terminal"  
  action 3.0  cli command "interface tunnel 0"  
  action 3.1  cli command "shutdown"  
  action 4.0  cli command "do clear crypto isakmp sa"  
  action 4.1  cli command "do clear crypto ipsec"  
  action 5.0  cli command "interface tunnel 0"  
  action 5.1  cli command "no shutdown"  
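
Incidentally, the OID the applet watches is bgpPeerState from the BGP4-MIB, indexed by the peer address, where 6 means established. Here's a sketch of checking the same value from a management station with pysnmp; the community string and target address are placeholders:

 from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                           ContextData, ObjectType, ObjectIdentity, getCmd)

 # bgpPeerState for peer 198.51.100.0; 6 == established
 oid = '1.3.6.1.2.1.15.3.1.2.198.51.100.0'

 errorIndication, errorStatus, errorIndex, varBinds = next(
     getCmd(SnmpEngine(),
            CommunityData('public'),                    # placeholder community
            UdpTransportTarget(('192.0.2.99', 161)),    # placeholder spoke address
            ContextData(),
            ObjectType(ObjectIdentity(oid))))

 for name, value in varBinds:
     print(name, '=', value)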


1 Spoke routers were wireless devices on a private band LTE network. Hub routers were physically located with LTE equipment, very close to where the LTE Evolved Packet Core hands off to the IP network. There's no opportunity for the "nearest" DMVPN hub to change from one site to another without the spoke losing its IP address on the LTE network.

2 Actually, I'm completely unsure what if-state nhrp will do with a dynamically assigned tunnel address. DHCP can't happen until the interface comes up, and the interface can't come up without NHRP registration, which requires an address... 

Monday, November 9, 2015

OpenSwitch: Exciting Stuff

It was about a month ago that HP (along with several partners) announced OpenSwitch, a new network OS for white box switching hardware.

This week, HPE brought OpenSwitch Chief Architect Michael Zayats to present to TFDx delegates at the ONUG conference. I was fortunate to be one of these delegates and the usual disclaimers apply.

What is OpenSwitch?
It's an open source network OS for whitebox switching platforms. The code is open, and so is the development process. They're actively encouraging people to get involved. Coordination is done over IRC, bug tracking is open, documentation is available for edit, etc... Open. Open. Open.

Who is behind OpenSwitch?
Well, first there's the vendor consortium. To a large degree, it's that new company with the boxy logo: HPE. They employ the chief architect and a handful of developers. There are some other vendors, notably Broadcom (without whom this couldn't happen because of their NDA policies around silicon drivers), switch manufacturers (ODMs), etc...

Also of critical importance are the users: There are already some large end-user companies playing with, using, and contributing to OpenSwitch.

Wait, how many OSes is HPE shipping/supporting now?
Yeah... Awkward! That's a couple of versions of Comware, Provision, FASTPATH, plus whatever's going on inside their VirtualConnect / Flex-10 gear. It seems like a lot.

Look, it's not a Cisco-style pile of OSes, but each of Cisco's OSes has an easy-to-understand origin story that begins with either an acquisition or with solving a product problem. Choosing to produce a new OS just because, and then giving it away, is something new.

So... Why did HP get this bandwagon rolling?
<speculation>Well, anything that hurts Cisco is good for HP, right?</speculation>

Tell me more about OpenSwitch?
Following are some of the things that stood out from Michael's presentation:

It's a box-stock Linux kernel. There are no OpenSwitch add-ons to the kernel.

OpenSwitch can do "SDN" (whatever that means to you), but it doesn't have to. It can also look and feel like a traditional network OS with a familiar-looking CLI, a one-stop configuration file, no ambiguity about expected behavior when elements get removed from the configuration, etc... This distinguishes it from Cumulus Linux, which I really like, but which isn't well suited to hands-on configuration by network engineers expecting a legacy OS experience.

In addition to operating like a traditional CLI-based device, OpenSwitch has standardized interfaces at every layer for programmatic control by almost anything:
  • JSON configuration interface behind the CLI
  • RFC7047 (OVSDB) interface between all modules, internal and external
  • OpenFlow module (which speaks to OVSDB)
And because those interfaces are standardized, if the crazy interface you require isn't offered, you can add it.
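
Because the management plane is just RFC 7047 JSON-RPC, poking at it doesn't require any special tooling. A minimal sketch in Python, assuming the OVSDB server is listening on the conventional TCP port 6640 (a real client would keep reading until the JSON object is complete):

 import json, socket

 # RFC 7047 'list_dbs': ask the OVSDB server which databases it serves.
 req = {"method": "list_dbs", "params": [], "id": 0}

 s = socket.create_connection(('192.0.2.10', 6640))    # placeholder switch address
 s.sendall(json.dumps(req).encode())
 print(json.loads(s.recv(65536)))                      # e.g. {"id": 0, "result": [...], "error": null}
 s.close()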

OpenSwitch is kind of like a hardware-accelerated version of OvS: It has a kernel dedicated to running only OvS, it runs in a sheetmetal box, and it sports physical network interfaces connected to a dedicated forwarding ASIC. Pretty nifty.

Unlike Cumulus Linux, all of the OpenSwitch physical interfaces are assigned to a dedicated kernel namespace.

Every software module in OpenSwitch talks to OVSDB, including the routing protocols (bgpd, ospfd, ripd, etc...) Rather than use the traditional Quagga interprocess mechanism, in which the protocol daemons talk directly to zebra, OpenSwitch moved things around so that the routing protocols publish into OVSDB. Quagga hears from the routing protocols via the database, makes its selections, and publishes the resulting "quagga FIB" back into OVSDB.

Frankly, this "everything talks to the database" business looks a lot like Arista's architecture. I think that OVSDB only keeps current info, rather than the Arista's re-playable journal/ledger scheme. Still, this is pretty cool, and represents a huge improvement over monolithic architectures.

The biggest current bummer
There's still a bit of a problem with Broadcom and NDA stuff. As I understand it, below OVSDB is an ASIC-independent driver layer (open source), an ASIC-dependent driver layer (also open source), and a binary-only ASIC-dependent layer (binary blob produced by a Broadcom NDA-holder).

No big surprises there, nobody expected Broadcom to open everything. The problem is that something inside the binary blob is responsible for configuring the ASIC to run in the particular platform. Stuff like the mapping of SERDES lanes onto switch ports happens here. This means that you can't throw together an OpenSwitch distribution for any old platform. Only a Broadcom partner can do that. This is a bummer. Broadcom needs to pull this bit out from under the NDA blanket so that people can run OpenSwitch on any platform, not just the ones a Broadcom partner agrees to compile for.

Friday, October 16, 2015

Musings on Datanauts #9

I listened to episode 9 of the excellent Datanauts podcast with Ethan Banks and Chris Wahl recently.

Great job with this one, guys. I can tell how engaged I am in a podcast by how often I want to interrupt you :)

For this episode, that was lots of times!

Since I couldn't engage during the podcast, I'm going to have a one-sided discussion here, about the topics that grabbed my attention.

RARP?
Chris explained that the 'notify switches' feature of an ESXi vSwitch serves to update the L2 filtering table on upstream physical switches. This is necessary any time a VM moves from one physical link (or host) to another.

Updating the tables in all of the physical switches in the broadcast domain can be accomplished with any frame that meets the following criteria:

  • Sourced from the VM's MAC address
  • Destined for an L2 address that will flood throughout the broadcast domain
  • Specifies an Ethertype that the L2 switches are willing to forward
VMware chose to do it with a RARP frame, probably because it's easy to spoof, and shouldn't hurt anything. What's RARP? It's literally Reverse ARP. Instead of a normal ARP query, which asks: "Who has IP x.x.x.x?" RARP's question is: "Who am I?"
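
Spoofing one of these is about as simple as frame-building gets, which is presumably the point: the hypervisor only needs the vNIC's MAC address. Here's a rough Python sketch (Linux AF_PACKET raw socket, needs root; the MAC and interface name are placeholders). I'm not claiming this is byte-for-byte what ESXi sends, just a frame that satisfies the three criteria above:

 import socket, struct

 def build_rarp_notify(vm_mac_str):
     """Build a broadcast 'who am I?' RARP frame sourced from the VM's MAC."""
     vm_mac = bytes.fromhex(vm_mac_str.replace(':', ''))
     eth = b'\xff' * 6 + vm_mac + struct.pack('!H', 0x8035)   # dst, src, EtherType RARP
     rarp = struct.pack('!HHBBH', 1, 0x0800, 6, 4, 3)         # htype, ptype, hlen, plen, op=3 (request reverse)
     rarp += vm_mac + b'\x00' * 4                             # sender MAC, sender IP (unknown)
     rarp += vm_mac + b'\x00' * 4                             # target MAC, target IP (unknown)
     return eth + rarp

 s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
 s.bind(('eth0', 0))                                          # placeholder interface
 s.send(build_rarp_notify('00:50:56:00:12:34'))               # placeholder MAC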

It's like a much less feature-packed alternative to DHCP:
  • RARP can't be relayed by routers (as far as I know)
  • RARP doesn't share subnet mask (RARP clients use ICMP type 17 code 0 instead)
  • RARP doesn't tell the client about routers, DNS, NTP, or WINS servers, etc...
I used to use RARP with SPARC-based systems: Bootup of diskless workstations, and Jumpstart-based server installs. A decade ago, I even got in.rarpd and ICMP subnet replies, tftp, nfs and all of the other services configured on my macbook so that I could Jumpstart large trading platforms from it. Man, that would have been an epic blog post...

Okay, so why RARP when GARP will do?
The answer has to do with what a hypervisor can reasonably be expected to know about the guest. Sending a RARP is easy, because it only requires knowledge of the vNIC's MAC address. No problem, because that vNIC is part of  ESXi.

Sending a GARP, on the other hand, requires that the sender know the IP address of the guest, which isn't necessarily going to be speedy (or even possible) for a hypervisor to know. Heck, the guest might not even speak IP! Then what?

Hey, what about multicast!
It feels like the guys missed an opportunity to talk about something cool here. Or maybe not, Greg would tell me that nobody cares about multicast.

When a guest moves, ESXi also has to jump through some hoops to un-cork the L2 multicast filters on the new physical port used by the guest. Unlike the case of the Unicast filters, where the hypervisor just knows the guest's MAC address, it can't know the guest's multicast membership, so it can't spoof any upstream messages.

Instead, it spoofs a downstream message to the guest. In it is an IGMP general membership query demanding immediate (response time = 0) information about membership in any multicast groups. The guest responds with its interest and those responses are forwarded out to the physical network where filters get adjusted.
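
An IGMPv2 general query is only 8 bytes, so it's easy to see what that spoofed message amounts to. A hedged Python sketch (raw socket, needs root); note that a proper query is also supposed to carry the IP Router Alert option, which this quick version skips:

 import socket, struct

 def inet_csum(data):
     """Standard Internet checksum."""
     if len(data) % 2:
         data += b'\x00'
     total = sum(struct.unpack('!%dH' % (len(data) // 2), data))
     total = (total >> 16) + (total & 0xffff)
     total += total >> 16
     return ~total & 0xffff

 # Type 0x11 = membership query, max response time 0, group 0.0.0.0 (general query)
 query = struct.pack('!BBH4s', 0x11, 0, 0, socket.inet_aton('0.0.0.0'))
 query = query[:2] + struct.pack('!H', inet_csum(query)) + query[4:]

 s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_IGMP)
 s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
 s.sendto(query, ('224.0.0.1', 0))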

Multiple vSwitches
Chris and Ethan never spelled this out explicitly, but I'm under the impression that running multiple vSwitches (for isolation or whatever reason) requires that you have physical NICs/cables/pSwitch ports for each vSwitch.

If your DMZ subnet, internal networks and storage LANs all live on different physical switches, then you'll need to cable different pNICs to each of these environments. Assigning each of these pNICs to its own vSwitch just seems like good sense in that case.

On the other hand, you probably wouldn't create separate vSwitches just because some notion of isolation feels like a good idea, because doing so requires you to produce pNICs, cabling and pSwitch ports for each of those vSwitches.

Maybe I'm off base here? It felt like Ethan was saying "Oh, so maybe I do want to create lots of vSwitches sometimes..." without addressing the physical side requirements that will come along with that decision.

Hating on LACP
I think the guys missed the boat on this one.

There's a bandwidth argument (or several!) to be made here. Consider LACP's load balancing capabilities:
  • Per-flow load balancing (or at least per src/dst IP pair - not sure about ESXi limitations here)
  • The pool of load-balanced objects includes all flows in/out of the host
  • Balancing decision made on a per-frame basis
Those win out over the migratory per-guest vPort pinning scheme from every angle. The pinning scheme, by contrast, gives you:
  • Per-guest load balancing
  • A single guest can never exceed the speed of a single link
  • The pool of balanced objects is much smaller (guests)
  • Balancing decision made at long intervals
  • A slightly clunky migration scheme (traffic, particularly IP multicast, will be lost on rebalance in large L2 topologies).
But frankly, the bandwidth and migration reasons aren't the ones that matter most when it comes to making the LACP vs. pinning decision.

I wish that Chris had elaborated on the complications of using LACP on ESXi, because the complications of not using LACP in a sophisticated enterprise environment are substantial. The issues that matter here have to do with network topology, forwarding paths and failure modes in the physical network.

If we're talking about all pNICs linking to a single ToR switch, then it really doesn't matter. LACP and load based pinning schemes will accomplish the same job.

But we're probably talking about an enterprise environment with redundant ToR switches and fancy features like MLAG, and there the tide shifts toward LACP. Heck, the switches are probably already doing MLAG on some other interfaces, and this matters in some failure modes. There are some non-obvious ways to go wrong here:
  • Traffic between guests on two different hosts will take a needlessly long path if their randomly-chosen pNICs land on different physical switches.
  • In a Cisco Nexus world, the pSwitch ports will be known as orphan ports. These are loaded with crazy caveats in various failure modes.
These are not insurmountable obstacles, but they definitely shouldn't be taken lightly. Orphan ports in particular are a nasty can of worms.

Thursday, October 15, 2015

HP Is Shipping Unicorns Now: 10GBASE-T SFP+ Module

It's long been said that we'll never see an SFP+ transceiver for 10GBASE-T media. Too much power, too small a package, too much heat, etc...

I'm not sure that never is quite right. There's this wonderful/horrible contraption:
Dawnray SFP+ module. Photo found here.
It's huge. It's ugly. It's covered with fins, so it must be hot. The data sheet says it consumes 7 Watts. Where's it getting 7W? Not from the SFP+ interface on the switch... Note the power cord attached to the module. It uses a wall wart!

This is not an elegant solution, but 10GBASE-T is hard, and this is the best we've got.

Until now.

/u/asdlkf recently pointed out on reddit that HP have published a data sheet1 for a much more elegant SFP+ module for 10GBASE-T.

There were rumors that this module was going to have a giant heatsink and protrude far beyond the SFP+ slot, but it turns out that's not the case. It looks really good, and it's only a bit longer than some 1000BASE-T modules that I have kicking around the office.

The module uses only 2.3W (no wall wart required, but plugging in lots of them will still tax most switches), but is a bit of a compromise in that it can only push 10GBASE-T 30m (the standard calls for 100m).


I'm not advocating for 10GBASE-T (I suspect Ferro would never speak to me again!). I'd rather use DAC or DAO transceivers for intra-rack links and optical transceivers inter-rack, because they're better than 10GBASE-T in so many ways:
  • Lower power
  • Lower latency
  • Lower bit error rate
  • Smaller cable diameter
  • Lower in-rack cost
  • Longer inter-rack runs
But I'm sure this nifty transceiver will solve some problems. Congratulations HP, for being first to market with a usable option.

1 HP's data sheet also has a funny typo in the Environment section. I may have to wait for global warming to melt the polar ice caps, raising sea level a bit before I can deploy one of these units.

Tuesday, September 29, 2015

Assigning DMVPN tunnel interface addresses with DHCP

I posted previously about some of the inner workings of DHCP. The three key points from that post are critical building blocks for this discussion:
  • DHCP requests get modified in flight by the DHCP relay.
  • DHCP relay determines L2 destination by inspecting contents of relayed packets.
  • DHCP clients, relays and (sometimes) servers use raw sockets because the end-to-end protocol stack isn't yet available.
The basic steps to converting a DMVPN from static address assignment scheme to dynamic are:
  1. Configure a DHCP server. I'm using an external server1 in this example so that we can inspect the relayed packets while they're on the wire.
  2. Configure the hub router. There are some non-intuitive details we'll go over.
  3. Configure the spoke router. Ditto on the non-intuitive bits.
My DHCP server is running on an IOS router (because it's convenient - it could be anywhere) and it has the following configuration:
    1     no ip dhcp conflict logging  
    2     ip dhcp excluded-address 172.16.1.1  
    3     !  
    4     ip dhcp pool DMVPN_POOL  
    5      network 172.16.1.0 255.255.255.0  

So, that's pretty straightforward.

The Hub Router has the following relevant configuration:
    1     ip dhcp support tunnel unicast  
    2     interface Tunnel0  
    3      ip dhcp relay information option-insert   
    4      ip address 172.16.1.1 255.255.255.0  
    5      ip helper-address 172.16.2.2  
    6      no ip redirects  
    7      ip mtu 1400  
    8      ip nhrp authentication blah  
    9      ip nhrp network-id 1  
   10      ip tcp adjust-mss 1360  
   11      tunnel source GigabitEthernet0/0  
   12      tunnel mode gre multipoint  
   13      tunnel vrf INTERNET  
   14      tunnel protection ipsec profile DMVPN_IPSEC_PROFILE  

Only lines 1, 3 and 5 were added when I converted the environment from static spoke tunnel addresses to dynamic addresses.

Line 1: According to the documentation, the ip dhcp support tunnel unicast directive "Configures a spoke-to-hub tunnel to unicast DHCP replies over the DMVPN network." Okay, so the replies are sent directly to the spoke. If you read my last post, you're probably wondering how this works. I promise, it's interesting.

Line 3: Configuring ip dhcp relay information option-insert causes the relay agent to insert the DHCP relay agent information option (option 82) into the client's DHCP packets. We'll look at the contents in a bit.

Line 5: I specified the address of the DHCP server with ip helper-address 172.16.2.2. Nothing unusual about that.

Okay, with those configuration directives, the hub is going to unicast DHCP messages to the client (good, because DMVPN provides a non-broadcast medium). How will the relay agent on the hub know where to send the server's DHCP replies? Last week's example was an Ethernet medium. The Ethernet MAC address appears in the OFFER, so the relay cracked it open in order to know where the OFFER needed to go. This OFFER will also have a MAC address, but it doesn't help. Rather than an un-solved ARP problem, this relay has an un-solved NHRP problem. It needs to know the NBMA (Internet) address of the spoke in order to deliver the OFFER.

So what did the relay stick in option 82 of the client's DISCOVER message, anyway? This:
    1       Option: (82) Agent Information Option  
    2         Length: 13  
    3         Option 82 Suboption: (9) Vendor-Specific Information  
    4           Length: 11  
    5           Enterprise: ciscoSystems (9)  
    6             Data Length: 6  
    7           Value: 11046d7786c2  

So, we've got option 82 with sub-option 9 containing a Cisco-proprietary 6-byte payload. I don't know what 0x11 and 0x04 represent, but I'm guessing it's "NBMA" and "IPv4"2, because the next 4 bytes (0x6d7786c2) spell out the spoke's NBMA address (109.119.134.194).
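
Here's a small Python sketch that decodes the suboption according to that guess (one byte of type, one byte of length, then the IPv4 NBMA address). The layout is my assumption based on the debugs, not anything documented:

 import ipaddress

 # Assumed layout of Cisco's vendor-specific suboption 9 payload:
 # one byte of type (0x11?), one byte of length (0x04), then the IPv4 NBMA address.
 value = bytes.fromhex('11046d7786c2')
 subtype, length = value[0], value[1]
 nbma = ipaddress.ip_address(value[2:2 + length])
 print(hex(subtype), length, nbma)   # 0x11 4 109.119.134.194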

The DMVPN hub / DHCP relay agent jammed the spoke's NBMA address into the DHCP DISCOVER that it relayed on to the DHCP server! The debugs spell it out for us:
    1     DHCPD: adding relay information option.  
    2     DHCPD: Client's NBMA address 109.119.134.194 added tooption 82  

 Now look at what the hub/relay did when the DHCP OFFER came back from the server:
    1     DHCPD: forwarding BOOTREPLY to client 88f0.31f4.8a3a.  
    2     DHCPD: Client's NBMA address is 109.119.134.194  
    3     NHRP: Trying to add temporary cache from external source with (overlay address: 172.16.1.3, NBMA: 109.119.134.194) on interface: (Tunnel0).  
    4     DHCPD: creating NHRP entry (172.16.1.3, 109.119.134.194)  
    5     DHCPD: unicasting BOOTREPLY to client 88f0.31f4.8a3a (172.16.1.3).  
    6     DHCPD: Removing NHRP entry for 172.16.1.3  
    7     NHRP: Trying to delete temporary cache created from external source with nex-hop UNKNOWN on interface: (Tunnel0).  

Just like the Ethernet case, the relay read the client's lower layer address info from the OFFER. Unlike the Ethernet case, the DMVPN relay gleaned this critical information from a field which had been inserted by the relay itself.

Somewhat different from the Ethernet case (which, according to debugs, did not populate the relay's ARP table with info from the OFFER), the DMVPN relay manipulated the NHRP table directly. Pretty cool.

A point about dual-hub environments: All of these DHCP gyrations are only possible because the DMVPN hub and spoke have already keyed the IPSec tunnel, and have an active Security Association with one another. In a dual hub environment, only one of the hubs will be able to talk to the client at this stage. Remember that the DHCP server unicasts the OFFER to the relay agent using the agent's client-facing address (172.16.1.1 in this case). If we have a second hub at, say, 172.16.1.2, it's likely that both hubs will be advertising the tunnel prefix (172.16.1.0/24) into the IGP. Routing tables in nearby routers won't draw a distinction between hub 1 (172.16.1.1) and hub 2 (172.16.1.2) when delivering the OFFER (unicast to 172.16.1.1) from the DHCP server, because both hub routers look like equally good ways to reach the entire 24-bit prefix. It is critical that the IGP has 32-bit routes so that traffic for each hub router's tunnel interface gets delivered directly to the correct box.

The Spoke Router sports the following configuration:
    1     interface Tunnel0  
    2      ip dhcp client broadcast-flag clear  
    3      ip address dhcp  
    4      no ip redirects  
    5      ip mtu 1400  
    6      ip nat enable  
    7      ip nhrp authentication blah  
    8      ip nhrp network-id 1  
    9      ip nhrp nhs dynamic nbma 109.119.134.213 multicast  
   10      ip tcp adjust-mss 1360  
   11      tunnel source FastEthernet4  
   12      tunnel mode gre multipoint  
   13      tunnel vrf INTERNET  
   14      tunnel protection ipsec profile DMVPN_IPSEC_PROFILE  
   15     ip route 172.16.1.0 255.255.255.0 Tunnel0  
   16     ip route 0.0.0.0 0.0.0.0 FastEthernet4 dhcp  

Line 2: This option clears the broadcast flag bit in the DISCOVER's bootp header. This doesn't affect delivery of the DISCOVER, because of the multicast keyword on line 9. Clearing the broadcast flag in the DISCOVER also results in no broadcast bit in the OFFER message which needs to be relayed to the client. This is important because the hub doesn't have an outbound multicast capability: the DHCP OFFER will never get delivered over the NBMA transport if the broadcast bit is set.

Line 3: Pretty self explanatory

Line 9: I'm partial to this method of configuring the NHS (as opposed to two lines: one specifying the tunnel address of the NHS, and one specifying the mapping of NHS tunnel IP to NBMA IP). At any rate, there's a critical detail: the multicast keyword must be present in either configuration. Without it, the spoke won't have an NHRP mapping with which to send the DISCOVER messages upstream.

Line 15: This is a funny one, and might not be required in all environments.

First, consider line 16. There is a default route in the global table via the DHCP-learned next hop in vrf INTERNET. This is here to facilitate split tunneling: Devices behind this DMVPN spoke can access the Internet via overload NAT on interface Fa4.

Next, remember what the OFFER message looks like: It's an IPv4 unicast destined for an address we don't yet own.

Look what comes out FastEthernet 4 if we don't have the route on line 15 in place:
 poetaster:~ chris$ tcpdump -r /tmp/routed.pcapng   
 reading from file /tmp/routed.pcapng, link-type EN10MB (Ethernet)  
 16:25:09.481759 IP 172.16.1.1.bootps > 172.16.1.3.bootpc: BOOTP/DHCP, Reply, length 300  
 16:25:12.677629 IP 172.16.1.1.bootps > 172.16.1.3.bootpc: BOOTP/DHCP, Reply, length 300  
 poetaster:~ chris$  

Those are our DHCP OFFERS, and we're spitting them out toward the Internet!

What's going on here? Well, we've got an IPSec SA up and running with our DMVPN hub. We don't yet have an IP address, but IPSec doesn't care about that. It's not a VPN, it's more of a transport security mechanism, right? An encrypted packet rolls in from the hub, matches the SA, gets decrypted and then... Routed! The DHCP client process never saw it. The OFFER missed the raw socket (or whatever Cisco is doing) mechanism entirely!

Line 16 pushes the wayward OFFER back toward the tunnel interface, where the DHCP client implementation will find it. This route doesn't need to match the tunnel interface exactly, it just needs to be the best match for the address we will be offered. I used a 24-bit route to match the tunnel, but these would have worked too:
  • 172.16.1.3/32 (requires me to know my address in advance)
  • 172.16.0.0/12 (requires that I don't learn a better route to my new IP via some path other than Tu0)
Maybe there's a way to handle the incoming DHCP traffic with PBR? That would be an improvement, because this static route detail is the only place in the configuration which requires the spoke to know anything about the prefix running on the DMVPN transport it'll be using.



1 Cisco's documentation indicates that the DHCP server cannot run on the DMVPN hub. Take that indication with some salt, because (a) that is an IOS XE command guide, and IOS XE doesn't have some of the indicated commands, and (b) running the DHCP server locally seems to work just fine.

Maybe "IPv4 NBMA" and "length" ? Eh. That's why it's called a vendor proprietary option

Friday, September 25, 2015

Cisco DHCP client bummer

It looks to me like the Cisco IOS DHCP client mis-handles the DNS server option when it's working in a VRF.

I'm working on an IOS 15.4 router with an empty startup-config and only the following configuration applied:
 interface FastEthernet4  
  ip address dhcp  
  no shutdown  

debug dhcp detail produces the following when the DHCP lease is claimed:
 Sep 25 19:48:23.316: DHCP: Received a BOOTREP pkt  
 Sep 25 19:48:23.316: DHCP: Scan: Message type: DHCP Offer  
 ...  
 Sep 25 19:48:23.316: DHCP: Scan: DNS Name Server Option: 192.168.100.4  

Indeed, we can resolve DNS. We can also see that the DNS server learned from DHCP has been configured (is there a better way to see this?):
 lab-C881#ping google.com  
 Translating "google.com"...domain server (192.168.100.4) [OK]  
 Type escape sequence to abort.  
 Sending 5, 100-byte ICMP Echos to 205.158.11.53, timeout is 2 seconds:  
 !!!!!  
 Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/8 ms  
 lab-C881#show hosts summary  
 Default domain is fragmentationneeded.net  
 Name/address lookup uses domain service  
 Name servers are 192.168.100.4  
 Cache entries: 5  
 Cache prune timeout: 50  
 lab-C881#  

If I put the interface into a VRF, like this...
 ip vrf INTERNET  
 interface FastEthernet4  
  ip vrf forwarding INTERNET  
  ip address dhcp  
  no shutdown  

Debugs look the same, but we can't find google, and we don't seem to have a DNS server configured:
 lab-C881#ping vrf INTERNET google.com    
 % Unrecognized host or address, or protocol not running.  
 lab-C881#show hosts vrf INTERNET summary  
 lab-C881#  

The global forwarding table has no interfaces up, but it's trying to use the DNS server which is reachable only within the VRF:
 lab-C881#ping google.com    
 Translating "google.com"...domain server (192.168.100.4)  
 % Unrecognized host or address, or protocol not running.  
 lab-C881#show hosts summary  
 Default domain is fragmentationneeded.net  
 Name/address lookup uses domain service  
 Name servers are 192.168.100.4  
 Cache entries: 1  
 Cache prune timeout: 42  

Of course, without any interfaces, attempts to talk to the DNS server from the global table will fail. This is kind of a bummer.

Monday, September 14, 2015

Just some quick points about DHCP

Okay, so everybody knows DHCP pretty well.

I just want to point out a few little details as background for a future post:

DHCP Relays Can Change Things
The first point is about those times when the DHCP client and server aren't on the same segment.

In these cases, a DHCP relay (usually running on a router) scoops up the helpless client's broadcast packets and fires them at the far away DHCP server. The server's replies are sent back to the relay, and the relay transmits them onto the client subnet.

The DHCP relay can change several things when relaying these packets:
  • It increments the bootp hop counter.
  • It populates the relay agent (giaddr) field in the bootp header (the DHCP server uses this to identify the subnet where the client is looking for a lease).
  • It can introduce additional DHCP options to the request.
The last one is particularly interesting. When a DHCP relay adds information to a client message, it can be used by the DHCP server for decision-making or logging purposes. Alternatively, the added information can be used by the DHCP relay itself: Because the relay's addition will be echoed back by the server, the relay can parse information it added to a DISCOVER message when relaying the resulting OFFER message back toward the client.

DHCP Relays Shortcut ARP
Consider the following DHCP offer sent to a client:

1:  14:11:09.966124 a4:93:4c:46:d3:3f (oui Unknown) > 40:6c:8f:38:26:60 (oui Unknown), ethertype IPv4 (0x0800), length 342: (tos 0x0, ttl 255, id 30511, offset 0, flags [none], proto UDP (17), length 328)  
2:    192.168.26.254.bootps > 192.168.26.225.bootpc: BOOTP/DHCP, Reply, length 300, hops 1, xid 0x91b0d169, Flags [none]  
3:        Your-IP 192.168.26.225  
4:        Gateway-IP 192.168.26.254  
5:        Client-Ethernet-Address 40:6c:8f:38:26:60 (oui Unknown)  
6:        Vendor-rfc1048 Extensions  
7:         Magic Cookie 0x63825363  
8:         DHCP-Message Option 53, length 1: Offer  
9:         Server-ID Option 54, length 4: 192.168.5.13  
10:         Lease-Time Option 51, length 4: 7200  
11:         Subnet-Mask Option 1, length 4: 255.255.255.224  
12:         Default-Gateway Option 3, length 4: 192.168.26.254  
13:         Domain-Name-Server Option 6, length 4: localhost  
14:         Domain-Name Option 15, length 5: "a.net"  

Line 2 indicates that it's a unicast IP packet, sent to 192.168.26.225. Line 1 tells us the packet arrived in a unicast Ethernet frame. Nothing too unusual looking about it.

But how did the DHCP relay encapsulate this IP packet into that Ethernet frame? Usually the dMAC seen in the frame header is the result of an ARP query. That can't work in this case: The client won't answer an ARP query for 192.168.26.225 because it doesn't yet own that address.

The encapsulation seen here skipped over ARP. Instead, the relay pulled the dMAC from the DHCP payload (line 5). Nifty.

DHCP Clients use Raw Sockets
Everything we know about learning bridges and interface promiscuity applies here. The IP layer in the client system receives this frame because it was sent to the unicast MAC address and the client's NIC allowed it in.

Line 2 indicates that it's also a unicast IP packet, sent to 192.168.26.225.

But this is the DHCP OFFER. The client doesn't yet own the address to which the offer is sent. In fact, the client doesn't even know that the address is available for its use until it gets to line 3 (a bootp field inside a UDP datagram inside this IP packet). The client won't have actually leased this address until it Requests the address from the server, and the server ACKs the request.

So how can the client receive this IP packet, when it's sent to an address that nobody owns?

The answer is Raw Sockets. Raw sockets give a privileged program the ability to undercut various layers (UDP and IP in this case) of the OS stack, and send/receive messages directly. This is also probably how the DHCP relay encapsulated the offer (skipping ARP) in the first place.

The details of raw socket implementations are specific to the operating system, but usually include the ability to specify an interface and to apply a filter, so that the application doesn't have to process every message on the wire.
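
Here's a rough Python illustration of the receive side (Linux AF_PACKET, needs root; the interface name is a placeholder, and it assumes untagged Ethernet). The frame gets picked apart below the kernel's normal UDP/IP delivery logic, so it doesn't matter that nobody owns the destination address yet, and the client MAC is sitting right there in the payload for a relay to use:

 import socket, struct

 ETH_P_IP = 0x0800
 s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_IP))
 s.bind(('eth0', 0))                            # placeholder interface

 while True:
     frame, _ = s.recvfrom(2048)
     if len(frame) < 24 or frame[23] != 17:     # IP protocol field: keep only UDP
         continue
     ip_hlen = (frame[14] & 0x0f) * 4           # Ethernet header is 14 bytes
     udp = frame[14 + ip_hlen:]
     if len(udp) < 8 or struct.unpack('!H', udp[2:4])[0] != 68:   # UDP dst port bootpc
         continue
     bootp = udp[8:]
     yiaddr = socket.inet_ntoa(bootp[16:20])    # the address being offered
     chaddr = bootp[28:34].hex(':')             # client MAC, straight from the DHCP payload
     print('DHCP reply offering', yiaddr, 'to', chaddr)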

Monday, August 17, 2015

Path MTU Discovery with DMVPN Tunnels

Ivan Pepelnjak's excellent article on IP fragmentation from 2008 is very thorough, but it doesn't cover the functionality of Cisco's tunnel path-mtu-discovery feature when applied to mGRE (DMVPN) interfaces.

I played with it a bit, and was delighted to discover that the dynamic tunnel MTU mechanism operates on a per-NBMA neighbor basis, much the same as ip pim nbma-mode on the same interface type. Both features do all the right things, just like you'd hope they would.

Here's the topology I'm using:
Constrained MTU in path between R1 and R4


The DMVPN tunnel interface on R1 is configured with a 1400-byte MTU. With GRE headers, it will generate packets that can't reach R4. It's also configured with tunnel MTU discovery.
 interface Tunnel0  
  ip address 192.168.1.1 255.255.255.0  
  no ip redirects  
  ip mtu 1400  
  ip pim sparse-mode  
  ip nhrp map multicast dynamic  
  ip nhrp network-id 1  
  tunnel source FastEthernet0/0  
  tunnel mode gre multipoint  
  tunnel path-mtu-discovery  
  tunnel vrf TRANSIT  
 end  

The two spokes are online with NBMA interfaces (tunnel source) using 10.x addressing. Both routers have their NBMA interfaces configured with 1500 byte MTU, and their tunnel MTU set at 1400 bytes:
 R1#show dmvpn  
 Legend: Attrb --> S - Static, D - Dynamic, I - Incomplete  
      N - NATed, L - Local, X - No Socket  
      # Ent --> Number of NHRP entries with same NBMA peer  
      NHS Status: E --> Expecting Replies, R --> Responding  
      UpDn Time --> Up or Down Time for a Tunnel  
 ==========================================================================  
 Interface: Tunnel0, IPv4 NHRP Details   
 Type:Hub, NHRP Peers:2,   
  # Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb  
  ----- --------------- --------------- ----- -------- -----  
    1    10.0.23.3   192.168.1.3  UP 00:18:16   D  
    1    10.0.45.4   192.168.1.4  UP 00:13:23   D  
 R1#  

As Ivan noted, tunnel MTU discovery doesn't happen if the Don't Fragment bit isn't set on the encapsulated packet. If, on the other hand, the DF bit is set, then the DF bit gets copied to the GRE packet's (outer) header. Here we don't set the DF bit, and the ping gets through just fine:
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400  
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 92/92/92 ms  
 R6#  

Those pings will have created 1424-byte packets that don't fit on the link between R2 and R5. Debugs on the target (R4) indicate that the traffic was indeed fragmented in transit:
 *Aug 15 16:33:27.059: IP: s=10.0.12.1 (FastEthernet0/0), d=10.0.45.4 (FastEthernet0/0), len 52, rcvd 3  
 *Aug 15 16:33:27.059: IP: recv fragment from 10.0.12.1 offset 0 bytes  
 *Aug 15 16:33:27.071: IP: s=10.0.12.1 (FastEthernet0/0), d=10.0.45.4 (FastEthernet0/0), len 1392, rcvd 3  
 *Aug 15 16:33:27.071: IP: recv fragment from 10.0.12.1 offset 32 bytes  

52 bytes + 1392 bytes = 1444 bytes. Drop the extra 20 byte IP header from one of those fragments, and we're right at our expected 1424-byte packet size.

So far, no large packets with DF-bit have been sent, so no tunnel MTU discovery has happened. The hub reports dynamic MTU of "0" for the NBMA addresses of both spokes, which I guess means "use the MTU applied to the whole tunnel", which is 1400 bytes in this case:
 R1#show interfaces tunnel 0 | include Path  
  Path MTU Discovery, ager 10 mins, min MTU 92  
  Path destination 10.0.23.3: MTU 0, expires never  
  Path destination 10.0.45.4: MTU 0, expires never  
 R1#  

R6 can ping R3 with an un-fragmentable 1400-byte packet without any problem:
 R6#ping 3.3.3.3 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 3.3.3.3, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 44/44/44 ms  

But when we try this over the constrained path to R4, the ping fails silently. ICMP debugs are on, but no errors rolled in:
 R6#debug ip icmp  
 ICMP packet debugging is on  
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 .  
 Success rate is 0 percent (0/1)  
 R6#

It was R2 that failed to forward the 1424-byte GRE packet onto the constrained link to R5, so it sent a "packet too big" message not to the originator of the ping (R6), but to the originator of the GRE packet (R1):
 R2#  
 *Aug 15 16:42:18.558: ICMP: dst (10.0.45.4) frag. needed and DF set unreachable sent to 10.0.12.1  

R1 reacted by reducing the tunnel MTU only for R4's NBMA address (10.0.45.4). Pretty nifty.
 R1#  
 *Aug 15 16:42:18.582: ICMP: dst (10.0.12.1) frag. needed and DF set unreachable rcv from 10.0.12.2  
 *Aug 15 16:42:18.582: Tunnel0: dest 10.0.45.4, received frag needed (mtu 1400), adjusting soft state MTU from 0 to 1376  
 *Aug 15 16:42:18.586: Tunnel0: tunnel endpoint for transport dest 10.0.45.4, change MTU from 0 to 1376  
 R1#show interfaces tunnel 0 | include Path  
  Path MTU Discovery, ager 10 mins, min MTU 92  
  Path destination 10.0.23.3: MTU 0, expires never  
  Path destination 10.0.45.4: MTU 1376, expires 00:04:05  

Because R1 only reduced the MTU, but didn't alert R6 about the problem, a second ping from R6 is required to provoke the 'frag needed' message from R1, based on its knowledge of the constrained link between R2 and R5:
 R6#ping 4.4.4.4 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 4.4.4.4, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 M  
 Success rate is 0 percent (0/1)  
 R6#  
 *Aug 15 16:50:38.999: ICMP: dst (6.6.6.6) frag. needed and DF set unreachable rcv from 192.168.6.1  
 R6#  

We can still send un-fragmentable 1400-byte packets from R6 to R3:
 R6#ping 3.3.3.3 source lo0 ti 1 re 1 size 1400 df-bit   
 Type escape sequence to abort.  
 Sending 1, 1400-byte ICMP Echos to 3.3.3.3, timeout is 1 seconds:  
 Packet sent with a source address of 6.6.6.6   
 Packet sent with the DF bit set  
 !  
 Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms  

Now, I would like to be using this feature to discover end-to-end tunnel MTU for some IP multicast traffic on DMVPN, but for some reason my DMVPN interface doesn't generate unreachables in response to multicast packets. They're just dropped silently. Feels like a bug. Not sure what I'm missing.

Update: What I was missing is that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.
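
For comparison, hosts play the same discovery game with the same ICMP messages. A hedged, Linux-only Python sketch of watching the kernel's PMTUD cache (the target address is a placeholder, and the IP_* constants are spelled out because they come from <linux/in.h>):

 import socket, time

 IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU = 10, 2, 14    # values from <linux/in.h>

 s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
 s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)   # set DF on outgoing datagrams
 s.connect(('192.0.2.44', 9))                   # placeholder target, UDP discard port

 for _ in range(3):                             # like the second ping above, it can take a retry
     try:
         s.send(b'\x00' * 1472)                 # a 1500-byte IP packet once UDP/IP headers are added
     except OSError:
         pass                                   # EMSGSIZE once the kernel has cached a smaller path MTU
     time.sleep(1)

 print('cached path MTU:', s.getsockopt(socket.IPPROTO_IP, IP_MTU))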

Monday, July 20, 2015

Link Aggregation on HP Moonshot - A Neat Trick

The Broadcom switching OS running on HP's Moonshot 45G and 180G switches can do a neat trick1 that I haven't seen on other platforms.

Background: LACP-Individual
The trick revolves around interfaces that are sometimes aggregated, and sometimes run as individuals. Lots of platforms don't support this behavior. On those platforms, if an interface is configured to attempt aggregation but doesn't receive LACP PDUs, the interface won't forward traffic at all. Less broken platforms make this behavior configurable or have some goofy in-between mode which allows one member of the aggregation to forward traffic.

If the Moonshot were saddled with one of these broken2 switching OSes, we'd be in a real pickle: Moonshot cartridges (my m300s, anyway) require PXE in order to become operational, and PXE runs in the option ROM of an individual network interface. Even if that interface could form a one-member aggregation, it wouldn't be able to coordinate its operation with the other interface, and neither of their LACP speaker IDs would match the one chosen by the operating system that eventually gets loaded.

I suppose we could change the switch configuration: Add and remove individual interfaces from aggregations depending on the mode required by the server from one moment to the next, but that's pretty clunky, and the standard anticipated this requirement, so why bother?

It's been suggested to me that running a static (non-negotiated) aggregation could be a solution to the problem, but it introduces ECMP hashing toward the server. If we hash PXE traffic (DHCP, TFTP, etc...) so that it's delivered to the wrong NIC, the server won't boot. With the way ECMP decisions get made in the Broadcom switch stack, this suggestion can work, but only if we eliminate all but one link in and out of the chassis. Why live with spanning tree when we've got an expensive L3 switch and lots of physical uplinks here?

So, what's the trick?
On most platforms, the configuration applied to an interface (VLANs, mode, spanning tree stuff, etc...) is required to match the configuration of the aggregation. If the individual interface doesn't match the aggregate, it gets suspended.

Broadcom, for some reason, didn't see a need to implement this check. In their network OS, the physical interfaces can be configured completely differently from the aggregation. Like this:

 interface 1/0/1,2/0/1  
  vlan pvid 7  
  vlan participation include 7  
  addport 0/3/1  
 interface lag 1  
  no port-channel static  
  vlan participation include 51,61  
  vlan tagging 51,61  

This configuration makes ports 1/0/1 and 2/0/1 (connected to eth0 and eth1 of the first compute node) access ports in VLAN 7, and makes them want to join aggregation lag 1, which is a trunk carrying VLANs 51 and 61.

If the server sends LACP PDUs, those interfaces aggregate and trunk VLANs 51 and 61. If the server doesn't send PDUs, the interfaces are access ports in VLAN 7.

Now we can do cool stuff like:

  • Run the cartridges completely diskless. They boot up via PXE in VLAN 7. After loading a kernel and filesystem into memory, the NICs aggregate into a trunk carrying different VLANs.
  • Run servers on boot-up through some PXE-controlled self-tests and patching in VLAN 7, then chain-boot into the real (on disk, perhaps?) OS which will aggregate the interfaces.
In both cases, VLAN 7 is gone from the switch ports once the LACP messages roll in. The only way the server would find VLAN 7 again is to de-aggregate the interfaces.

I briefly considered adding port-channel min-links 2 to the aggregation config. Doing so would make sure that the server could see only VLAN 7 OR VLAN 51,61 at any given moment, but never all 3 VLANs at the same time. Doing so kills redundancy, so that plan is out.

It's not really a security mechanism, but it does make the access VLAN pretty much inaccessible, and it reduces the footprint of that broadcast domain much more than merely making it untagged would do.




1 Beginning with version 2.0.3.0.
2 Check out 802.3AX-2008, section 5.3.9.

Tuesday, June 30, 2015

Failing to the Cloud - and Back!

I attended Virtualization Field Day 5 last week! The usual Field Day disclaimers apply.

This network guy found himself way outside his comfort zone at a Virtualization event, but I had a fantastic time, and I learned a lot.

One of the things that really struck me was just how much virtualization platforms depend on mucking around with block storage in use by VMs. Half or more of the presentations hinged on it. Frankly, this notion terrifies the UNIX admin in me. I realize that we're not talking about UFS filesystems on SunOS4, but it seems those fragile old systems have really imprinted on me!

One of the VFD presenters was OneCloud Software, which presented a DR-via-Public-Cloud offering. The following bullets describing their solution came from here:

  • Auto discovers your on-premise assets; data and applications
  • Provides you with a simple policy engine to set RPO and RTO
  • Automatically provisions a fully functioning virtual data center in the cloud that mirrors your on-premise data center
  • Optimizes the economics of your data center in the cloud by eliminating unneeded compute costs and using the most cost-effective storage
  • Executes on-going data replication to keep the virtual data center in sync with the physical data center based on your RPO choices
  • Allows you to perform non-disruptive DR testing whenever needed
  • Provides failover and failback capabilities as needed
Not mentioned here are the provisioning of a VPN tunnel (public cloud to wherever the clients are), and the requisite re-numbering of VM network interfaces and tweaking of DNS records to support running pre-existing VMs in a new facility. This is normal stuff that you'd probably be talking about in a VMware SRM project anyway.

Most interesting to me is the data replication.

First, I'm still getting my head around the idea that it's safe to do this at all. It's obviously very popular, so I guess it's well established within the virtualization community that this is an Okay Thing To Do. I'd sure be thinking hard about any applications that write to block devices directly.

Next, there's the replication bit. VMware's OS-level write flush, snapshot, and dirty block tracking features are certainly involved in keeping the data in the public cloud synced up. I think I understand how that works.

But what about that last bullet? Failback? This is an interesting and key detail. Other folks in the room (who are much more knowledgable than I) were impressed by this feature and what it implies.

What does it imply? It implies data replication in the reverse direction. This is both interesting and hard because the snapshot feature of Amazon EBS presents each snapshot as a complete block device. Snapshots only consume storage space required by deltas, but the deltas themselves aren't directly available.

So, how is OneCloud effecting replication back to the primary data center? They're certainly not sending the entire VM image with every RPO interval. That would be insane, insanely expensive, and it would take forever. The answer from OneCloud is secret sauce. Bummer.

I've got two guesses about what might be going on here:

  1. They're subverting the block storage or filesystem driver within the VM. We know they're inside the guest OS anyway, because they're fiddling around with network settings and forcing filesystem write flushes. Maybe the block storage driver has been replaced/tweaked so that it sends changes not only to the disk, but also to a replication agent within Amazon EC2. I do not think this is super likely.
  2. An agent running at Amazon EC2 is diff-ing snapshots at regular intervals, and serving up the resulting incremental changes. This is definitely the hard way from a heavy-lifting perspective, but is pretty straightforward.

Wednesday, May 6, 2015

PSA: Linux Does RPF Checking

Twice now I've "discovered" that Linux hosts (even those that aren't doing IP forwarding) do Reverse Path Forwarding checks on incoming traffic.

Both times this has come up was in the context of a multicast application. It resulted in a conversation that went like this:
Application Person: Hey Chris, what's up with the network? My application isn't receiving any traffic.
Me: Um... The routers indicate they're sending it to you. The L3 forwarding counters are clicking. The L2 gear indicates it has un-filtered all of the ports between the router and your access port. Are you sure?
Application Person: My application says it's not arriving.
Me: I now have tcpdump running on your server. The traffic is arriving. Here are the packets. Do they look okay?
In the end, it turns out that the network was operating perfectly fine. The requested traffic was being delivered to the server, on the interface that requested it. It was the routing table within the Linux host that was screwed up.

RPF Checks
Reverse Path Forwarding checking is a feature that checks to make sure that a packet's ingress interface is the one that would be used to reach the packet's source. If a packet arrives on an interface other than the one matching the "reverse path", the packet is dropped.

RPF checking usually comes up in the context of routers. It's useful to make sure that users aren't spoofing their source IPs, and is a required feature of some multicast forwarding mechanisms, to ensure that packets aren't replicated needlessly.

Linux
Linux boxes, even those that aren't routing multicast packets (not recommended: the Linux PIM implementation is weak, especially in topologies that invoke turnaround-router operation), also implement RPF filters. If the host has only a single IP interface, you'll never notice the feature, because the only ingress interface is the RPF interface for the entire IP-enabled universe.

Implementing these filters is a little weird when a multi-homed Linux host is operating as a multicast sender or receiver. It's weird because the socket libraries exposed to applications let the application choose which interface should receive (or send!) the traffic. If the application chooses the "wrong" interface (according to the Linux routing table), incoming traffic will be dropped. Tcpdump will see the traffic, but it'll be dropped in the kernel before it reaches the application.
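
Here's the sort of thing I mean, in Python. The application names the interface it wants the group delivered to, and the kernel dutifully sends IGMP reports out that interface, but strict rp_filter will still discard the arriving traffic if the routing table points at a different interface for the source. Addresses below are placeholders:

 import socket

 GROUP, PORT, IFACE_IP = '239.1.1.1', 5000, '192.168.2.10'     # placeholders

 s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
 s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
 s.bind((GROUP, PORT))

 # The application, not the routing table, picks the receiving interface here.
 mreq = socket.inet_aton(GROUP) + socket.inet_aton(IFACE_IP)
 s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

 data, src = s.recvfrom(1500)
 print(len(data), 'bytes from', src)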

There's a lever to control this behavior:

sysctl net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.eth0.rp_filter = 1

It's documented in /usr/src/linux-3.6.6/Documentation/networking/ip-sysctl.txt which says:

rp_filter - INTEGER
        0 - No source validation.
        1 - Strict mode as defined in RFC3704 Strict Reverse Path
            Each incoming packet is tested against the FIB and if the interface
            is not the best reverse path the packet check will fail.
            By default failed packets are discarded.
        2 - Loose mode as defined in RFC3704 Loose Reverse Path
            Each incoming packet's source address is also tested against the FIB
            and if the source address is not reachable via any interface
            the packet check will fail.

        Current recommended practice in RFC3704 is to enable strict mode
        to prevent IP spoofing from DDos attacks. If using asymmetric routing
        or other complicated routing, then loose mode is recommended.

        The max value from conf/{all,interface}/rp_filter is used
        when doing source validation on the {interface}.

        Default value is 0. Note that some distributions enable it
        in startup scripts.

CentOS, it turns out, is one of the distributions that enables the feature in strict mode right out of the box. Rather than change the RPF checking, I opted to straighten out the host routing table both times this has come up. Still, I'm not convinced that RPF checking makes any sense on a host that isn't even an IP forwarder, let alone one running a multicast routing daemon. Given that user space processes are allowed to request multicast traffic on any interface they like, it doesn't make much sense for the kernel to request that traffic (via IGMP) on behalf of the application only to throw the packets away when they arrive.
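
If straightening out the routing table isn't an option, there are two other levers, sketched below with placeholder interface and prefix names: either teach the FIB that the source range really is reached through the receiving interface (directly attached here; use a via next-hop if the senders are a hop or more away), or relax that interface to loose mode, remembering that the kernel uses the max of the conf/all and per-interface settings per the documentation above.

 # option 1: add a route so the strict reverse-path lookup matches eth1
 ip route add 10.20.0.0/16 dev eth1
 # option 2: relax eth1 to loose mode (2); max(all, eth1) then yields loose
 sysctl -w net.ipv4.conf.eth1.rp_filter=2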

The kernel documentation above cites RFC3704. I had a read through the RFC to see if it distinguishes between recommended behavior for hosts vs. routers. It does not. The RFC only addresses router behavior. I don't think the authors intended for this behavior to be implemented by hosts at all.

Of course, IP multihoming of hosts without either per-interface routing tables or a routing protocol is almost always a bad idea that will lead to sadness. I do not recommend it.
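
If you must multihome a host, the usual pattern is a routing table per interface, selected by the source address the application binds to. A minimal sketch with placeholder addresses and table numbers:

 # give eth1 its own routing table (100) with its connected and default routes
 ip route add 198.51.100.0/24 dev eth1 src 198.51.100.10 table 100
 ip route add default via 198.51.100.1 table 100
 # traffic sourced from eth1's address consults table 100
 ip rule add from 198.51.100.10 table 100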

Wednesday, April 22, 2015

Controlling HP Moonshot with ipmitool

I've been driving the HP Moonshot environment over the network with ipmitool, and found it not altogether straightforward. One of the HP engineers told me:
Yeah, we had to jump through some hoops to extend IPMI’s single-system view of the world into our multi-node architecture.
That is exactly why it's confusing. Everything here works reasonably well, but users have to jump through all of the hoops that the product engineers lined up for us.

Compatibility
The build of ipmitool that ships with OS X (2.5b1) doesn't support the Moonshot's double-bridged topology, so I'm using the one that ships with macports (1.8.12). To check whether your version of ipmitool is compatible, run ipmitool -h and look to see whether it supports both the single-bridge (-b, -t) and double-bridge (-B, -T) command line options. If it does, then it's probably okay.

Bridging
Using IPMI over the network with a regular rack server is pretty straightforward. You specify the device by name or IP, the user credentials and the command/query you want to run. That's about it. Such a command might look like this:

 ipmitool -H <IPMI_IP> -U <user> -P <password> -I lanplus chassis identify force  

The command above turns on the beacon LED on a server. Most of the options here are obvious. The -I lanplus specifies that we intend to speak over the LAN to a remote host, rather than use IPMI features that may be accessible from within the running OS on the machine. I'm not using the -P <password> option in subsequent examples; instead I use -E, which pulls the user password from an environment variable.
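
For the record, -E reads the password from the IPMI_PASSWORD environment variable (at least in the ipmitool builds I've used), so the equivalent of the command above looks like this:

 export IPMI_PASSWORD='<password>'
 ipmitool -H <IPMI_IP> -U <user> -E -I lanplus chassis identify force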

Moonshot is quite a bit more complicated than a typical rack mount server. Here's a diagram of the topology from the HP iLO Chassis Management IPMI User Guide:
Moonshot IPMI Topology
While the identify command example does work against Moonshot, it turns on the chassis beacon LED. There are also beacon LEDs on each cartridge and on each switch. To manipulate those LEDs, we need to bridge the commands through the Zone MC to the various devices on the IPMB0 bus.

First, let's get an inventory from the perspective of the Zone MC:

 $ ipmitool -H <IPMI_IP> -EU Administrator -I lanplus sdr list all  
 ZoMC            | Static MC @ 20h    | ok  
 254             | Log FRU @FEh f0.60 | ok  
 IPMB0 Phys Link | 0x00               | ok  
 ChasMgmtCtlr1   | Static MC @ 44h    | ok  
 PsMgmtCtlr1     | Dynamic MC @ 52h   | ok  
 PsMgmtCtlr2     | Dynamic MC @ 54h   | ok  
 PsMgmtCtlr3     | Dynamic MC @ 56h   | ok  
 PsMgmtCtlr4     | Dynamic MC @ 58h   | ok  
 CaMC            | Static MC @ 82h    | ok  
 CaMC            | Static MC @ 84h    | ok  
 CaMC            | Static MC @ 86h    | ok  
 <snip>  
 Switch MC       | Static MC @ 68h    | ok  
 Switch MC       | Static MC @ 6Ah    | ok  

From this, we can see that the Zone MC, Chassis MC, and first power supply MC are all at the addresses we'd expect based on HP's drawing. Additionally, we can see the addresses of the remaining power supplies, the switches, and the cartridges (I snipped the output after the first three cartridges).

You can learn more about each of those discovered devices with:

 ipmitool -H <IPMI_IP> -EU Administrator -I lanplus fru print  

I've not yet figured out how to relate the cartridge and switch MCs to physical slot numbers, other than by flipping the beacon LEDs on and off or inspecting serial numbers. The picmg addrinfo command is supposed to make this possible, but I haven't managed to map its output to physical cartridge and switch slots either.

Okay, there's one more thing to note in the table above: the IPMB0 bus to which all of our downstream controllers are attached is channel 0x00. We need to know the channel number because these controllers potentially have many interfaces, and when sending bridged commands we have to supply both the channel number and the target address.

So, now we've got everything we need in order to flip on the beacon LED at cartridge #1:

 ipmitool -H <IPMI_IP> -EU Administrator -I lanplus -b 0 -t 0x82 chassis identify force  

Yes, the command is chassis identify, but it doesn't illuminate the chassis LED. That's because the command is executing within the context of a cartridge controller. The command above should light the LED on cartridge #1.
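
Turning the cartridge beacon back off should just be the same bridged command with an explicit duration of zero (see the command list further down):

 ipmitool -H <IPMI_IP> -EU Administrator -I lanplus -b 0 -t 0x82 chassis identify 0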

Cool, so we're now talking through the Zone MC to the individual cartridges, switches and power supplies! But what about the servers? Moonshot supports multiple servers per cartridge, so we're still one hop away. That's why we need double bridging.

Double Bridging
Double bridged commands work the same as single bridged, except that we have to specify the channel number and target address at each of two layers. The first hop is specified with -B and -T, second hop with -b and -t.

First, we need to get the layout of a cartridge controller. We'll run the sdr list all command again, but bridge it through to the cartridge in slot 1:

 $ ipmitool -H <IPMI_IP> -EU Administrator -I lanplus -b 0 -t 0x82 sdr list all  
 01-Front Ambient | 27 degrees C   | ok  
 02-CPU           | 0 degrees C    | ok  
 03-DIMM 1        | 26 degrees C   | ok  
 04-DIMM 2        | 26 degrees C   | ok  
 05-DIMM 3        | 28 degrees C   | ok  
 06-DIMM 4        | 27 degrees C   | ok  
 07-HDD Zone      | 27 degrees C   | ok  
 08-Top Exhaust   | 26 degrees C   | ok  
 09-CPU Exhaust   | 27 degrees C   | ok  
 CaMC             | Static MC @ 82h  | ok  
 SnMC             | Static MC @ 72h  | ok  
 SnMC 1           | Log FRU @01h c1.62 | ok  

This is a single-node cartridge (m300 cartridges are all I've got to play with), but even a single-node cartridge requires the same bridging hop as the quad-node cartridges. The SnMC at 0x72 refers to the lone server on this cartridge. I assume that multi-node cartridges would list several SnMC resources here.

Unfortunately, when the sdr list all command is run against the cartridge controller, it doesn't reveal anything about the downstream transit channel like it did when we ran it against the Zone MC. The channel number we need for the second bridge hop is 7. It's documented in chapter 3 of the HP iLO Chassis Management IPMI User Guide.

So, putting this all together, we'll set node 1 on cartridge 1 to boot from its internal HDD, and then set it to boot just once via PXE:

 $ ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus chassis bootdev disk options=persistent  
 $ ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus chassis bootdev pxe  
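
With the boot device queued up, a power cycle against the same double-bridged target should kick off the PXE boot. Remember that power cycle only works while node power is on, and mind the gotcha below about issuing these commands back-to-back too quickly:

 ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus chassis power cycle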

Some Other Useful Commands
Non-bridged commands:

  • lan print 
  • sel list
  • chassis status
  • chassis identify (lights the LED for 15 seconds)
  • chassis identify <duration> (0 turns off the LED)
  • chassis identify force (lights the LED permanently) 

Single-bridged commands:

  • chassis status
  • chassis identify (all variants above)

Double-bridged node commands:

  • chassis power status
  • chassis power on
  • chassis power off
  • chassis power cycle (only works when node power is on)
  • sol activate (connects you to the node console via Serial-Over-LAN)
  • sol deactivate (kills an active sol session)
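
Putting the double-bridged node commands together, powering up node 1 on cartridge 1 and attaching to its console looks something like this (the default ~. escape sequence ends the SOL session):

 ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus chassis power on
 ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus sol activate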

Gotchas
I've found that the web interface doesn't indicate beacon LED status when IPMI lights the LED with an expiration timer (the default behavior).

Attempts to use the Virtual Serial Port from the iLO command-line fail when an IPMI SOL session is active. The iLO CLI prompts you to "acquire" the session, but this fails too.

Setting node boot and power options too quickly (one after the other) seems to cause them to fail.

Node boot order settings configured via IPMI while node power is off do take effect, but the iLO command line doesn't reflect them until the node is powered on.

Update 2016/01/25
Version 1.40 of the Moonshot chassis manager firmware introduced the possibility of creating Operator class users who are restricted to viewing/manipulating only a subset of cartridges. This has been handy in a development environment, but there are a couple of gotchas to using IPMI capabilities as an Operator user.

The first gotcha is that the user needs to explicitly declare the intended privilege level by specifying -L OPERATOR at the ipmitool command line. I'm not clear on why the privilege level can't be inferred at the chassis manager from the passed credentials, but apparently it cannot.

The second gotcha: By default, the SOL capability requires ADMINISTRATOR class privilege to operate. You can see this by sending the sol info command via ipmitool as an ADMINISTRATOR class user. This requirement seems odd to me: OPERATORs are allowed to interact with the virtual serial port through the SSH interface without any additional configuration.

It is possible to allow OPERATOR users to use the IPMI SOL capability by changing the required privilege level. Do that by sending sol set privilege-level operator via ipmitool with ADMINISTRATOR credentials.
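
Putting the two together, and assuming (I haven't re-verified this) that the sol set command needs the same double-bridged node target that sol activate does:

 # as an ADMINISTRATOR user, lower the SOL privilege requirement for this node
 ipmitool -H <IPMI_IP> -EU Administrator -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus sol set privilege-level operator
 # then an OPERATOR class user can attach, declaring the privilege level with -L
 ipmitool -H <IPMI_IP> -EU <operator_user> -L OPERATOR -B 0 -T 0x82 -b 7 -t 0x72 -I lanplus sol activate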