Friday, October 16, 2015

Musings on Datanauts #9

I listened to episode 9 of the excellent Datanauts podcast with Ethan Banks and Chris Wahl recently.

Great job with this one, guys. I can tell how engaged I am in a podcast by how often I want to interrupt you :)

For this episode, that was lots of times!

Since I couldn't engage during the podcast, I'm going to have a one-sided discussion here, about the topics that grabbed my attention.

RARP?
Chris explained that the 'notify switches' feature of an ESXi vSwitch serves to update the L2 filtering table on upstream physical switches. This is necessary any time a VM moves from one physical link (or host) to another.

Updating the tables in all of the physical switches in the broadcast domain can be accomplished with any frame that meets the following criteria:

  • Sourced from the VM's MAC address
  • Destined for an L2 address that will flood throughout the broadcast domain
  • Specifies an Ethertype that the L2 switches are willing to forward
VMware chose to do it with a RARP frame, probably because it's easy to spoof and shouldn't hurt anything. What's RARP? It's literally Reverse ARP. Instead of a normal ARP query, which asks "Who has IP x.x.x.x?", RARP's question is "Who am I?"
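
For the curious, here's roughly what that notify frame looks like. This is a minimal sketch in Scapy, not VMware's actual code, and the MAC address and uplink name are made up:

```python
# Minimal sketch (Scapy) of a RARP-style notify frame like the one ESXi spoofs.
# The MAC and uplink name below are hypothetical.
from scapy.all import ARP, Ether, sendp

VM_MAC = "00:50:56:aa:bb:cc"   # the guest vNIC's MAC -- the only thing ESXi needs to know

notify = (
    Ether(src=VM_MAC, dst="ff:ff:ff:ff:ff:ff", type=0x8035)   # 0x8035 = RARP ethertype
    / ARP(op=3, hwsrc=VM_MAC, hwdst=VM_MAC,                    # op 3 = RARP request ("who am I?")
          psrc="0.0.0.0", pdst="0.0.0.0")                      # no IP knowledge required
)
# sendp(notify, iface="vmnic0")  # broadcast destination floods the domain; the source MAC updates every filter table
```

The frame ticks all three boxes above: it's sourced from the VM's MAC, it's broadcast-destined, and it carries an Ethertype the switches will happily flood.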

It's like a much less feature-packed alternative to DHCP:
  • RARP can't be relayed by routers (as far as I know)
  • RARP doesn't share subnet mask (RARP clients use ICMP type 17 code 0 instead)
  • RARP doesn't tell the client about routers, DNS, NTP, or WINS servers, etc.
I used to use RARP with SPARC-based systems: bootup of diskless workstations, and Jumpstart-based server installs. A decade ago, I even got in.rarpd, ICMP subnet mask replies, tftp, nfs, and all of the other services configured on my MacBook so that I could Jumpstart large trading platforms from it. Man, that would have been an epic blog post...

Okay, so why RARP when GARP will do?
The answer has to do with what a hypervisor can reasonably be expected to know about the guest. Sending a RARP is easy, because it only requires knowledge of the vNIC's MAC address. No problem, because that vNIC is part of ESXi.

Sending a GARP, on the other hand, requires that the sender know the IP address of the guest, which isn't necessarily going to be quick (or even possible) for a hypervisor to figure out. Heck, the guest might not even speak IP! Then what?
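
To make the contrast concrete, here's the equivalent gratuitous ARP in the same Scapy sketch style. The guest IP below is made up, which is exactly the problem: the hypervisor would have to learn it somehow before it could build this frame at all.

```python
# Sketch of a gratuitous ARP (reply form) -- note it can't be built without the guest's IP.
from scapy.all import ARP, Ether, sendp

VM_MAC = "00:50:56:aa:bb:cc"
VM_IP = "192.0.2.10"           # hypothetical; the hypervisor has no reliable way to know this

garp = (
    Ether(src=VM_MAC, dst="ff:ff:ff:ff:ff:ff")
    / ARP(op=2, hwsrc=VM_MAC, psrc=VM_IP,        # "192.0.2.10 is-at 00:50:56:aa:bb:cc"
          hwdst="ff:ff:ff:ff:ff:ff", pdst=VM_IP)
)
# sendp(garp, iface="vmnic0")
```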

Hey, what about multicast!
It feels like the guys missed an opportunity to talk about something cool here. Or maybe not; Greg would tell me that nobody cares about multicast.

When a guest moves, ESXi also has to jump through some hoops to un-cork the L2 multicast filters on the new physical port used by the guest. Unlike the unicast case, where the hypervisor just knows the guest's MAC address, it can't know the guest's multicast memberships, so it can't spoof any upstream messages.

Instead, it spoofs a downstream message to the guest: an IGMP general membership query demanding immediate (response time = 0) reports of membership in any multicast groups. The guest responds with its group memberships, and those reports are forwarded out to the physical network, where the filters get adjusted.
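
For reference, an IGMPv2-style general query is a tiny 8-byte message, so spoofing one toward the guest is cheap. Here's a rough sketch of what goes on the wire, hand-rolled in Python purely for illustration (ESXi obviously doesn't do it this way):

```python
import struct

def inet_checksum(data: bytes) -> int:
    """Standard 16-bit ones'-complement checksum, as used by IGMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def igmp_general_query(max_resp_time: int = 0) -> bytes:
    """IGMP general membership query: type 0x11, group 0.0.0.0.
    A max response time of 0 matches the 'respond immediately' behaviour described above."""
    unsummed = struct.pack("!BBH4s", 0x11, max_resp_time, 0, bytes(4))
    return struct.pack("!BBH4s", 0x11, max_resp_time, inet_checksum(unsummed), bytes(4))

# igmp_general_query() -> b'\x11\x00\xee\xff\x00\x00\x00\x00'
```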

Multiple vSwitches
Chris and Ethan never spelled this out explicitly, but I'm under the impression that running multiple vSwitches (for isolation or whatever reason) requires that you have physical NICs/cables/pSwitch ports for each vSwitch.

If your DMZ subnet, internal networks and storage LANs all live on different physical switches, then you'll need to cable different pNICs to each of these environments. Assigning each of these pNICs to its own vSwitch just seems like good sense in that case.

On the other hand, you probably wouldn't create separate vSwitches just because some notion of isolation feels like a good idea, because doing so requires you to produce pNICs, cabling and pSwitch ports for each of those vSwitches.

Maybe I'm off base here? It felt like Ethan was saying "Oh, so maybe I do want to create lots of vSwitches sometimes..." without addressing the physical side requirements that will come along with that decision.

Hating on LACP
I think the guys missed the boat on this one.

There's a bandwidth argument (or several!) to be made here. Consider LACP's load balancing capabilities:
  • Per-flow load balancing (or at least per src/dst IP pair - not sure about ESXi limitations here)
  • The pool of load-balanced objects includes all flows in/out of the host
  • Balancing decision made on a per-frame basis
Those win out over the migratory per-guest vPort pinning scheme from every angle (see the sketch after this list):
  • Per-guest load balancing
  • A single guest can never exceed the speed of a single link
  • The pool of balanced objects is much smaller (guests)
  • Balancing decision made at long intervals
  • A slightly clunky migration scheme (traffic, particularly IP multicast, will be lost on rebalance in large L2 topologies)
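
Here's the difference in sketch form. This is illustrative Python, not ESXi's actual hashing, and the uplink names are made up: with a flow hash, every conversation is a candidate for a different link; with pinning, everything a guest does rides one uplink until the next rebalance.

```python
import hashlib

UPLINKS = ["vmnic0", "vmnic1"]   # hypothetical pNIC names

def lacp_pick_uplink(src_ip: str, dst_ip: str) -> str:
    """Per-flow (here, per src/dst IP pair) hashing: the balanced pool is every flow on the host."""
    digest = hashlib.sha1(f"{src_ip}->{dst_ip}".encode()).digest()
    return UPLINKS[digest[0] % len(UPLINKS)]

def pinned_uplink(guest_vport_id: int) -> str:
    """Per-guest pinning: the balanced pool is just the guests, re-evaluated at long intervals."""
    return UPLINKS[guest_vport_id % len(UPLINKS)]
```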
But frankly, the bandwidth and migration reasons aren't the ones that matter most when it comes to making the LACP vs. pinning decision.

I wish that Chris had elaborated on the complications of using LACP on ESXi, because the complications of not using LACP in a sophisticated enterprise environment are substantial. The issues that matter here have to do with network topology, forwarding paths and failure modes in the physical network.

If we're talking about all pNICs linking to a single ToR switch, then it really doesn't matter: LACP and load-based pinning schemes will accomplish the same job.

But we're probably talking about an enterprise environment with redundant ToR switches and fancy features like MLAG, and there the tide shifts toward LACP. Heck, the switches are probably already doing MLAG on some other interfaces, and this matters in some failure modes. There are some non-obvious ways to go wrong here:
  • Traffic between guests on two different hosts will take a needlessly long path if their randomly-chosen pNICs land on different physical switches.
  • In a Cisco Nexus world, the pSwitch ports will be known as orphan ports. These are loaded with crazy caveats in various failure modes.
These are not insurmountable obstacles, but they definitely shouldn't be taken lightly. Orphan ports in particular are a nasty can of worms.

Thursday, October 15, 2015

HP Is Shipping Unicorns Now: 10GBASE-T SFP+ Module

It's long been said that we'll never see an SFP+ transceiver for 10GBASE-T media. Too much power, too small a package, too much heat, etc...

I'm not sure that never is quite right. There's this wonderful/horrible contraption:
Dawnray SFP+ module. Photo found here.
It's huge. It's ugly. It's covered with fins, so it must be hot. The data sheet says it consumes 7 Watts. Where's it getting 7W? Not from the SFP+ interface on the switch... Note the power cord attached to the module. It uses a wall wart!

This is not an elegant solution, but 10GBASE-T is hard, and this is the best we've got.

Until now.

/u/asdlkf recently pointed out on reddit that HP have published a data sheet[1] for a much more elegant SFP+ module for 10GBASE-T.

There were rumors that this module was going to have a giant heatsink and protrude far beyond the SFP+ slot, but it turns out that's not the case. It looks really good, and it's only a bit longer than some 1000BASE-T modules that I have kicking around the office.

The module uses only 2.3W (no wall wart required, but plugging in lots of them will still tax most switches), but is a bit of a compromise in that it can only push 10GBASE-T 30m (the standard calls for 100m).
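
That "lots of them" caveat is easy to put numbers on. Assuming a hypothetical fully populated 48-port SFP+ switch (the port count is made up; the 2.3 W figure is from the data sheet):

```python
ports = 48               # hypothetical SFP+ port count
watts_per_module = 2.3   # from the HP data sheet
print(f"{ports * watts_per_module:.1f} W of transceiver load")   # 110.4 W
```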

I'm not advocating for 10GBASE-T (I suspect Ferro would never speak to me again!). I'd rather use DAC or DAO transceivers for intra-rack links and optical transceivers for inter-rack links, because they're better than 10GBASE-T in so many ways:
  • Lower power
  • Lower latency
  • Lower bit error rate
  • Smaller cable diameter
  • Lower in-rack cost
  • Longer inter-rack runs
But I'm sure this nifty transceiver will solve some problems. Congratulations, HP, for being first to market with a usable option.

[1] HP's data sheet also has a funny typo in the Environment section. I may have to wait for global warming to melt the polar ice caps, raising sea level a bit, before I can deploy one of these units.