Friday, October 16, 2015

Musings on Datanauts #9

I listened to episode 9 of the excellent Datanauts podcast with Ethan Banks and Chris Wahl recently.

Great job with this one, guys. I can tell how engaged I am in a podcast by how often I want to interrupt you :)

For this episode, that was lots of times!

Since I couldn't engage during the podcast, I'm going to have a one-sided discussion here, about the topics that grabbed my attention.

Chris explained that the 'notify switches' feature of an ESXi vSwitch serves to update the L2 filtering table on upstream physical switches. This is necessary any time a VM moves from one physical link (or host) to another.

Updating the tables in all of the physical switches in the broadcast domain can be accomplished with any frame that meets the following criteria:

  • Sourced from the VM's MAC address
  • Destined for an L2 address that will flood throughout the broadcast domain
  • Specifies an Ethertype that the L2 switches are willing to forward
VMware chose to do it with a RARP frame, probably because it's easy to spoof, and shouldn't hurt anything. What's RARP? It's literally Reverse ARP. Instead of a normal ARP query, which asks: "Who has IP x.x.x.x?" RARP's question is: "Who am I?"

It's like a much less feature-packed alternative to DHCP:
  • RARP can't be relayed by routers (as far as I know)
  • RARP doesn't share subnet mask (RARP clients use ICMP type 17 code 0 instead)
  • RARP doesn't tell the client about routers or DNS,  NTP, or WINS servers, etc...
I used to use RARP with SPARC-based systems: Bootup of diskless workstations, and Jumpstart-based server installs. A decade ago, I even got in.rarpd and ICMP subnet replies, tftp, nfs and all of the other services configured on my macbook so that I could Jumpstart large trading platforms from it. Man, that would have been an epic blog post...

Okay, so why RARP when GARP will do?
The answer has to do with what a hypervisor can reasonably be expected to know about the guest. Sending a RARP is easy, because it only requires knowledge of the vNIC's MAC address. No problem, because that vNIC is part of  ESXi.

Sending a GARP, on the other hand, requires that the sender know the IP address of the guest, which isn't necessarily going to be speedy (or even possible) for a hypervisor to know. Heck, the guest might not even speak IP! Then what?

Hey, what about multicast!
It feels like the guys missed an opportunity to talk about something cool here. Or maybe not, Greg would tell me that nobody cares about multicast.

When a guest moves, ESXi also has to jump through some hoops to un-cork the L2 multicast filters on the new physical port used by the guest. Unlike the case of the Unicast filters, where the hypervisor just knows the guest's MAC address, it can't know the guest's multicast membership, so it can't spoof any upstream messages.

Instead, it spoofs a downstream message to the guest. In it is an IGMP general membership query demanding immediate (response time = 0) information about membership in any multicast groups. The guest responds with its interest and those responses are forwarded out to the physical network where filters get adjusted.

Multiple vSwitches
Chris and Ethan never spelled this out explicitly, but I'm under the impression that running multiple vSwitches (for isolation or whatever reason) requires that you have physical NICs/cables/pSwitch ports for each vSwitch.

If your DMZ subnet, internal networks and storage LANs all live on different physical switches, then you'll need to cable different pNICs to each of these environments. Assigning each of these pNICs to its own vSwitch just seems like good sense in that case.

On the other hand, you probably wouldn't create separate vSwitches just because some notion of isolation feels like a good idea, because doing so requires you to produce pNICs, cabling and pSwitch ports for each of those vSwitches.

Maybe I'm off base here? It felt like Ethan was saying "Oh, so maybe I do want to create lots of vSwitches sometimes..." without addressing the physical side requirements that will come along with that decision.

Hating on LACP
I think the guys missed the boat on this one.

There's a bandwidth argument (or several!) to be made here. Consider LACP's load balancing capabilities:
  • Per-flow load balancing (or at least per src/dst IP pair - not sure about ESXi limitations here)
  • The pool of load-balanced objects includes all flows in/out of the host
  • Balancing decision made on a per-frame basis
Those win out over the migratory per-guest vPort pinning scheme from every angle:
  • Per-guest load balancing
  • A single guest can never exceed the speed of a single link
  • The pool of balanced objects is much smaller (guests)
  • Balancing decision made at long intervals
  • With a slightly clunky migration scheme (traffic - particularly IP multicast will be lost on rebalance with large L2 topologies).
But frankly, the bandwidth and migration reasons aren't the ones that matter most when it comes to making the LACP vs. pinning decision.

I wish that Chris had elaborated on the complications of using LACP on ESXi, because the complications of not using LACP in a sophisticated enterprise environment are substantial. The issues that matter here have to do with network topology, forwarding paths and failure modes in the physical network.

If we're talking about all pNICs linking to a single ToR switch, then it really doesn't matter. LACP and load based pinning schemes will accomplish the same job.

But when we're probably talking about an enterprise environment with redundant ToR switches and fancy features like MLAG, the tide shifts toward LACP. Heck, the switches are probably already doing MLAG on some other interfaces, and this matters in some failure modes. There some non-obvious ways to go wrong here:
  • Traffic between guests on two different hosts will take a needlessly long path if their randomly-chosen pNICs land on different physical switches.
  • In a Cisco Nexus world, the pSwitch ports will be known as orphan ports. These are loaded with crazy caveats in various failure modes.
These are not insurmountable obstacles, but they definitely shouldn't be taken lightly. Orphan ports in particular are a nasty can of worms.

No comments:

Post a Comment