Tuesday, June 30, 2015

Failing to the Cloud - and Back!

I attended Virtualization Field Day 5 last week! The usual Field Day disclaimers apply.

This network guy found himself way outside his comfort zone at a Virtualization event, but I had a fantastic time, and I learned a lot.

One of the things that really struck me was just how much virtualization platforms depend on mucking around with block storage in use by VMs. Half or more of the presentations hinged on it. Frankly, this notion terrifies the UNIX admin in me. I realize that we're not talking about UFS filesystems on SunOS4, but it seems those fragile old systems have really imprinted on me!

One of the VFD presenters was OneCloud Software, which presented a DR-via-Public-Cloud offering. The following bullets describing their solution came from here:

  • Auto discovers your on-premise assets; data and applications
  • Provides you with a simple policy engine to set RPO and RTO
  • Automatically provisions a fully functioning virtual data center in the cloud that mirrors your on-premise data center
  • Optimizes the economics of your data center in the cloud by eliminating unneeded compute costs and using the most cost-effective storage
  • Executes on-going data replication to keep the virtual data center in sync with the physical data center based on your RPO choices
  • Allows you to perform non-disruptive DR testing whenever needed
  • Provides failover and failback capabilities as needed

Not mentioned here are the provisioning of a VPN tunnel (public cloud to wherever the clients are), and the requisite re-numbering of VM network interfaces and tweaking of DNS records to support running pre-existing VMs in a new facility. This is normal stuff that you'd probably be talking about in a VMware SRM project anyway.

Most interesting to me is the data replication.

First, I'm still getting my head around the idea that it's safe to do this at all. It's obviously very popular, so I guess it's well established within the virtualization community that this is an Okay Thing To Do. I'd sure be thinking hard about any applications that write to block devices directly.

Next, there's the replication bit. VMware's OS-level write flush, snapshot, and dirty block tracking features are certainly involved in keeping the data in the public cloud synced up. I think I understand how that works.
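
For my fellow network folks, here's roughly how I picture dirty block tracking. This is a toy sketch in Python, not VMware's API: the TrackedDisk class, the apply_changes helper, and the 4 KiB block size are all names and values I made up for illustration. The disk remembers which blocks were written since the last sync, and only those blocks get shipped to the replica.

```python
BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration


class TrackedDisk:
    """Toy block device that remembers which blocks changed since the last sync."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.dirty = set()  # indexes of blocks written since the last sync

    def write(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.blocks[index] = data
        self.dirty.add(index)

    def take_changes(self):
        """Return {block_index: data} for dirty blocks and reset the tracking."""
        changes = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return changes


def apply_changes(replica, changes):
    """Apply a set of changed blocks to a replica (a plain list of blocks)."""
    for index, data in changes.items():
        replica[index] = data


# One replication cycle: two writes at the primary, two blocks on the wire.
disk = TrackedDisk(num_blocks=8)
replica = [bytes(BLOCK_SIZE) for _ in range(8)]
disk.write(1, b"\x01" * BLOCK_SIZE)
disk.write(5, b"\x05" * BLOCK_SIZE)
changes = disk.take_changes()
apply_changes(replica, changes)
```

The point is that the amount of data sent depends on how much changed, not on how big the disk is. A real system would also need the write flush and snapshot steps to get a consistent point in time, which this sketch ignores.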

But what about that last bullet? Failback? This is an interesting and key detail. Other folks in the room (who are much more knowledgeable than I) were impressed by this feature and what it implies.

What does it imply? It implies data replication in the reverse direction. This is both interesting and hard because the snapshot feature of Amazon EBS presents each snapshot as a complete block device. Snapshots only consume storage space required by deltas, but the deltas themselves aren't directly available.

So, how is OneCloud effecting replication back to the primary data center? They're certainly not sending the entire VM image with every RPO interval. That would be insane, insanely expensive, and it would take forever. OneCloud's answer: it's secret sauce. Bummer.
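
To put rough numbers on "insane", here's a back-of-the-envelope calculation in Python. Every figure in it is hypothetical and picked by me for illustration (a 100 GiB VM image, a 15-minute RPO, 1% of blocks changing per interval); none of it comes from OneCloud.

```python
GIB = 1024 ** 3


def daily_transfer_bytes(image_bytes, rpo_minutes, changed_percent=100):
    """Bytes sent per day if changed_percent of the image goes out every RPO interval."""
    intervals_per_day = (24 * 60) // rpo_minutes
    per_interval = image_bytes * changed_percent // 100
    return per_interval * intervals_per_day


image = 100 * GIB  # hypothetical VM image size
full_copy = daily_transfer_bytes(image, rpo_minutes=15)
deltas_only = daily_transfer_bytes(image, rpo_minutes=15, changed_percent=1)

print(full_copy // GIB, "GiB/day when sending the whole image")
print(deltas_only // GIB, "GiB/day when sending only changed blocks")
```

Under those made-up assumptions, the two approaches differ by a factor of 100 for a single VM, which is why the delta mechanism matters so much.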

I've got two guesses about what might be going on here:

  1. They're subverting the block storage or filesystem driver within the VM. We know they're inside the guest OS anyway, because they're fiddling around with network settings and forcing filesystem write flushes. Maybe the block storage driver has been replaced/tweaked so that it sends changes not only to the disk, but also to a replication agent within Amazon EC2. I do not think this is super likely.
  2. An agent running at Amazon EC2 is diff-ing snapshots at regular intervals, and serving up the resulting incremental changes. This is definitely the hard way from a heavy-lifting perspective, but is pretty straightforward.
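
To make guess 2 a bit more concrete, here's a toy sketch of snapshot diffing in Python. It's only my guess at the general shape, not OneCloud's method and not an Amazon API: the function names and the 4 KiB block size are mine, and the snapshots are just byte strings of equal length standing in for complete block devices.

```python
BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration


def diff_snapshots(old, new):
    """Compare two equal-sized images block by block; return {index: new_block}."""
    if len(old) != len(new) or len(old) % BLOCK_SIZE != 0:
        raise ValueError("images must be the same size and block-aligned")
    delta = {}
    for offset in range(0, len(new), BLOCK_SIZE):
        new_block = new[offset:offset + BLOCK_SIZE]
        if new_block != old[offset:offset + BLOCK_SIZE]:
            delta[offset // BLOCK_SIZE] = new_block
    return delta


def apply_delta(image, delta):
    """Return a copy of image with the changed blocks from delta written in."""
    patched = bytearray(image)
    for index, block in delta.items():
        start = index * BLOCK_SIZE
        patched[start:start + BLOCK_SIZE] = block
    return bytes(patched)


# Two "snapshots" of a four-block disk; only block 2 differs.
snap_a = bytes(4 * BLOCK_SIZE)
snap_b = bytearray(snap_a)
snap_b[2 * BLOCK_SIZE:3 * BLOCK_SIZE] = b"\xff" * BLOCK_SIZE
snap_b = bytes(snap_b)

delta = diff_snapshots(snap_a, snap_b)
restored = apply_delta(snap_a, delta)
```

Notice that diff_snapshots has to read every block of both images to find the handful that changed. That's the heavy lifting I mean: the work scales with the size of the disk, even though the result that goes over the wire is small.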