Tuesday, September 23, 2014

New fiber connector is nifty

Corning has recently teamed up with Intel in introducing some new optical equipment. Corning's contribution (fibers, connectors) likely means there will be some unfamiliar looking optical infrastructure in your data center soon.

The fiber is a new 1310nm singlemode variety that Corning touts as "bend-insensitive". The minimum allowable bend radius of this fiber is 7.5mm. This is impressive, but expected under ITU-T G.657.B.

More interesting is the MXC connector. This is a push-on connector with a locking tab like the 8P8C connectors used for twisted pair Ethernet. It supports up to 64 fiber strands, each running at 25Gb/s.
MXC connector. Image from Corning-Intel Whitepaper.

The only place I've seen this fiber or connector in use is on a prototype 100G CLR4 transceiver shot by Greg Ferro at the Intel Developer Forum a couple of weeks ago.
Greg's shot of CLR4 transceivers with MXC connectors.
The CLR4 alliance explains that their approach puts four channels running at 25Gb/s each onto a single pair of single-mode fibers, and specifically calls for LC connectors on the transceiver, so I'm a little confused about why these transceivers are sporting MXC connectors.

It seems the MXC connector will be used not directly on the transceivers, but for rapid deployment of many strands of structured fiber between racks. There will probably be a fan-out cassette of some sort to turn each MXC-terminated bundle into 32 LC connectors.

Anyway, none of that is the most interesting part of the connector. The most interesting bit is the termination of the optical fibers in the MXC connector. Did you notice that you can actually see the 64 strands (maybe even count them) in that first picture? Consider the following photo of an MPO connector for comparison:
MPO connector from Wikipedia
Most folks don't even notice the individual fiber strands on an MPO connector because they're distracted by the holes for the alignment pins. But you can actually see the 64 ends on the MXC connector. What's going on here?

Still from Intel-Corning video on youtube
The MXC connector includes tiny lenses which expand the light beam at the point where connectors meet. The beam expansion mitigates alignment and contamination issues common to traditional connectors which align fibers face-to-face.

It's pretty neat, though I'm a bit confused about some of the claims made in the various Intel and Corning announcements. The video linked above compares these new fibers to fibers in legacy connectors which have "50um fiber face" and claims that the 180um lens expands the diameter of the mating surface "almost four times".

Clearly they're not talking about single-mode when making these comparisons (50um is a multimode core size; a single-mode core is closer to 9um), but Intel is pushing single-mode based CLR4 transceivers.

Meh. I'm sure this will all become clear when products start shipping. For now I just thought it was interesting to note that these new high density connectors have baked-in lenses.

Wednesday, September 17, 2014

Making better use of libvirt hooks

Libvirtd includes handy hooks for doing management work at various phases in the lifecycle of the libvirt daemon, attached networks, and virtual machines. I've been using these hooks for various things and have found them particularly useful for management of short-lived Linux containers. Some of my use cases for these hooks include:
  • changing network policy
  • instantiating named routing tables
  • creating ramdisks for use by containers 
  • pre-loading data before container startup
  • archiving interesting data at container shutdown 
  • purging data at container destruction

Here's how the hooks work on a system with RedHat lineage:

The hook scripts live in /etc/libvirt/hooks. The scripts are named according to their purpose. I'm focusing right now on the LXC hook which is named /etc/libvirt/hooks/lxc. Note that neither the directory, nor the scripts exist by default.

The lxc script is called several times in each container's lifecycle, and is passed arguments that specify the libvirt domain id and the lifecycle phase. During startup and shutdown of one of my LXC systems, the script gets called five times, like this:

 /etc/libvirt/hooks/lxc MyAwesomeContainer prepare begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer start begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer started begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer stopped end -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer release end -  

In addition to having those command line arguments, each time the script is run, it receives the guest's entire libvirt definition (XML) on STDIN.
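As a rough sketch of what a hook has to work with, here is one way to pick apart a single invocation in Python. The function name and the dictionary keys are my own labels for illustration (nothing libvirt defines), and the tiny XML document is a stand-in for a real guest definition.

```python
import xml.etree.ElementTree as ET

def parse_hook_call(argv, xml_text):
    """Split one hook invocation into its parts.

    argv is everything after the script name, for example
    ['MyAwesomeContainer', 'prepare', 'begin', '-'].
    xml_text is the guest definition that arrived on STDIN.
    """
    domain, operation, sub_operation, extra = argv
    root = ET.fromstring(xml_text)
    return {
        "domain": domain,
        "operation": operation,
        "sub_operation": sub_operation,
        "extra": extra,
        # <name> inside the XML should agree with the first argument
        "xml_name": root.findtext("name"),
    }

# Stand-in for a real (much longer) libvirt definition
example_xml = "<domain type='lxc'><name>MyAwesomeContainer</name></domain>"
info = parse_hook_call(["MyAwesomeContainer", "prepare", "begin", "-"], example_xml)
print(info["domain"], info["operation"], info["sub_operation"])
```

In a real hook the two inputs would come from sys.argv[1:] and sys.stdin.read().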

My script was parsing the command line arguments to figure out the domain ID and lifecycle phase, then calling countless modules to do different tasks depending on the context. It also read in some external configuration files which specified various parameters for each domain.

The script quickly became an absolute monster. It was doing too much, had too many dependencies, too many modules baked in, and was hard to test without interrupting all of the guest systems.

Two changes to the approach brought this facet of guest management under control:

Change #1: Run-parts Style
The first thing I did was to rip all of the smarts out of the script. All it does now is call other scripts in a manner similar to run-parts. It looks something like this:

 #!/bin/sh
 # Collect input from STDIN, stick it in $DATA
 DATA=$(/bin/cat)
 # Rip the domain ID out of $*
 DOMAIN=$1; shift
 # Join the remaining command line bits with '_' chars, then add '.d'
 DIR=${0}_$(/bin/echo "$*" | /bin/sed -e 's/ /_/g' -e 's/$/.d/')
 # Run all numbered-and-executable scripts in DIR with the usual CLI arguments,
 # passing data collected on our STDIN to the script's STDIN.
 for script in "$DIR"/[0-9]* ; do
  if [ -x "$script" ] ; then
   /bin/echo -n "$DATA" | "$script" "$DOMAIN" $*
  fi
 done

It now weighs in at a svelte 8 lines. It uses its own name ($0 is "/etc/libvirt/hooks/lxc" in this case) and the passed command line arguments (all but the first one, which identifies the domain ID) to divine the name of a directory full of other scripts which need to be run. Then, it runs whatever scripts it finds in that directory. So, when it's called like this:
 /etc/libvirt/hooks/lxc MyAwesomeContainer prepare begin -  

It runs all of the scripts it finds in:
 /etc/libvirt/hooks/lxc_prepare_begin_-.d/

Those scripts must be executable, and they must be named with a leading digit to specify run order. It's an alphabetic sort, so consider using leading zeros: 11 sorts before 5, but not before 05.

When the new wrapper calls the child scripts, it uses the same command line arguments, and feeds the same data on STDIN. There's an opportunity for things to go weird if libvirt used arguments with spaces in them, but it doesn't do that, so meh.
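For comparison, here's a minimal sketch of the same run-parts idea in Python. It is not the wrapper I actually run; hook_dir, find_scripts and run_hooks are names made up for this example. One side effect of handing subprocess an argument list instead of a shell string is that arguments containing spaces would reach the child scripts intact.

```python
import os
import subprocess

def hook_dir(hook_path, lifecycle_args):
    """Build the script directory name from the hook path and lifecycle args."""
    return "%s_%s.d" % (hook_path, "_".join(lifecycle_args))

def find_scripts(directory):
    """Return executable files whose names start with a digit, in sorted order."""
    if not os.path.isdir(directory):
        return []
    names = sorted(n for n in os.listdir(directory) if n[:1].isdigit())
    paths = [os.path.join(directory, n) for n in names]
    return [p for p in paths if os.path.isfile(p) and os.access(p, os.X_OK)]

def run_hooks(hook_path, domain, lifecycle_args, data):
    """Run each script with the original arguments, feeding data (bytes) to its STDIN."""
    for script in find_scripts(hook_dir(hook_path, lifecycle_args)):
        subprocess.run([script, domain] + list(lifecycle_args), input=data)

print(hook_dir("/etc/libvirt/hooks/lxc", ["prepare", "begin", "-"]))
# prints /etc/libvirt/hooks/lxc_prepare_begin_-.d
```

The sorted() call is a plain string sort, which is why 05_foo runs before 11_bar, and 11_bar runs before 5_baz.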

Now, rather than having a section of one big script that looks like this:
 DOMAIN=$1; shift
 case "$*" in
     "prepare begin -" )
         mkdirs_function $DOMAIN
         setup_ramdisk_function $DOMAIN
         preload_data_function $DOMAIN
         ;;
     "release end -" )
         archive_data_function $DOMAIN
         purge_data_function $DOMAIN
         destroy_ramdisks_function $DOMAIN
         ;;
 esac

There is a directory structure that looks like this:
              /etc/libvirt/hooks
              ├── lxc
              ├── lxc_prepare_begin_-.d  
              │   ├── 01_mkdirs.py  
              │   ├── 02_setup_ramdisks.py  
              │   └── 03_preload.py  
              └── lxc_release_end_-.d  
                  ├── 01_archive.py  
                  ├── 02_rm-rf.py  
                  └── 03_destroy_ramdisks.py  

Each of those scripts operates independently, and has all of the data it needs on the command line and STDIN. They run in order.

If I discover that I need to do something in the "started begin -" lifecycle phase, it's as easy as creating the /etc/libvirt/hooks/lxc_started_begin_-.d directory and dropping some new scripts in there.

Change #2: Libvirt Definition Supports Metadata
The second change is that I began using the libvirt guest definition to specify exactly what work needs to be done by the hook scripts. The various functions (create ramdisk, preload data, etc...) in my old script had been using configuration files, patterns based on the domain ID and so forth. This way is much better.

The libvirt XML schema is quite strict, but it does allow a <metadata> element which will cart around anything you care to define. I use it to specify ramdisk mount points and sizes, pre-load and post-cleanup stuff, etc...

For example, there are some guests which have directories that I don't want to persist after the guest container is destroyed. The libvirt definition for those guests includes the following:

This blob of XML does nothing by itself. Libvirt totally ignores it. The action happens when /etc/libvirt/hooks/lxc_release_end_-.d/02_rm-rf.py gets called.

That guy looks for <dir> elements in <metadata><post-rm-rf>, and removes the specified directories. It's named '02' to ensure that it runs after 01_archive.py, which squirrels away logfiles and other data that I might want to retain.

Because this script receives the libvirt definition XML on STDIN, the information about what to remove travels with the guest. On shutdown of containers that don't include the <post-rm-rf> element, nothing happens.
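To make the shape of that metadata concrete, here's a small sketch. The element names (<post-rm-rf> and <dir>) are the ones described above, but the guest definition and the two directory paths are hypothetical examples, and dirs_to_remove is a helper invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical guest definition, trimmed to the parts that matter here
EXAMPLE_DOMAIN_XML = """
<domain type='lxc'>
  <name>MyAwesomeContainer</name>
  <metadata>
    <post-rm-rf>
      <dir>/srv/example/scratch</dir>
      <dir>/srv/example/tmp</dir>
    </post-rm-rf>
  </metadata>
</domain>
"""

def dirs_to_remove(xml_text):
    """List the directories named by <dir> elements under <metadata><post-rm-rf>."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.findall("metadata/post-rm-rf/dir") if d.text]

print(dirs_to_remove(EXAMPLE_DOMAIN_XML))
# A guest with no <metadata> at all simply yields an empty list
print(dirs_to_remove("<domain type='lxc'><name>plain</name></domain>"))
```

Using a path expression with findall() means a missing <metadata> or <post-rm-rf> element produces no work rather than an error.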

Here's what's in 02_rm-rf.py:
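 #!/usr/bin/env python

 def xml_from_stdin():
   # Parse the guest's libvirt definition, which arrives on STDIN
   import sys
   import xml.etree.ElementTree as ET
   tree = ET.parse(sys.stdin)
   return tree.getroot()
 def rm_rf(dir):
   # Remove a directory tree, ignoring errors such as a missing directory
   import shutil
   shutil.rmtree(dir, ignore_errors=True)
 def main():
   root = xml_from_stdin()
   # Each <dir> under <metadata><post-rm-rf> names a directory to remove
   for dir in root.findall('metadata/post-rm-rf/dir'):
     if dir.text and dir.text.strip():
       rm_rf(dir.text.strip())
 if __name__ == "__main__":
   main()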

Another example:
The 01_archive.py script is a little more complicated. Here's the relevant XML:

There are two things going on inside the <archive> element. The first are the <dircopy> bits, which specify data to keep, and where to put it.

The second are the <datestr> bits, which specify strftime format strings, and the related strings in the destination path which should be replaced by an appropriately formatted timestamp.

Now, any <dst> element which includes the string %DATE% will find that string replaced with the current date, formatted like this: 2014-09-17. Similarly, any <dst> element which includes the string blahblahblah will find blahblahblah replaced by a nicely formatted timestamp. There's nothing magic about the percent symbols; the <replace> element can be any string.
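The substitution itself boils down to a string replace per <datestr>. Here's a minimal sketch of just that step, assuming the <replace> string and its strftime format have already been pulled out of the XML as pairs. The function name, the destination path and the %H%M format are hypothetical; this isn't lifted from 01_archive.py.

```python
import time

def expand_datestrs(dst, datestrs, when=None):
    """Replace each marker string in dst with a formatted timestamp.

    datestrs is a list of (replace_string, strftime_format) pairs.
    when is a time.struct_time; it defaults to the current local time.
    """
    when = when or time.localtime()
    for replace_string, fmt in datestrs:
        dst = dst.replace(replace_string, time.strftime(fmt, when))
    return dst

# Pin the clock so the example is repeatable
when = time.strptime("2014-09-17 13:45", "%Y-%m-%d %H:%M")
pairs = [("%DATE%", "%Y-%m-%d"), ("blahblahblah", "%H%M")]
print(expand_datestrs("/archive/%DATE%/logs-blahblahblah", pairs, when))
# prints /archive/2014-09-17/logs-1345
```

Since the marker is matched as a literal string, the percent signs in %DATE% carry no special meaning.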

The 01_archive.py script is here.

Final Thoughts
My guest management is much more agile with libvirt carrying around extra instructions in <metadata> and the modularity afforded by run-parts style hook scripts.

Each time I want a new thing to happen to a guest, I need to do three things:
  • Invent some XML describing specifics of the operation, jam it into <metadata>
  • Write a script which parses the XML, implements the new feature
  • Drop that script onto all of my libvirt nodes in the appropriate lifecycle_args.d directory
One gotcha which has tripped me up more than once: libvirtd might not notice that the hooks script has been installed (or edited?) until libvirtd gets restarted. Maybe I'll remember this tidbit after having blogged about it. [Edit: Nope. I was just surprised to read this section from an earlier draft.]

Saturday, September 13, 2014

No more duplicate frames with Gigamon Visibility Fabric


Gigamon presented their Visibility Fabric Architecture at Network Field Day 8.  You can watch the presentation at Tech Field Day.

CC BY-NC-ND 2.0 licensed photo by smif
Needs deduplication
One of the interesting facets of Gigamon's solution was its ability to do real-time de-duplication of captured traffic as it traverses the Visibility Fabric (a hierarchy of monitoring data sources and advanced aggregation switches). I've spent some time around proactively-deployed network taps, but never seen this capability before, and I think it's pretty nifty.

The Problem
Let's say you've got taps and mirror ports deployed throughout your network. They're on the uplinks from data center access switches, virtually attached to vSwitches for collecting intra-host VM traffic, at the WAN and Internet edge, on the User distribution tier, etc...  All of these capture points feed various analysis tools via Gigamon's Visibility Fabric. It's likely that a given flow will be captured and fed into the monitoring infrastructure at several points. Simplistic capture-port-to-tool-port forwarding rules will result in a given packet being delivered to each interested tool more than once, possibly several times.

This can be problematic because it confuses the analysis tool (ZOMG, look at all of those TCP retransmissions!), burns precious bandwidth on links within both the tap aggregation infrastructure and to the analysis tool, skews timestamps by introducing link congestion (not a problem for analyzers which leverage Gigamon's timestamp trailer) and wastes disk space when traffic is being written to disk.

Gigamon's Answer
When configured to suppress duplicates, Gigamon equipment will remember (for 50ms, by default) traffic it's sent to a given tool, and suppress delivery of duplicate copies collected in other parts of the infrastructure. The frames don't have to be exactly the same: A packet routed from one subnet to another will have a different Ethernet header. Apparently this is okay, and doesn't confuse the de-duplication filters. 802.1Q tags are probably okay too, though I think I heard that recognizing a VXLAN-encapsulated frame as a duplicate of a native Ethernet frame is beyond the capability of the de-dupe feature.

Even though Gigamon buffers traffic for 50ms, there's no added latency introduced when suppressing duplicates. It's not like a jitter buffer. Rather, the first copy of a given packet is always delivered immediately. The buffer is used only to recognize duplicates which need to be suppressed.
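To illustrate the general idea (and only the general idea: I have no insight into how Gigamon actually implements it), here's a toy sketch of a time-windowed duplicate filter. The class name, the SHA-1 fingerprint and the fixed 14-byte header skip are all assumptions of mine.

```python
import hashlib

class Deduplicator:
    """Toy duplicate filter: forward the first copy, drop repeats seen within a window."""

    def __init__(self, window=0.050, l2_header_len=14):
        self.window = window                # seconds to remember a packet (50ms)
        self.l2_header_len = l2_header_len  # assumes an untagged Ethernet header
        self.seen = {}                      # fingerprint -> time the first copy was forwarded

    def should_forward(self, frame, now):
        # Forget anything older than the window
        cutoff = now - self.window
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        # Fingerprint everything after the Ethernet header, so copies captured
        # on different segments still match. Copies that crossed a router also
        # differ in IP TTL and header checksum, which this toy does not mask.
        key = hashlib.sha1(frame[self.l2_header_len:]).digest()
        if key in self.seen:
            return False  # duplicate: suppress it
        self.seen[key] = now
        return True       # first copy: deliver immediately, no buffering delay

dedup = Deduplicator()
payload = b"same IP packet"
copy_a = b"\xaa" * 14 + payload  # captured at one tap
copy_b = b"\xbb" * 14 + payload  # same packet, different Ethernet header
print(dedup.should_forward(copy_a, now=0.000))  # True
print(dedup.should_forward(copy_b, now=0.010))  # False
print(dedup.should_forward(copy_b, now=0.100))  # True
```

Note that nothing is ever held back: the table only records what has already been sent, which is why this approach adds no latency.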

Should you de-duplicate the traffic? Maybe. It depends on the tool and on the problem you and the tool are attacking.

Tools focused on application-layer analysis (IDS/IPS, DLP, user experience monitoring, etc...) probably don't generally need to see the same traffic as it appears at various points in the network and will benefit from de-duplication, while tools focused on infrastructure performance monitoring and post-incident analysis will want to see everything, and are likely deployed in such a way that they're already set up to differentiate between multiple copies of the same captured data.

I was a delegate and Gigamon was a presenter at NFD8. Gigamon paid some of the costs associated with putting on the event, and Gestalt Media covered the costs associated with getting me there, feeding me, etc... Also, I collected some vendor tchotchkes along the way. Neither Gigamon nor Gestalt IT bought my opinions, nor space on this blog. I'm writing about them because I found their presentation interesting, and for no other reason.