Fragmentation Needed: 2014

Monday, November 17, 2014

L2 DCI failure remediation with proxy ARP

I had the pleasure of taking an interstate road trip with Ethan Banks last weekend. Naturally, we talked shop a bit.

Somewhere in Connecticut we were speculating about the possibility of using host routes and proxy ARP to restore connectivity between members of a subnet when an L2 DCI fails.

Would a router produce proxy ARP replies for destinations which are members of a directly connected network?

I labbed it up to find out.

Don't do it!

The routers run different HSRP groups. One per data center.

Hosts are configured to use the local HSRP address as their default gateway. There is no FHRP localization other than the configuration on the hosts. This is not a scheme that facilitates VM mobility between sites.

Each router uses an IP SLA monitor to check the L2 interconnect. This is accomplished by pinging the other site's HSRP address. The IP SLA monitor drives a tracking object (track 1). The tracking object drives a second tracking object (track 2) with inverted logic: when the routers can't ping each other track 2 transitions to "up":

 ! IP SLA monitor on R1
 ip sla 1
  icmp-echo 192.168.1.2 source-interface FastEthernet0/1
 ip sla schedule 1 life forever start-time now
 ! track 1 follows IP SLA 1
 track 1 ip sla 1
 ! track 2 follows IP SLA 1 with inverted up/down state
 track 2 list boolean and
  object 1 not

Finally, each router is configured with static host routes for local systems. These routes depend on track 2 and are redistributed into the IGP:

 ! Configuration on R1
 ip route 192.168.1.11 255.255.255.255 FastEthernet0/1 track 2
 router ospf 1
  log-adjacency-changes
  redistribute static subnets

When the L2 DCI is healthy, the following is true:

IP SLA 1 is 'up'
track 1 is 'up'
track 2 is 'down'
the host routes are withdrawn

In other words, everything works normally. There are no host routes in the IGP, and no proxy ARP nonsense going on.
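The dependency chain (IP SLA 1, then track 1, then the inverted track 2, then the host route) can be written out as a small truth table. This is only a sketch of the logic described above, not anything the routers run; the function name `tracking_state` is made up for illustration.

```python
def tracking_state(sla_ok: bool) -> dict:
    """Model the chain: IP SLA 1 -> track 1 -> track 2 (inverted) -> host route."""
    track1_up = sla_ok            # track 1 follows IP SLA 1
    track2_up = not track1_up     # track 2 is a boolean list containing "object 1 not"
    return {
        "track1": "up" if track1_up else "down",
        "track2": "up" if track2_up else "down",
        "host_route_installed": track2_up,  # the static route is conditioned on track 2
    }

# Print both cases: DCI healthy (SLA probe succeeds) and DCI failed
for dci_healthy in (True, False):
    print(dci_healthy, tracking_state(dci_healthy))
```

With the DCI healthy the host route stays out of the table; once the probe fails, track 2 comes up and the route is installed.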

When the DCI fails, interesting things happen. First, the IP SLA monitor on both routers notices the failure and the tracking objects change state:

 *Nov 17 13:50:23.911: %TRACKING-5-STATE: 1 ip sla 1 state Up->Down  
 *Nov 17 13:50:24.443: %TRACKING-5-STATE: 2 list boolean and Down->Up

The static route comes to life:

 R1#show ip route static  
    192.168.1.0/24 is variably subnetted, 3 subnets, 2 masks  
 S    192.168.1.11/32 is directly connected, FastEthernet0/1

The other router learns the host route:

 R2#show ip route ospf  
    192.168.1.0/24 is variably subnetted, 3 subnets, 2 masks  
 O E2  192.168.1.11/32 [110/20] via 192.168.0.1, 00:14:02, FastEthernet0/0

Now what will happen when Host_B tries to ping Host_A?

R2's host route (learned via the IGP) causes him to believe that Host_A is best reached via the IGP. Proxy ARP is enabled, so R2 responds to Host_B's ARP query on behalf of Host_A:

 R2#debug ip arp  
 R2#  
 *Nov 17 14:07:07.791: IP ARP: rcvd req src 192.168.1.12 ca04.12d6.001c, dst 192.168.1.11 FastEthernet0/1  
 *Nov 17 14:07:07.791: IP ARP: sent rep src 192.168.1.11 0000.0c07.ac02, dst 192.168.1.12 ca04.12d6.001c FastEthernet0/1

The ping succeeds. Host_B's ARP entry for Host_A holds the HSRP virtual MAC address (the standby process on R2):

 Host_B#ping 192.168.1.11  
 Type escape sequence to abort.  
 Sending 5, 100-byte ICMP Echos to 192.168.1.11, timeout is 2 seconds:  
 .!!!!  
 Success rate is 80 percent (4/5), round-trip min/avg/max = 44/54/64 ms  
 Host_B#show ip arp 192.168.1.11  
 Protocol Address     Age (min) Hardware Addr  Type  Interface  
 Internet 192.168.1.11      0  0000.0c07.ac02 ARPA  Ethernet1/0

Disclaimer: Don't do this. ARP timers are way too long. Everything about this is horrible.

This was strictly an experiment to discover whether a more specific route would trigger a router to proxy ARP on behalf of a system which should be directly connected.
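A simplified model of the decision the lab exercised: the router looks up the ARP target with a longest-prefix match and answers on the target's behalf only when the best route points out a different interface than the one the request arrived on. This is an assumption-laden sketch of the classic proxy ARP rule, not IOS's actual implementation; `best_route` and `would_proxy_arp` are hypothetical names, and the route entries are taken from R2's output above.

```python
import ipaddress

def best_route(routes, target):
    """Longest-prefix match: return the most specific route covering target."""
    addr = ipaddress.ip_address(target)
    matches = [r for r in routes if addr in ipaddress.ip_network(r["prefix"])]
    return max(matches,
               key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen,
               default=None)

def would_proxy_arp(routes, target, arrival_interface, proxy_arp_enabled=True):
    """Simplified rule: reply when the best route to the target leaves via
    an interface other than the one the ARP request was received on."""
    route = best_route(routes, target)
    if not proxy_arp_enabled or route is None:
        return False
    return route["interface"] != arrival_interface

# R2's view of the world
connected = {"prefix": "192.168.1.0/24", "interface": "FastEthernet0/1"}
host_route = {"prefix": "192.168.1.11/32", "interface": "FastEthernet0/0"}

# DCI healthy: only the connected route exists
print(would_proxy_arp([connected], "192.168.1.11", "FastEthernet0/1"))
# DCI failed: the /32 learned via OSPF wins the lookup
print(would_proxy_arp([connected, host_route], "192.168.1.11", "FastEthernet0/1"))
```

The first call prints False and the second prints True, which matches what the lab showed: the more specific route is what tips the router into answering.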

Tuesday, September 23, 2014

New fiber connector is nifty

Corning has recently teamed up with Intel in introducing some new optical equipment. Corning's contribution (fibers, connectors) likely means there will be some unfamiliar looking optical infrastructure in your data center soon.

The fiber is a new 1310nm singlemode variety that Corning touts as "bend-insensitive". The minimum allowable bend radius of this fiber is 7.5mm. This is impressive, but expected under ITU-T G.657.B.

More interesting is the MXC connector. This is a push-on connector with a locking tab like the 8P8C connectors used for twisted pair Ethernet. It supports up to 64 fiber strands, each running at 25Gb/s.

MXC connector. Image from Corning-Intel Whitepaper.

The only place I've seen this fiber or connector in use is on a prototype 100G CLR4 transceiver shot by Greg Ferro at the Intel Developer Forum a couple of weeks ago.

Greg's shot of CLR4 transceivers with MXC connectors.

The CLR4 alliance explains that their approach puts four channels running at 25Gb/s each onto a single pair of single mode fiber, and specifically calls for LC connectors on the transceiver, so I'm a little confused about why these transceivers are sporting MXC connectors.

It seems the MXC connector will be used not directly on the transceivers, but for rapid deployment of many strands of structured fiber between racks. There will probably be a fan-out cassette of some sort to turn each MXC-terminated bundle into 32 LC connectors.

Anyway, none of that is the most interesting part of the connector. The most interesting bit is the termination of the optical fibers in the MXC connector. Did you notice that you can actually see the 64 strands (maybe even count them) in that first picture? Consider the following photo of an MPO connector for comparison:

MPO connector from Wikipedia

Most folks don't even notice the individual fiber strands on an MPO connector because they're distracted by the holes for the alignment pins. But you can actually see the 64 ends on the MXC connector. What's going on here?

Lenses:

Still from Intel-Corning video on youtube

The MXC connector includes tiny lenses which expand the light beam at the point where connectors meet. The beam expansion mitigates alignment and contamination issues common to traditional connectors which align fibers face-to-face.

It's pretty neat, though I'm a bit confused about some of the claims made in the various Intel and Corning announcements. The video linked above compares these new fibers to fibers in legacy connectors which have "50um fiber face" and claims that the 180um lens expands the diameter of the mating surface "almost four times".

Clearly they're not talking about single-mode when making these comparisons, but Intel is pushing single-mode based CLR4 transceivers.
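The arithmetic behind that observation is quick to check. The 50um and 180um figures come from the video; the 9um single-mode core diameter is an assumed typical value, not a number from the Intel or Corning material.

```python
def diameter_ratio(lens_um: float, fiber_face_um: float) -> float:
    """How many times wider the lens is than the bare fiber face."""
    return lens_um / fiber_face_um

# 50um face, as quoted in the video (a multimode-sized figure)
print(round(diameter_ratio(180, 50), 1))
# Assumed ~9um single-mode core, for comparison
print(round(diameter_ratio(180, 9), 1))
```

The first ratio is 3.6, which fits "almost four times". With a single-mode core the ratio would be about 20, so the quoted claim only makes sense for multimode fiber.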

Meh. I'm sure this will all become clear when products start shipping. For now I just thought it was interesting to note that these new high density connectors have baked-in lenses.

Wednesday, September 17, 2014

Making better use of libvirt hooks

Libvirtd includes handy hooks for doing management work at various phases in the lifecycle of the libvirt daemon, attached networks, and virtual machines. I've been using these hooks for various things and have found them particularly useful for management of short-lived Linux containers. Some of my use cases for these hooks include:

changing network policy
instantiating named routing tables
creating ramdisks for use by containers
pre-loading data before container startup
archiving interesting data at container shutdown
purging data at container destruction

Here's how the hooks work on a system with RedHat lineage:

The hook scripts live in /etc/libvirt/hooks. The scripts are named according to their purpose. I'm focusing right now on the LXC hook which is named /etc/libvirt/hooks/lxc. Note that neither the directory nor the scripts exist by default.

The lxc script is called several times in each container's lifecycle, and is passed arguments that specify the libvirt domain id and the lifecycle phase. During startup and shutdown of one of my LXC systems, the script gets called five times, like this:

 /etc/libvirt/hooks/lxc MyAwesomeContainer prepare begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer start begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer started begin -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer stopped end -  
 /etc/libvirt/hooks/lxc MyAwesomeContainer release end -

In addition to having those command line arguments, each time the script is run, it receives the guest's entire libvirt definition (XML) on STDIN.
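To make the calling convention concrete, here is a minimal sketch of how a hook might pick apart one invocation. It is not the script described in this post; `parse_hook_call` is a hypothetical helper, and the sample XML is a cut-down stand-in for a real guest definition. A real hook would read sys.argv and sys.stdin instead of the sample values.

```python
import xml.etree.ElementTree as ET

def parse_hook_call(argv, xml_text):
    """Split a hook invocation into domain, lifecycle phase, and parsed definition."""
    # argv[0] is the hook script itself; the next four items are the arguments
    domain, operation, sub_operation, extra = argv[1:5]
    root = ET.fromstring(xml_text)
    return {
        "domain": domain,
        "phase": (operation, sub_operation, extra),
        "name_in_xml": root.findtext("name"),
    }

# Stand-ins for what libvirt would supply on the command line and on STDIN
sample_argv = ["/etc/libvirt/hooks/lxc", "MyAwesomeContainer", "prepare", "begin", "-"]
sample_xml = "<domain type='lxc'><name>MyAwesomeContainer</name></domain>"
print(parse_hook_call(sample_argv, sample_xml))
```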

My script was parsing the command line arguments to figure out the domain ID and lifecycle phase, then calling countless modules to do different tasks depending on the context. It also read in some external configuration files which specified various parameters for each domain.

The script quickly became an absolute monster. It was doing too much, had too many dependencies, too many modules baked in, and was hard to test without interrupting all of the guest systems.

Two changes to the approach brought this facet of guest management under control:

Change #1: Run-parts Style
The first thing I did was to rip all of the smarts out of the script. All it does now is call other scripts in a manner similar to run-parts. It looks something like this:

 # Collect input from STDIN, stick it in $DATA  
 DATA=$(/bin/cat)  
   
 # Rip the domain ID out of $*  
 DOMAIN=$1; shift  
   
 # Join the remaining command line bits with '_' chars, then add '.d'  
 DIR=${0}_$(/bin/echo "$*" | /bin/sed -e 's/ /_/g' -e 's/$/.d/')  
   
 # Run all numbered-and-executable scripts in DIR with the usual CLI arguments,  
 # passing data collected on our STDIN to the script's STDIN.  
 for script in $DIR/[0-9]* ; do  
  if [ -x $script ] ; then  
   /bin/echo -n "$DATA" | $script $DOMAIN $*  
  fi  
 done

It now weighs in at a svelte 8 lines. It uses its own name ($0 is "lxc" in this case) and the passed command line arguments (not the first one which identifies the domain ID) to divine the name of a directory full of other scripts which need to be run. Then, it runs whatever scripts it finds in that directory. So, when it's called like this:

 /etc/libvirt/hooks/lxc MyAwesomeContainer prepare begin -

It runs all of the scripts it finds in:

 /etc/libvirt/hooks/lxc_prepare_begin_-.d

Those scripts must be executable, and they must be named with a leading digit to specify run order. It's an alphabetic sort, so consider using leading zeros: 11 sorts before 5, but not before 05.
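The directory naming and the sort-order caveat are easy to try out in Python. This sketch mirrors what the shell wrapper does with sed; `hook_dir` and the script names in the list are invented for the example.

```python
def hook_dir(script_path, args):
    """Mirror the wrapper: join the lifecycle arguments with '_' and append '.d'."""
    return script_path + "_" + "_".join(args) + ".d"

print(hook_dir("/etc/libvirt/hooks/lxc", ["prepare", "begin", "-"]))

# Alphabetic, not numeric: '11' lands before '5', but after '05'
print(sorted(["11_last.py", "5_oops.py", "05_first.py"]))
```

The second line prints ['05_first.py', '11_last.py', '5_oops.py'], which is why the leading zero matters.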

When the new wrapper calls the child scripts, it uses the same command line arguments, and feeds the same data on STDIN. There's an opportunity for things to go weird if libvirt used arguments with spaces in them, but it doesn't do that, so meh.

Now, rather than having a section of one big script that looks like this:

 #!/bin/bash  
 DOMAIN=$1; shift  
 case "$*" in  
     "prepare begin -" )  
         mkdirs_function $DOMAIN
         setup_ramdisk_function $DOMAIN
         preload_data_function $DOMAIN
     ;;  
     "release end -" )  
         archive_data_function $DOMAIN
         purge_data_function $DOMAIN
         destroy_ramdisks_function $DOMAIN
     ;;  
 esac

There is a directory structure that looks like this:

 /etc/libvirt/hooks  
              ├── lxc  
              ├── lxc_prepare_begin_-.d  
              │   ├── 01_mkdirs.py  
              │   ├── 02_setup_ramdisks.py  
              │   └── 03_preload.py  
              └── lxc_release_end_-.d  
                  ├── 01_archive.py  
                  ├── 02_rm-rf.py  
                  └── 03_destroy_ramdisks.py

Each of those scripts operates independently, and has all of the data it needs on the command line and STDIN. They run in order.

If I discover that I need to do something in the "started begin -" lifecycle phase, it's as easy as creating the /etc/libvirt/hooks/lxc_started_begin_-.d directory and dropping some new scripts in there.

Change #2: Libvirt Definition Supports Metadata
The second change is that I began using the libvirt guest definition to specify exactly what work needs to be done by the hook scripts. The various functions (create ramdisk, preload data, etc...) in my old script had been using configuration files, patterns based on the domain ID and so forth. This way is much better.

The libvirt XML schema is quite strict, but it does allow a <metadata> element which will cart around anything you care to define. I use it to specify ramdisk mount points and sizes, pre-load and post-cleanup stuff, etc...

For example, there are some guests which have directories that I don't want to persist after the guest container is destroyed. The libvirt definition for those guests includes the following:

  <metadata>  
    <post-rm-rf>  
      <dir>/path/to/dir/to/remove</dir>  
      <dir>/path/to/otherdir/to/remove</dir>  
    </post-rm-rf>  
  </metadata>

This blob of XML does nothing by itself. Libvirt totally ignores it. The action happens when /etc/libvirt/hooks/lxc_release_end_-.d/02_rm-rf.py gets called.

That guy looks for <dir> elements in <metadata><post-rm-rf>, and removes the specified directories. It's named '02' to ensure that it runs after 01_archive.py, which squirrels away logfiles and other data that I might want to retain.

Because this script receives the libvirt definition XML on STDIN, the information about what to remove travels with the guest. On shutdown of containers that don't include the <post-rm-rf> element, nothing happens.

Here's what's in 02_rm-rf.py:

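 #!/usr/bin/env python

 import shutil
 import sys
 import xml.etree.ElementTree as ET

 def xml_from_stdin():
   # libvirt hands the guest's definition (XML) to the hook on STDIN
   tree = ET.parse(sys.stdin)
   return tree.getroot()

 def rm_rf(path):
   # Remove a directory tree; ignore_errors makes a missing directory a no-op
   shutil.rmtree(path, ignore_errors=True)

 def main():
   root = xml_from_stdin()
   # Each <dir> under <metadata><post-rm-rf> names one directory to remove.
   # Guests without <post-rm-rf> yield no matches, so nothing happens.
   for elem in root.findall('./metadata/post-rm-rf/dir'):
     if elem.text and elem.text.strip():
       rm_rf(elem.text.strip())

 if __name__ == "__main__":
   main()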

Another example:
The 01_archive.py script is a little more complicated. Here's the relevant XML:

  <metadata>  
    <archive>  
      <dircopy>  
        <src>/path/to/interesting/data</src>  
        <dst>/path/to/interesting/archive/%DATE%/%HOUR%/blahblahblah</dst>  
      </dircopy>  
      <dircopy>  
        <src>/path/to/interesting/logs</src>  
        <dst>/path/to/logarchive/%DATE%/logs</dst>  
      </dircopy>  
      <datestr>  
        <format>%Y-%m-%d</format>  
        <replace>%DATE%</replace>  
      </datestr>  
      <datestr>  
        <format>%H</format>  
        <replace>%HOUR%</replace>  
      </datestr>  
      <datestr>  
        <format>%H:%M:%S</format>  
        <replace>blahblahblah</replace>  
      </datestr>  
    </archive>  
  </metadata>

There are two things going on inside the <archive> element. The first are the <dircopy> bits, which specify data to keep, and where to put it.

The second are the <datestr> bits, which specify strftime format strings, and the related strings in the destination path which should be replaced by an appropriately formatted timestamp.

Now, any <dst> element which includes the string %DATE% will find that string replaced with the current date, formatted like this: 2014-09-17. Similarly, any <dst> element which includes the string blahblahblah will find blahblahblah replaced by a nicely formatted timestamp. There's nothing magic about the percent symbols, the <replace> element can be any string.
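The substitution itself boils down to a loop over (format, token) pairs. The following is a sketch of that idea only, not the contents of 01_archive.py; `expand_dst` is a hypothetical name, and the timestamp is fixed so the result is reproducible.

```python
from datetime import datetime

def expand_dst(dst, datestrs, now):
    """Replace each <replace> token in dst with 'now' formatted by its <format> string."""
    for fmt, token in datestrs:
        dst = dst.replace(token, now.strftime(fmt))
    return dst

# (format, replace) pairs as they appear in the <datestr> elements above
datestrs = [("%Y-%m-%d", "%DATE%"), ("%H", "%HOUR%"), ("%H:%M:%S", "blahblahblah")]
now = datetime(2014, 9, 17, 13, 5, 9)  # fixed timestamp for a repeatable example

print(expand_dst("/path/to/interesting/archive/%DATE%/%HOUR%/blahblahblah", datestrs, now))
```

For that timestamp the destination becomes /path/to/interesting/archive/2014-09-17/13/13:05:09.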

The 01_archive.py script is here.

Final Thoughts

My guest management is much more agile with libvirt carrying around extra instructions in <metadata> and the modularity afforded by run-parts style hook scripts.

Each time I want a new thing to happen to a guest, I need to do three things:

Invent some XML describing specifics of the operation, jam it into <metadata>
Write a script which parses the XML, implements the new feature
Drop that script onto all of my libvirt nodes in the appropriate lifecycle_args.d directory

One gotcha which has tripped me up more than once: libvirtd might not notice that the hooks script has been installed (or edited?) until libvirtd gets restarted. Maybe I'll remember this tidbit after having blogged about it. [Edit: Nope. I was just surprised to read this section from an earlier draft.]

Saturday, September 13, 2014

No more duplicate frames with Gigamon Visibility Fabric

disclaimer

Gigamon presented their Visibility Fabric Architecture at Network Field Day 8. You can watch the presentation at Tech Field Day.

CC BY-NC-ND 2.0 licensed photo by smif
Needs deduplication

One of the interesting facets of Gigamon's solution was its ability to do real-time de-duplication of captured traffic as it traverses the Visibility Fabric (a hierarchy of monitoring data sources and advanced aggregation switches). I've spent some time around proactively-deployed network taps, but never seen this capability before, and I think it's pretty nifty.

The Problem
Let's say you've got taps and mirror ports deployed throughout your network. They're on the uplinks from data center access switches, virtually attached to vSwitches for collecting intra-host VM traffic, at the WAN and Internet edge, on the User distribution tier, etc... All of these capture points feed various analysis tools via Gigamon's Visibility Fabric. It's likely that a given flow will be captured and fed into the monitoring infrastructure at several points. Simplistic capture-port-to-tool-port forwarding rules will result in a given packet being delivered to each interested tool more than once, possibly several times.

This can be problematic because it confuses the analysis tool (ZOMG, look at all of those TCP retransmissions!), burns precious bandwidth on links within both the tap aggregation infrastructure and to the analysis tool, skews timestamps by introducing link congestion (not a problem for analyzers which leverage Gigamon's timestamp trailer) and wastes disk space when traffic is being written to disk.

Gigamon's Answer
When configured to suppress duplicates, Gigamon equipment will remember (for 50ms, by default) traffic it's sent to a given tool, and suppress delivery of duplicate copies collected in other parts of the infrastructure. The frames don't have to be exactly the same: A packet routed from one subnet to another will have a different Ethernet header. Apparently this is okay, and doesn't confuse the de-duplication filters. 802.1Q tags are probably okay too, though I think I heard that recognizing a VXLAN-encapsulated frame as a duplicate of a native Ethernet frame is beyond the capability of the de-dupe feature.

Even though Gigamon buffers traffic for 50ms, there's no added latency introduced when suppressing duplicates. It's not like a jitter buffer. Rather, the first copy of a given packet is always delivered immediately. The buffer is used only to recognize duplicates which need to be suppressed.
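The "deliver first, remember for a while" behavior can be sketched in a few lines. To be clear, this is a toy model and says nothing about how Gigamon actually implements the feature: the `Deduplicator` class is invented, it assumes untagged Ethernet frames, and it only ignores the Ethernet header. A genuinely routed copy also carries a decremented TTL and a new IP checksum, which this sketch does not mask.

```python
import hashlib

class Deduplicator:
    """Forward the first copy of a packet immediately; drop repeats seen within the window."""

    def __init__(self, window_s=0.050):
        self.window_s = window_s
        self.seen = {}  # digest of everything after the Ethernet header -> delivery time

    def should_forward(self, frame: bytes, now: float) -> bool:
        # Skip the 14-byte untagged Ethernet header so copies that differ
        # only in MAC addresses produce the same digest.
        digest = hashlib.sha256(frame[14:]).digest()
        # Forget anything older than the window
        self.seen = {d: t for d, t in self.seen.items() if now - t < self.window_s}
        if digest in self.seen:
            return False  # duplicate inside the window: suppress
        self.seen[digest] = now
        return True       # first copy: forward right away, no buffering delay

dedup = Deduplicator()
copy_a = b"\x01" * 14 + b"same payload"   # as captured at one tap
copy_b = b"\x02" * 14 + b"same payload"   # as captured elsewhere, different L2 header
print(dedup.should_forward(copy_a, now=0.000))
print(dedup.should_forward(copy_b, now=0.010))
print(dedup.should_forward(copy_b, now=0.100))
```

The three calls print True, False, True: the first copy goes straight through, the copy 10ms later is suppressed, and the same bytes 100ms later are treated as new because the window has expired.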

Should you de-duplicate the traffic? Maybe. It depends on the tool and on the problem you and the tool are attacking.

Tools focused on application-layer analysis (IDS/IPS, DLP, user experience monitoring, etc...) probably don't generally need to see the same traffic as it appears at various points in the network and will benefit from de-duplication, while tools focused on infrastructure performance monitoring and post-incident analysis will want to see everything, and are likely deployed in such a way that they're already set up to differentiate between multiple copies of the same captured data.

Disclaimer
I was a delegate and Gigamon was a presenter at NFD8. Gigamon paid some of the costs associated with putting on the event, and Gestalt IT covered the costs associated with getting me there, feeding me, etc... Also, I collected some vendor tchotchkes along the way. Neither Gigamon nor Gestalt IT bought my opinions, nor space on this blog. I'm writing about them because I found their presentation interesting, and for no other reason.

Tuesday, August 12, 2014

Cisco 881 or Cisco 881?

There are two versions of the Cisco 881 branch router:

Part numbers beginning with CISCO881, which have been end of lifed.
Part numbers beginning with C881, which are newly available.

There are a bunch of differences between these models, but it's hard to tell that a difference even exists, let alone what the differences are by looking at the available documentation. I just got my hands on a new C881 for the first time. Here's what I've noticed.

Physical Differences

New C881 on top, old CISCO881 (not wireless - don't believe the stickers) on bottom.

New C881 on top

Twin screw holes on the new C881...

...make the ACS-890-RM-19= work on the C881.

The USB port has been moved from one side to the other.
The "fake" screw hole on the side is now a threaded hole, which means that the C881 will accept the 891's rack mount hardware.
The Fa4 port has moved a bit.
The C881 is lead free, which seems to be what prompted all of these gyrations.

Power Differences

We have a power switch!
There's no longer a dedicated PoE brick.
There's still a required internal PoE module, and it's got a different part number.

Licensing Differences

C881 ships with 1GB RAM, but half of it is crippled by a license.
C881 running c800-universal software image doesn't enforce the advanced IP services license, while a CISCO881 running c880-universal software image does enforce that license.

 ! CISCO881-K9 with PoE option running c880data-universalk9-mz.154-3.M.bin  
 CISCO881#show inventory  
 NAME: "881", DESCR: "881 chassis, Hw Serial#: xxxxxxxxxxx, Hw Revision: 1.0"  
 PID: CISCO881-K9    , VID: V01 , SN: xxxxxxxxxxx  
   
 NAME: "ESW Power Daughter Card", DESCR: "4-Port ESW Power Daughter Card"  
 PID: ILPM-4      , VID: V01 , SN: xxxxxxxxxxx  
   
 CISCO881#show license feature   
 Feature name       Enforcement Evaluation Subscription  Enabled RightToUse  
 advipservices      yes     yes     no       no    yes  
 advsecurity        no      no      no       yes   no  
 ios-ips-update     yes     yes     yes      no    yes

 ! C881-K9 without PoE option running c800-universalk9-mz.SPA.154-3.M.bin  
 C881#show inventory  
 NAME: "C881-K9", DESCR: "C881-K9 chassis, Hw Serial#: xxxxxxxxxxx, Hw Revision: 1.0"  
 PID: C881-K9      , VID: V01, SN: xxxxxxxxxxx  
   
 NAME: "C881 Mother board on Slot 0", DESCR: "C881 Mother board"  
 PID: C881-K9      , VID: V01, SN: xxxxxxxxxxx  
   
 C881#show license feature  
 Feature name       Enforcement Evaluation Subscription  Enabled RightToUse  
 advipservices      no      yes     no       no    yes  
 advsecurity        no      no      no       no    no  
 ios-ips-update     yes     yes     yes      no    yes  
 MEM-8XX-512U1GB    yes     yes     no       no    yes

The old 881 won't let me configure BGP, multicast, or VRFs without the advipservices license, but the new 881 (or the c800-universal software build) is currently running those features in the lab without any problem.

ASR1002 does something similar: the IPSec RTU license on that box lists for US $10,000, but it's not enforced. I don't understand why they've done this.

I've not yet tried loading the c800-universal image on an old 881.

Other stuff

The C881 comes with twice as much flash (256MB rather than 128MB) as the CISCO881 - It's a good thing because the c800-universal software image is approximately twice the size (80MB) of the c880-universal image.
The C881 isn't available bundled with the advipservices license, which would be an annoying hassle if the license were enforced.

Update

Requested photos:

C881

CISCO881

Update2
Matt shares a shot of his C887VAM-K9 for comparison. Lots of missing parts on the C881 appear here.

C887VAM-K9

Wednesday, June 25, 2014

Wireshark on Android

The other day I found myself wishing I could run wireshark in realtime on an Android phone, but use the familiar GUI on my laptop. After a few minutes tinkering around, I was doing exactly that.

The phone belonged to an Android developer, so he'd already rooted it, enabled developer tools, etc... He'd also installed a packet capture application which worked, but didn't allow me to see things in real time.

The Android SDK bundle contains the adb binary, which is required for connecting to the phone. Extract adb and drop it somewhere in $PATH

 # run adb as root:  
 adb root  
 # connect to the phone over WiFi (the phone's owner had
 # already enabled this feature with 'adb tcpip' via USB):  
 adb connect <phone's wifi ip address>  
 # check that we get a root shell on the phone:  
 adb shell 'id'

It turns out that the packet capture application included a tcpdump binary at /data/data/lv.n3o.shark/files/tcpdump, and invoking it from the adb shell worked normally. It produced the usual startup message, and then a one line summary of each packet.

 adb shell '/data/data/lv.n3o.shark/files/tcpdump -c 2'
 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
 listening on wlan0, link-type EN10MB (Ethernet), capture size 96 bytes
 12:37:04.053553 IP 192.168.1.8.rplay > 192.168.1.22.57182: P 2817938036:2817938134(98) ack 1887364010 win 1358 
 12:37:04.054244 IP 192.168.1.8.42742 > google-public-dns-a.google.com.domain: 12160+ PTR? 8.1.168.192.in-addr.arpa. (44)

Progress, but not wireshark. I expected that adding the '-w -' flag (write pcap data to STDOUT) to tcpdump would allow me to collect pcap data from adb's STDOUT on my macbook, and feed it into wireshark, but it didn't give the result I wanted. There were two problems:

Tcpdump's informational messages got mingled with the pcap data when the data came out of adb shell. This was a problem.
Binary data piped through the adb shell got corrupted. The adb shell is a tty of some flavor, I guess, and this is not a safe way to pass binary data.

I solved the first problem by trashing tcpdump's informational messages. These go to STDERR on the phone, but were mingled with STDOUT when they reached the shell on the macbook. '2>/dev/null' (trash stderr) got rid of problem #1.

Problem #2 was resolved by base64 encoding the pcap data on the phone, before handing it to adb shell, and then removing the base64 encoding on the macbook. Fortunately there's a base64 binary in the root user's path on this Android phone.
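Here is a small demonstration of why the base64 detour helps. It assumes the corruption is the usual tty habit of rewriting LF as CR LF; that mechanism is a guess and was not confirmed on the phone. The `tty_mangle` function is a stand-in for whatever the adb shell does to the stream.

```python
import base64

def tty_mangle(data: bytes) -> bytes:
    """Assumed corruption: the terminal layer rewrites LF as CR LF."""
    return data.replace(b"\n", b"\r\n")

raw = bytes(range(256))  # stand-in for pcap data; it contains a 0x0a byte

# Sent raw, the stream no longer matches what was written
print(tty_mangle(raw) == raw)

# Sent as base64 text, the line endings are rewritten too, but Python's decoder
# discards characters outside the base64 alphabet, so the payload survives
encoded = base64.encodebytes(raw)
print(base64.b64decode(tty_mangle(encoded)) == raw)
```

The first comparison prints False and the second prints True.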

The bits that go into the final solution are:

adb shell 'command' (run command on the phone)
tcpdump -w - 2>/dev/null | base64 (run tcpdump, trash STDERR, encode the output)
base64 -D (undo the base64 encoding - this runs on the mac)
wireshark -ki - (wireshark '-k' means start capture immediately, '-i -' means read from stdin)

Putting it all together we have:

 adb shell '/data/data/lv.n3o.shark/files/tcpdump -w - 2>/dev/null | base64' | base64 -D | wireshark -ki -

Job done. The command above connects to the remote Android phone, fires up tcpdump on the phone, fires up wireshark on my laptop, and bolts the two together, making it possible to work with packets (as seen by the phone) in real time on a laptop.

It's not 100% real time, there's some batching and latency, but it was okay for my purposes. Adding tcpdump's '-U' flag might help here.

Tuesday, June 3, 2014

Relaying email with postfix + TLS through gmail

I needed to relay email from appliances in my house, and wanted to use my gmail domain + TLS to do it. Following are my notes from setting up a postfix server to do that job. All email relayed by this server appears to be sourced from the gmail account I created for it.

I wouldn't use this for anything customer-facing, but it's a reasonable way to get messages out of closed environments without worrying about how the messages were sourced, who they appear to be from, will SPF records screw things up, etc...

Create gmail account
I'm using an account named postfix-relay@marget.com. I set that guy up, and gave him a password.

Install Linux somewhere
I'm using a minimal installation of CentOS 6.5 for this project, installed with some automated nonsense I've long used for this sort of thing.

Tweak hostname

 sed -i 's/localhost.localdomain/postfix-relay.marget.com/' /etc/sysconfig/network

NFS mount my CentOS repository
The next little bit uses automounter to hang my CentOS repository on /CentOS and configure it as a repository. Skip it.

 yum install -y nfs-utils wget tcpdump unzip autofs  
 service rpcbind start  
 service autofs restart  
 ln -s /net/my_nfs_server/path/to/CentOS/ /CentOS  
 cp /etc/yum.repos.d/CentOS-Media.repo /etc/yum.repos.d/CentOS-NFS.repo  
 sed -i -e 's/\/media//' /etc/yum.repos.d/CentOS-NFS.repo  
 sed -i -e 's/c6-media/c6-nfs/' /etc/yum.repos.d/CentOS-NFS.repo

Update and install packages

 yum update -y  
 yum install -y ntp ntpdate wget unzip openssl-perl setools-console  
 yum install -y telnet cyrus-sasl-md5 cyrus-sasl-plain

Install and configure a user, sudo and ssh.

 useradd chris -u <myUID>  
 groupmems -g wheel -a chris  
 sed -i 's/# \(\%wheel.*NOPASSWD.*$\)/\1/' /etc/sudoers  
 (echo;echo) | su -c 'ssh-keygen -t rsa' chris  
 wget -O ~chris/.ssh/authorized_keys http://server.where/i_keep_my_public_key.pub  
 chmod 600 ~chris/.ssh/authorized_keys  
 chown chris:chris ~chris/.ssh/authorized_keys  
 sed -i 's/^#*PermitRootLogin.*$/PermitRootLogin no/' /etc/ssh/sshd_config  
 service sshd restart

NTP
The first couple of lines here edit the init script so that service ntpd status dumps peer synchronization info rather than just ntpd is running.

 export NTPQ='[ -x /usr/sbin/ntpq ] \&\& /usr/sbin/ntpq -c peers'
 sed -i "s|[[:space:]]status \$prog$|&\n\t$NTPQ|" /etc/init.d/ntpd
 chkconfig ntpdate on
 chkconfig ntpd on
 service ntpdate start
 service ntpd start
 sleep 5; service ntpd status

Certificate Authority
Postfix needs a certificate for this TLS business to work, but it doesn't need to be from a CA that gmail recognizes. I'm creating a CA just to sign one certificate (well, two: the CA's own certificate, and the certificate postfix will be using). The CA protects its key with a PEM passphrase. I like to use openssl rand -base64 40 to create PEM passphrases. The passphrase will be needed only 4 times (creation/confirmation/CA cert signing/postfix cert signing). Because this is a single-use CA, it's okay to forget the passphrase after we're done here. If you make a mistake in this section and want to start over with the CA stuff, just rm -rf /etc/pki/CA and start over.
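For anyone without openssl handy, a rough Python equivalent of openssl rand -base64 40 looks like this. It is a sketch using the standard library's `secrets` module; `random_passphrase` is a made-up name.

```python
import base64
import secrets

def random_passphrase(num_bytes: int = 40) -> str:
    """Base64-encode num_bytes of cryptographically strong random data."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

print(random_passphrase())
```

Forty random bytes encode to a 56-character string.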

The 'challenge password' doesn't matter. These exist so that a certificate holder may prove her identity when requesting revocation of a lost certificate. They prevent spoofing of revocation requests, and don't make sense as far as I'm aware in the context of a root CA certificate.

 # /etc/pki/tls/misc/CA.pl -newca  
 CA certificate filename (or enter to create)  <enter>
   
 Making CA certificate ...  
 Generating a 2048 bit RSA private key  
 ........+++  
 ....+++  
 writing new private key to '/etc/pki/CA/private/cakey.pem'  
 Enter PEM pass phrase:  ThisIsMyCApassphrase
 Verifying - Enter PEM pass phrase:   ThisIsMyCApassphrase 
 -----  
 You are about to be asked to enter information that will be incorporated  
 into your certificate request.  
 What you are about to enter is what is called a Distinguished Name or a DN.  
 There are quite a few fields but you can leave some blank  
 For some fields there will be a default value,  
 If you enter '.', the field will be left blank.  
 -----  
 Country Name (2 letter code) [XX]:US  
 State or Province Name (full name) []:New Hampshire  
 Locality Name (eg, city) [Default City]:Brookline  
 Organization Name (eg, company) [Default Company Ltd]:Marget Labs  
 Organizational Unit Name (eg, section) []:Basement  
 Common Name (eg, your name or your server's hostname) []:postfix  
 Email Address []:postfix-relay-admin@marget.com  
   
 Please enter the following 'extra' attributes  
 to be sent with your certificate request  
 A challenge password []: blahblahblah  
 An optional company name []:  <enter>
 Using configuration from /etc/pki/tls/openssl.cnf  
 Enter pass phrase for /etc/pki/CA/private/cakey.pem:   ThisIsMyCApassphrase 
 Check that the request matches the signature  
 Signature ok  
 Certificate Details:  
 <snip>

Display the CA Certificate
Not necessary, just taking a look. /etc/pki/CA/cacert.pem is the CA certificate.

 openssl x509 -in /etc/pki/CA/cacert.pem -noout -text

Copy the CA certificate into the postfix directory

 cp /etc/pki/CA/cacert.pem /etc/pki/tls/certs/POSTFIX-TRUST.pem

Create a Certificate Signing Request (CSR)

Input to the CSR creation is the subject name (string below) and a public/private key pair. This command creates the keys and the CSR. The key pair (both parts) will be stored in POSTFIX-key.pem, and the CSR will be stored in POSTFIX-req.pem.

openssl req -new -nodes -subj '/CN=postfix/O=Marget Labs/C=US/ST=New Hampshire/L=Brookline/emailAddress=postfix@home.marget.com' -keyout POSTFIX-key.pem -out POSTFIX-req.pem

Display the CSR

Just because we're curious.

openssl req -text -noout -in POSTFIX-req.pem

Sign the Certificate

The CA, using his private key (which is protected with the PEM passphrase), signs the CSR and creates the certificate postfix will be using. This command will spit out some details from the CSR (what are we being asked to sign?) and prompt for the CA's PEM passphrase (4th use of the passphrase here). The new certificate will land in POSTFIX-cert.pem.

openssl ca -days 1093 -out POSTFIX-cert.pem -infiles POSTFIX-req.pem

Display the Certificate

Just because we're curious.

openssl x509 -in POSTFIX-cert.pem -noout -text

Delete the CSR

Don't need it anymore.

rm -f POSTFIX-req.pem

Set permissions on the keys and certificate, move them into place.

 chmod 400 POSTFIX-key.pem  
 chmod 644 POSTFIX-cert.pem  
 mv POSTFIX-key.pem /etc/pki/tls/private  
 mv POSTFIX-cert.pem /etc/pki/tls/certs

Another trusted CA
Google will also be presenting a certificate. In order to trust google, we need to trust the CA that signed their certificate. Trusting that CA suggests we've seen the CA's certificate and public key. So far, we have not. Get Thawte's CA certificate (with public key inside) and append it to our existing file with trusted CA certificates.

 wget https://www.thawte.com/roots/thawte_Premium_Server_CA.pem  
 cat thawte_Premium_Server_CA.pem >> /etc/pki/tls/certs/POSTFIX-TRUST.pem  
 rm -f thawte_Premium_Server_CA.pem

What have we accomplished so far?
Surprisingly little, actually. Three files have been contributed to the postfix environment:

POSTFIX-TRUST.pem < root certificates belonging to both trusted CAs (ours and thawte's)
POSTFIX-key.pem < our private key
POSTFIX-cert.pem < our signed certificate
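
A quick sanity check on the trust file, as a rough sketch: count the certificates in it. The function name count_certs is made up for illustration, and the expected count of 2 assumes each CA file held exactly one certificate.

```shell
# Count the PEM certificates in a bundle file.
# Each certificate starts with a "-----BEGIN CERTIFICATE-----" line.
# (count_certs is a hypothetical helper name, not part of any package.)
count_certs() {
    # -e is needed because the pattern itself begins with dashes
    grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}
```

Running count_certs /etc/pki/tls/certs/POSTFIX-TRUST.pem should print 2 if both the home-made CA certificate and the Thawte one made it into the file.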

But, those three files make us ready to...

Configure postfix main.cf

This is the first of three postfix config files needing attention. In this file we're making use of the 3 files listed above. Some other bits of interest:

The mynetworks directive specifies the list of prefixes from which postfix will accept mail for relaying.
The inet_interfaces directive specifies the interfaces on which postfix will listen for mail. It might be appropriate to be specific here if the box is multihomed.
The smtp_sasl_password_maps directive specifies the file where we keep the password lookup table (not the plain text file; note the postmap command below).

 cat >> /etc/postfix/main.cf << EOF  
 relayhost = [smtp.gmail.com]:587  
 smtp_use_tls = yes  
 smtp_tls_CAfile = /etc/pki/tls/certs/POSTFIX-TRUST.pem  
 smtp_tls_cert_file = /etc/pki/tls/certs/POSTFIX-cert.pem  
 smtp_tls_key_file = /etc/pki/tls/private/POSTFIX-key.pem  
 smtp_tls_session_cache_database = btree:/var/run/smtp_tls_session_cache  
 smtp_tls_security_level = secure  
 smtp_tls_mandatory_protocols = TLSv1  
 smtp_tls_mandatory_ciphers = high  
 smtp_tls_secure_cert_match = nexthop  
 tls_random_source = dev:/dev/urandom  
 smtp_sasl_auth_enable = yes  
 smtp_sasl_security_options = noanonymous  
 smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd  
 inet_interfaces = all  
 mynetworks = 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 127.0.0.0/8 [::1]/128  
 EOF
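
Because the heredoc above appends to main.cf, any parameter the stock file already sets (inet_interfaces is a likely one) ends up defined twice. As far as I know postfix honors the last assignment, but it's worth knowing which ones collided. Here's a minimal sketch that lists them; duplicate_params is a made-up helper name, and it assumes parameter lines start in column one (lines with leading whitespace are treated as continuations).

```shell
# List main.cf parameters that are assigned more than once.
# (duplicate_params is a hypothetical helper name.)
duplicate_params() {
    awk '
        /^[[:space:]]*#/ { next }              # comment line
        /^[[:space:]]/   { next }              # continuation of the previous line
        /=/ {
            name = $0
            sub(/[[:space:]]*=.*/, "", name)   # keep the text before the first "="
            if (++seen[name] == 2) print name  # report each duplicate once
        }
    ' "$1"
}
```

Try it with duplicate_params /etc/postfix/main.cf and tidy up anything it prints.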

Configure postfix transport file
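This file tells postfix how to deliver certain kinds of messages. In this case we want to send everything to gmail's message submission engine. Note that postfix only consults a transport table when transport_maps in main.cf points at it and the table has been built with postmap; with the relayhost setting above, everything already goes to gmail either way.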

 cat >> /etc/postfix/transport << EOF  
 * smtp:smtp.gmail.com:587  
 EOF

Passwords!
In order to log into gmail, postfix needs to know the password for its account. We enter the password in /etc/postfix/sasl_passwd, and the postmap command writes it to the lookup table in /etc/postfix/sasl_passwd.db. Re-run postmap whenever the sasl_passwd file is edited.

 cat > /etc/postfix/sasl_passwd << EOF  
 [smtp.gmail.com]:587 postfix-relay@marget.com:MyGMailPassword  
 EOF  
   
 chown root:postfix /etc/postfix/sasl_passwd  
 chmod 640 /etc/postfix/sasl_passwd  
 postmap /etc/postfix/sasl_passwd
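
Since these files hold a plain text password, it doesn't hurt to confirm that nobody outside root and the postfix group can read them. A small sketch follows; world_readable is a made-up helper name, and it assumes GNU stat (the one CentOS ships).

```shell
# Succeeds (exit 0) if the file is readable by "other" users.
# Assumes GNU stat. (world_readable is a hypothetical helper name.)
world_readable() {
    mode=$(stat -c '%a' "$1") || return 2
    # The last octal digit holds the "other" permission bits; 4 means read.
    [ $(( ${mode#"${mode%?}"} & 4 )) -ne 0 ]
}
```

For example: for f in /etc/postfix/sasl_passwd /etc/postfix/sasl_passwd.db; do world_readable "$f" && echo "$f is world readable"; done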

Done!
postfix reload will handle most postfix reconfiguration without restarting the whole process. Oh, is postfix even running? I don't know. Heck, let's just do everything:

 postfix reload  
 chkconfig postfix on  
 /etc/init.d/postfix restart

At this point, postfix should be listening on port 25 and delivering messages to their destination via gmail. And the first bit (sending to gmail) will be secure.

Thursday, May 29, 2014

SDN: Where Everything is a Honeypot

Beware the honeypot army!

HP Networking introduced one of their SDN App Store partners to the Tech Field Day crowd at the ONUG spring conference a few weeks ago. If you don't know about ONUG, but you're interested in real-world SDN options and operator experience free of vendor lies, you should probably check out the upcoming fall ONUG conference.¹

GuardiCore's Active Honeypot SDN offering really captured my imagination in ways that other SDN demonstrations have failed to do.

The objective is to detect/ensnare an intruder who has already compromised an asset in your datacenter and is now attempting to move on from there. Honeypots are one way of doing this, but the likelihood of an attacker finding the honeypot, rather than a real server with real vulnerabilities, is pretty low in a large data center. How can we improve the odds?

The solution assumes that during normal operations, clients know where servers are and don't waste time attempting to connect to services which don't exist. An attacker, on the other hand, will be looking to find vulnerable services, and will probably attempt lots of connections to services that don't exist.

Because the attacker doesn't know where she's going to find the next service to exploit, her attempts will generate lots of TCP segments with the RST (go away!) bit set, something we don't expect² from normal application traffic.

So, lots of RST segments will be flowing toward the attacker from (potentially) all over the network. GuardiCore inserts forwarding rules into physical and/or virtual switches to intercept RST segments and redirect them to a GuardiCore analyzer. Both the interception and the analysis can be widely distributed, possibly into every hypervisor, so that the RST segments might never even hit the real network. This means we can implement the solution on a legacy DC network or on a network full of physical servers (with OpenFlow capable switches).
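
To make the detection idea concrete, here's a toy sketch. It is emphatically not GuardiCore's implementation, just my guess at the simplest version of the logic. It assumes a hypothetical log with one line per intercepted RST in the form "<rst_sender> <rst_receiver>", and it flags any receiver that has collected RSTs from at least a threshold number of distinct senders. Counting distinct senders (rather than raw RSTs) is one guess at how to avoid flagging a sloppy application that keeps hammering a single server.

```shell
# Print hosts that received RSTs from at least THRESHOLD distinct senders.
# Usage: rst_suspects THRESHOLD LOGFILE
# Input format (hypothetical): "<rst_sender> <rst_receiver>", one RST per line.
rst_suspects() {
    awk -v t="$1" '
        # count each sender/receiver pair only once
        NF >= 2 && !seen[$1, $2]++ { senders[$2]++ }
        END {
            for (h in senders)
                if (senders[h] >= t) print h, senders[h]
        }
    ' "$2" | sort
}
```

Each output line is a suspect host followed by the number of distinct systems that sent it an RST.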

When the GuardiCore engine receives an RST segment it's got 3 options:

Do nothing. The RST will never be delivered to the attacker, making it look like a filtered port on the server.
Put the RST back onto the wire for delivery to the attacker. She'll never know anything funny happened in this case.
Do magic things I'll describe below.

GuardiCore knows where your honeypots are, and what "services" each is offering. When it concludes that one of your systems appears to be compromised because it is scanning for vulnerable services, three things happen:

A NAT rule is inserted into the network near the attacker. The rule rewrites packets belonging to the attempted flow between the attacker and a real system (the one which generated the RST) into a flow between the attacker and a honeypot offering the service the attacker is seeking.
A SYN packet is delivered to the honeypot on the attacker's behalf.
The honeypot responds to the attacker with SYN/ACK. The NAT rule makes the packet appear to have come from the system which the attacker attempted to contact. The attacker thinks she's found a system to attack.

This is really nifty. The attacker is identified by collecting RST segments from everywhere (a great idea), and then is contained by transforming any/every IP address to which the attacker might attempt to connect into a honeypot.

Rather than waiting for the attacker to stumble onto the honeypot, GuardiCore makes sure that the honeypot finds the attacker.

Clever stuff.

1 Full disclosure: I think ONUG is great, but one of my pals is a bigwig there, so believe nothing I say about it.
2 Unfortunately, badly coded applications with sloppy socket handling generate lots of RSTs too. There's no question that GuardiCore will find these, but I imagine they've got some filtering logic to ignore these cases.

Monday, March 31, 2014

Custom DHCP server configuration on Opengear

For the last few years I've run my home DHCP service on a virtual private server at AWS.

This was not a great idea. It's a pain to resolve issues with my Internet service when those issues cause my laptop to stop getting an IP address assigned because the path between my house and the DHCP server has been interrupted.

The service is at Amazon because I wanted to purge "server"-like things from the house, but it's clear that I needed to bring DHCPd back home. I started investigating moving the service to one of my Opengear ACM5000 units, which is always running anyway because it keeps tabs on my generator and home security system, sends me text messages about interesting events and whatnot.

The Opengear web UI doesn't offer too many DHCP service configuration options, but I didn't expect that to be a problem. One of the things I love about Opengear is that most anywhere you look, the baked-in scripts and configuration elements can be replaced with user-supplied versions of those things.

I'd expected to find something like include /etc/config/dhcpd-user.conf in the automagically-generated DHCPd configuration, but it wasn't there.

So, I strings-ed every file in the devkit, and found this gem:

chris@opengeardev:/usr/local/src$ strings OpenGear-ACM500x-devkit-20140225/lib/libconfig.so | grep dhcpd-custom

/etc/config/dhcpd-custom.conf

It isn't documented anywhere (that I could find), so I pinged @opengeardev on twitter to ask about it.

3 minutes later, @opengeardev replied, having reviewed the code.

I've tried reaching other vendors in this space about real problems
on real (paid-for) products and gotten nowhere fast, even on their
(paid-for) support lines. Memo to JGV, ZEI and Ynagebavk (ROT13):
Try to be more like Opengear.

All I needed to do was:

Create my intended dhcp server configuration as /etc/config/dhcpd-custom.conf
Enable the service with the checkbox in the web UI
Maybe run the configurator by hand with config -r dhcp (probably hitting 'apply' after ticking the box does this anyway)
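
For step 1, something along these lines would do. This is only a sketch of a plain ISC dhcpd configuration: every address, range and MAC below is a made-up placeholder, not my real config, and write_dhcpd_custom is a hypothetical helper name.

```shell
# Write a sample ISC dhcpd configuration to the given path.
# All addresses, ranges and the MAC address are made-up placeholders.
write_dhcpd_custom() {
    cat > "$1" << 'EOF'
authoritative;
default-lease-time 3600;
max-lease-time 86400;

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.100 192.168.1.199;
  option routers 192.168.1.1;
  option domain-name-servers 192.168.1.1;
}

# A fixed address for one known client
host printer {
  hardware ethernet 00:11:22:33:44:55;
  fixed-address 192.168.1.20;
}
EOF
}
```

On the Opengear box that would be write_dhcpd_custom /etc/config/dhcpd-custom.conf, followed by steps 2 and 3.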

My new DHCP server configuration is now enshrined in the configuration bundle (backups from the web UI will save it), and the service is running with my particular configuration requirements, but without me having to butcher the service to get it running this way.

# ps | grep dhcpd     

 2993 root      3632 S    /bin/dhcpd -f -cf /etc/config/dhcpd-custom.conf -lf /

 3047 root      1228 S    grep dhcpd 

#

I love Opengear. Not only did they anticipate my requirement, but they're super-responsive to customer queries. Or, I assume they are. I'm not really a customer. The boxes I own were freebies, but those gifts haven't influenced my opinions about the company. I love Opengear because their products and their people are the best, not because they gave me some hardware to play with.

Tuesday, February 4, 2014

Tab Completion on Cumulus Linux

This film could have ended much differently

if Jerry were running Cumulus Linux

The TAB key on my keyboard gets a lot of use. Whether I'm looking at a bash prompt on a *NIX system or logged into a router's CLI, I almost never type whole commands.

In the bash shell, tab completion capabilities are usually limited to helping complete:

shell built-in commands
external executables found in $PATH
file names
directory names

Completion in bash doesn't help with things like command line arguments to various commands, but it is (sometimes) smart enough to not offer filenames as completion options to the 'cd' command, choosing instead to only offer directories.

Network devices, on the other hand, tend to have really rich inline help / command completion stuff, and I live by it.

Rather than typing abbreviated commands, I prefer to let the system help me type the whole thing, partly because it eliminates errors, and partly because I usually can't remember the exact syntax for a given platform. Cisco's godawful platform-dependent mac-address-table vs. mac address-table comes immediately to mind as something that always seems to take more than one attempt.

So, rather than typing this:

ROUTER#sh ip bg vpnv4 vr GREEN nei A.B.C.D received-

I tend to do this:

ROUTER#sh<TAB>
ROUTER#show ip bg<TAB>
ROUTER#show ip bgp vp<TAB>
ROUTER#show ip bgp vpnv4<TAB>
ROUTER#show ip bgp vpnv4 vr<TAB>
ROUTER#show ip bgp vpnv4 vrf GREEN nei<TAB>
ROUTER#show ip bgp vpnv4 vrf GREEN neighbors A.B.C.D re<TAB>
ROUTER#show ip bgp vpnv4 vrf GREEN neighbors A.B.C.D received-<TAB>
ROUTER#show ip bgp vpnv4 vrf GREEN neighbors A.B.C.D received-routes

This is pretty helpful, and an improvement over the bash shell where you often must abandon the command line for man pages in order to figure out the options for a given command.

Imagine my surprise when I found myself using pretty-full-featured tab completion in Cumulus Linux!

For example, when I started up the routing services for the first time, I instinctively did this:

#serv<TAB>
#service qu<TAB>
#service quagga sta<TAB>
#service quagga start

Checking the CDP/LLDP neighbors table also supports completion:

#lldp<TAB>
#lldpcli nei<TAB>
#lldpcli neighbor sh<TAB>
#lldpcli neighbor show

If I wasn't sure about an argument, or if more than one argument matched, a second press of <TAB> offered the whole list of possibilities in the way you'd expect from bash, similar to the '?' on Cisco IOS.

This is really nice, and so seamlessly integrated that I didn't even notice I was using it at first. Then I noticed, and was mystified: What is this wizardry?

Well, here's how it works.

Suppose we've got an executable mycommand, which understands two arguments "-A" and "-B".

In order to get bash completion for mycommand, we need to drop a bash completion script in /etc/bash_completion.d/ (a directory I'd never even noticed before.) The script should get sourced by your shell at startup time, and it does two things:

It defines a function in the shell which produces the list of possible completions.
It associates that function with the command in question for completion purposes.

So, here's a simple example, for mycommand.

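[chris@poetaster]$ cat /etc/bash_completion.d/mycommand.bash
_mycommand()
{
# the word currently being completed
local cur=${COMP_WORDS[COMP_CWORD]}
# offer whichever of -A and -B match what has been typed so far
COMPREPLY=( $(compgen -W "-A -B" -- "${cur}") )
return 0
}
complete -F _mycommand mycommand
[chris@poetaster]$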

Now, we're rewarded with the following:

[chris@poetaster]$ myc<TAB>
[chris@poetaster]$ mycommand <TAB>
[chris@poetaster]$ mycommand -<TAB>
<audible bell>
[chris@poetaster]$ mycommand -<TAB>
-A -B
[chris@poetaster]$ mycommand -A

Unlike Cisco IOS, the bash prompt in Cumulus Linux executes completion without printing extraneous lines, which I appreciate.
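
Incidentally, a completion function can be poked at without touching the TAB key, which is handy when debugging one. The sketch below defines a version of _mycommand that reads the current word from COMP_WORDS itself, then fakes the variables bash would set up when TAB is pressed after "mycommand -".

```shell
# Completion function for the hypothetical "mycommand" (options -A and -B).
_mycommand()
{
    local cur=${COMP_WORDS[COMP_CWORD]}      # the word being completed
    COMPREPLY=( $(compgen -W "-A -B" -- "$cur") )
}

# Simulate the state bash provides when TAB is pressed after "mycommand -".
COMP_WORDS=(mycommand -)
COMP_CWORD=1
_mycommand

# Show the candidates the function came up with, one per line.
printf '%s\n' "${COMPREPLY[@]}"
```

This prints -A and -B, the same two candidates the second press of TAB offered above.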

Of course, the completion routines baked into Cumulus Linux are much more complicated than my simple "-A -B" example here. They understand the context of the command: where files are required vs. switches, when one option makes another obsolete, interface names, IP addresses, etc.

The completion feature wasn't mentioned in any of the documentation I read. It's a small thing, and they just did it without making a big deal. But it's really helpful, and I think it will help make lots of network folks much more comfortable in the Linux shell than they might otherwise have been.

Also, I'm glad they did it because I had no idea this existed, and I'm already thinking about ways to bake it into stuff I work on every day. Particularly virsh. I want command completion to help me type the names of VMs.
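
Here's a first stab at the VM name idea, strictly a sketch. Rather than clobber whatever completion virsh may already have, it attaches to a hypothetical wrapper command called vmstart (a name I made up). It assumes that virsh list --all --name prints one VM name per line, and that VM names contain no whitespace or shell metacharacters, since compgen -W splits and expands its word list.

```shell
# List the names of all defined VMs, one per line.
# Assumption: virsh supports "list --all --name".
_vm_names()
{
    virsh list --all --name 2>/dev/null
}

# Offer the VM names that start with the word under the cursor.
# "vmstart" is a hypothetical wrapper command used only for illustration.
_vmstart()
{
    local cur=${COMP_WORDS[COMP_CWORD]}
    COMPREPLY=( $(compgen -W "$(_vm_names)" -- "$cur") )
}
complete -F _vmstart vmstart
```

Dropped into /etc/bash_completion.d/, this should make vmstart we<TAB> offer only the VMs whose names begin with "we".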

Monday, February 3, 2014

SDN Themes from ONUG - Let the ASIC go

Edit: I banged this out on the flight home from ONUG four months ago. Just found it in the drafts folder. ONUG's spring 2014 conference in New York is just 3 months away.

I was privileged to attend the Open Networking User Group (ONUG) Conference, ONUG Academy and mini Tech Field Day event hosted by JP Morgan Chase on October 29 and 30.
I attended at someone else's expense. Disclaimer.

ASICs came up a lot during these couple of days. Following are some ASIC-related things I heard and overheard at ONUG.

Sun Microsystems was overly attached to their SPARC processor (and so was I!). Folks inside Sun made efforts to derail Solaris x86 in order to protect their favorite server platform, and contributed to killing the company altogether. Sad story.

As good as your ASIC is, you'll never keep up with the performance of commodity chipsets. If the whitebox stuff is faster and still good enough to do the job, then it's probably going to win. It's certainly going to cost less. The proprietary ASIC may be better and have more features, but better is the enemy of good enough.

People used to route packets using general-purpose servers with multiple NICs. Cisco invented the networking market by introducing purpose-built equipment for moving packets between networks. The pendulum is swinging the other way now, as network functions become virtualized into general purpose servers, commodity network hardware and whitebox switching.

Lately (starting with the Nexus 3K) even Cisco is shipping switches with non-Cisco switching ASICs. This really says something about the state of the industry.

Assuming (as I do) that SDN overlays are the way of the future, they need to run on a physical underlay which doesn't require most of the features we're accustomed to seeing from major vendors' data center offerings.

My friend Brent Salisbury is fond of saying that "OVSDB is the forwarding plane of SDN."

If Brent is right, then commodity switching ASICs are the underlay of SDN, because it would be ridiculous to pay for fancy feature support twice. Right now, that means everyone is talking about Broadcom's Trident II, but Mellanox, Intel or others could launch a chip that changes the discussion at any time.

All we need from the underlay is a simple IP fabric, and I can't see any reason to care where it comes from. More interesting features are the price and the interface for configuring the switch / programming the ASIC.