Wednesday, March 30, 2022

Set git behavior based on the repository path

I maintain a handful of git accounts at GitHub.com and on private git servers, and have repeatedly committed to a project using the wrong personality.

My early attempts to avoid this mistake involved scripts to set per-project git parameters, but I've found a more streamlined option.

The approach revolves around the file hierarchy in my home directory: Rather than dumping everything in a single ~/projects directory, they're now in ~/projects/personal, ~/projects/work, etc...

Whenever cloning a new project, or starting a new one, as long as I put it in the appropriate directory, git will choose the behaviors and identity appropriate for that project.

Here's how it works, with 'personal' and 'work' accounts at GitHub.com:

1. Generate an SSH key for each account

Not strictly required, I guess, but I like the privacy-preserving angle of using different keys everywhere, so I do this as a matter of habit.
 ssh-keygen -t ed25519 -P '' -f ~/.ssh/work.github.com  
 ssh-keygen -t ed25519 -P '' -f ~/.ssh/personal.github.com  

2. Add each public key to its respective GitHub account.

Use ~/.ssh/work.github.com.pub and ~/.ssh/personal.github.com.pub (note the .pub suffix).

Instructions here.
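
It's worth confirming each key authenticates as the intended account before going further. Something like this should do it (the -o IdentitiesOnly=yes option keeps ssh from offering whatever other keys your agent holds); GitHub answers with a greeting that names the account:

 ssh -i ~/.ssh/work.github.com -o IdentitiesOnly=yes -T git@github.com  
 ssh -i ~/.ssh/personal.github.com -o IdentitiesOnly=yes -T git@github.com  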

3. Create the project directories and .gitconfig files

 mkdir -p ~/projects/{personal,work}  
 touch ~/projects/{personal,work}/.gitconfig  

4. Set up a .gitconfig file in each directory

The git behaviors which should be differentiated by the parent folder belong in that folder's .gitconfig file. My ~/projects/work/.gitconfig for example:
 [user]  
  email = chris@work.domain  
 [core]  
  sshCommand = "ssh -i ~/.ssh/work.github.com"  
Because I only use a single git service/server with each directory/domain (personal/work), I opted to select my private key using the -i <filename> option to the ssh command. In a more complicated scenario, it might make sense to use -F <filename> to select an alternate ssh configuration file and specify keys per host there.

5. Conditionally include project gitconfig from main gitconfig

My main .gitconfig looks something like this:

 [init]  
  defaultBranch = main  
   
 [user]  
  name = Chris Marget  
   
 [core]  
  excludesFile = ~/.gitignore  
   
 [includeIf "gitdir:~/projects/work/"]  
  path = ~/projects/work/.gitconfig  
   
 [includeIf "gitdir:~/projects/personal/"]  
  path = ~/projects/personal/.gitconfig  

That's it!

Any work done in ~/projects/personal/somerepo will automatically use the git behavior specified by ~/projects/personal/.gitconfig.

I wasn't confident that the gitdir condition would match when cloning an existing repo (the directory doesn't exist yet!) but it works the way I'd hoped. clone/commit/push operations all now use the intended directory-specific configuration options.
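
A quick way to double-check the conditional include from inside any repo (the repo name below is just a placeholder) is to ask git for the effective values:

 cd ~/projects/work/some-repo  
 git config user.email                  # expect chris@work.domain  
 git config core.sshCommand             # expect ssh -i ~/.ssh/work.github.com  
 git config --show-origin user.email    # shows which .gitconfig supplied the value  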

Friday, March 11, 2022

Dell 2161DS-2 serial port pinout

I picked up a Dell (Avocent) 2161DS-2 (same as 4161DS?) KVM recently, and needed to use the serial port to upgrade the software.

Naturally, the serial port pinout is non-standard and requires a proprietary cable which comes with the KVM. Dell part numbers 80DH7 and 3JY78 might be involved. I don't have, and have never seen, these cables.

I was able to find the RX, TX and Ground pins and interact with the system using 9600, 8, N, 1.

Pinout in red text

Is the color coding inside these adaptors standardized? If so this may help.

The system prints some unsolicited messages ("welcome" or somesuch) a little while after power-up.
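
For reference, I talked to it with screen and a USB serial adapter. The device path below is whatever your adapter enumerates as, and screen's defaults already give you 8, N, 1:

 ls /dev/tty.usb*                       # find the adapter's device name  
 screen /dev/tty.usbserial-XXXX 9600    # substitute your adapter; 9600, 8, N, 1  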

Notes from upgrading the firmware from MacOS 12:


 # Grab the firmware  
 URL="https://dl.dell.com/RACK SOLUTIONS/DELL_MULTI-DEVICE_A04_R301142.exe"  
 wget -P /tmp "$URL"  
   
 # Start MacOS tftp service  
 sudo launchctl load -w /System/Library/LaunchDaemons/tftp.plist  
   
 # Extract the firmware (it's a self-extracting exe, but we can open it with unzip)  
 sudo unzip -d /private/tftpboot "/tmp/$(basename "$URL")" Omega_DELL_1.3.51.0.fl  
   
 # Now, using the menu on the KVM serial port, point it toward the MacOS TFTP service  
 # to retrieve the Omega_DELL_1.3.51.0.fl file  

Tuesday, March 2, 2021

VPC peering with terraform

I wrote up an example of using terraform modules and provider aliases to turn up interconnected cloud resources (vpcs / vpc peering / peering acceptance / routes) across multiple cloud regions.




Check it out at GitHub.

Saturday, December 21, 2019

How do RFC3161 timestamps work?

RFC3161 exists to demonstrate that a particular piece of information existed at a certain time, by relying on a timestamp attestation from a trusted 3rd party. It's the cryptographic analog of relying on the date found on a postmark or a notary public's stamp.

How does it work? Let's timestamp some data and rip things apart as we go.

First, we'll create a document and have a brief look at it. The document will be one million bytes of random data:

 $ dd if=/dev/urandom of=data bs=1000000 count=1  
 1+0 records in  
 1+0 records out  
 1000000 bytes transferred in 0.039391 secs (25386637 bytes/sec)  
 $ ls -l data  
 -rw-r--r-- 1 chris staff 1000000 Dec 21 14:10 data  
 $ shasum data  
 3de9de784b327c5ecec656bfbdcfc726d0f62137 data  
 $   

Next, we'll create a timestamp request based on that data. The -cert option asks the timestamp authority (TSA) to include their identity (certificate chain) in their reply and -no_nonce omits anti-replay protection from the request. Without specifying that option we'd include a large random number in the request.

 $ openssl ts -query -cert -no_nonce < data | hexdump -C  
 Using configuration from /opt/local/etc/openssl/openssl.cnf  
 00000000 30 29 02 01 01 30 21 30 09 06 05 2b 0e 03 02 1a |0)...0!0...+....|  
 00000010 05 00 04 14 3d e9 de 78 4b 32 7c 5e ce c6 56 bf |....=..xK2|^..V.|  
 00000020 bd cf c7 26 d0 f6 21 37 01 01 ff                |...&..!7...|  
 0000002b  
 $   

The timestamp request is only 43 bytes, so we're definitely not sending the actual document (one million bytes) to the TSA. So what's in the request?

Perusing the data, a couple of things (the two mandatory items, in fact) stand out. First, the SHA1 hash of the data that we captured earlier appears here in the timestamp request. Second, the OID specifying SHA1 is included here. The OID (1.3.14.3.2.26) is a little less recognizable because of the insane way OIDs are encoded in ASN.1 (section 8.19), but there it is. The only other information included in the request is the result of the -cert flag, which is a boolean (true) in the last 3 bytes. I found that the TSA refused to respond if I failed to request that the signer include its certificate chain.
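
If you'd rather not read raw hex, openssl's asn1parse will decode the same structure. This just regenerates the query and dumps it; look for the sha1 OBJECT identifier, the 20-byte OCTET STRING, and the trailing BOOLEAN:

 openssl ts -query -cert -no_nonce < data | openssl asn1parse -inform DER -i  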

Okay, let's re-generate that request and send it to a real TSA:

 $ openssl ts -query -cert -no_nonce < data | curl --data-binary @- http://timestamp.entrust.net/TSS/RFC3161sha2TS | hexdump -C  
 Using configuration from /opt/local/etc/openssl/openssl.cnf  
  % Total  % Received % Xferd Average Speed  Time  Time   Time Current  
                  Dload Upload  Total  Spent  Left Speed  
 100 3773 100 3730 100  43 24064  277 --:--:-- --:--:-- --:--:-- 24500  
 00000000 30 82 0e 8e 30 03 02 01 00 30 82 0e 85 06 09 2a |0...0....0.....*|  
 00000010 86 48 86 f7 0d 01 07 02 a0 82 0e 76 30 82 0e 72 |.H.........v0..r|  
 00000020 02 01 03 31 0b 30 09 06 05 2b 0e 03 02 1a 05 00 |...1.0...+......|  
 00000030 30 81 e2 06 0b 2a 86 48 86 f7 0d 01 09 10 01 04 |0....*.H........|  
 00000040 a0 81 d2 04 81 cf 30 81 cc 02 01 01 06 09 60 86 |......0.......`.|  
 00000050 48 86 fa 6c 0a 03 05 30 21 30 09 06 05 2b 0e 03 |H..l...0!0...+..|  
 00000060 02 1a 05 00 04 14 3d e9 de 78 4b 32 7c 5e ce c6 |......=..xK2|^..|  
 00000070 56 bf bd cf c7 26 d0 f6 21 37 02 06 5d eb 11 2b |V....&..!7..]..+|  
 00000080 86 c9 18 13 32 30 31 39 31 32 32 31 32 30 30 38 |....201912212008|  
 00000090 30 36 2e 30 30 34 5a 30 04 80 02 01 f4 a0 76 a4 |06.004Z0......v.|  
 000000a0 74 30 72 31 0b 30 09 06 03 55 04 06 13 02 43 41 |t0r1.0...U....CA|  
 000000b0 31 10 30 0e 06 03 55 04 08 13 07 4f 6e 74 61 72 |1.0...U....Ontar|  
 000000c0 69 6f 31 0f 30 0d 06 03 55 04 07 13 06 4f 74 74 |io1.0...U....Ott|  
 000000d0 61 77 61 31 16 30 14 06 03 55 04 0a 13 0d 45 6e |awa1.0...U....En|  
 000000e0 74 72 75 73 74 2c 20 49 6e 63 2e 31 28 30 26 06 |trust, Inc.1(0&.|  
 000000f0 03 55 04 03 13 1f 45 6e 74 72 75 73 74 20 54 69 |.U....Entrust Ti|  
 00000100 6d 65 20 53 74 61 6d 70 69 6e 67 20 41 75 74 68 |me Stamping Auth|  
 00000110 6f 72 69 74 79 a0 82 0a 23 30 82 05 08 30 82 03 |ority...#0...0..|  
 <...4KB Later...>
 00000e80 7a 1b a2 95 ec cc 1e d5 33 06 c3 69 61 6a a5 15 |z.......3..iaj..|  
 00000e90 53 0a                                           |S.|  
 00000e92  
 $   

Holy cow! Our 43 byte request provoked a ~4KB response! What is this thing?

Naturally, it's an ASN.1 document, just like everything else involving cryptography. OID 1.2.840.113549.1.7.2 indicates that it is PKCS#7 signed data (that crazy OID formatting again). The signed data includes:

  • The OID indicating a SHA1 hash and the hash result from our request both appear.
  • OID 2.16.840.114028.10.3.5 indicates the Entrust (their ORG ID is 114028) timestamp policy under which this result was issued. See RFC3161(2.1.5). It seems like there should be a document describing their practices somewhere, but I can't find it.
  • Next, we find that the signed document indicates that it was created at 2019-12-21 20:08:06.004 UTC (this is the whole point)
  • The value 01F4 indicates timestamp accuracy of 500ms. This optional field has the ability to represent seconds, milliseconds and microseconds, but this response only includes milliseconds.
  • The readable strings following the accuracy field are optional timestamp GeneralName data identifying the TSA.
  • The rest of the response (most of it, really) is PKCS#7 (signed data) overhead including the signer's signature and certificate chain.
So... Neat! We submitted a hash of some data to the TSA, and it replied with our hash, the time it saw it, and a verifiable signature. This definitely beats dropping hashes on twitter to prove you had some data at a particular time.
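
openssl will also decode and check the reply for us. Here's a rough sketch assuming the query and response have been saved to files, and that entrust_roots.pem holds the TSA's root certificate (you'd have to fetch that yourself):

 openssl ts -query -cert -no_nonce < data > request.tsq  
 curl --data-binary @request.tsq -o response.tsr http://timestamp.entrust.net/TSS/RFC3161sha2TS  
   
 # dump the response fields (hash, policy OID, genTime, accuracy) in readable form  
 openssl ts -reply -in response.tsr -text  
   
 # check the TSA's signature over our data  
 openssl ts -verify -data data -in response.tsr -CAfile entrust_roots.pem  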

One of the places that timestamps of this sort come up is with code signing: Unlike a live transaction (say, TLS handshake), it is possible for a code signing entity to back-date a software release. Because we generally want software builds to work forever once they're created/signed, how do you stop a signer from creating a back-dated, but apparently valid release after their signing certificate has expired or been compromised? The best answer we've got is to include a 3rd party who can be trusted to not fake the date, so software signatures might include one of these timestamps.

Saturday, December 14, 2019

Physically man-in-the-middling an IoT device with Linux Bridge

This is a quick writeup of how I did some analysis of an IoT device (The Thing) by physically inserting a Linux box into the network path between The Thing and the network service it consumed. The approach described here involves being physically close to the target system, but it should work equally well1 anywhere there's an Ethernet link along the path between The Thing and its server.


First, the topology: The Thing is attached to an Ethernet switch and is part of the 192.168.1.0/24 subnet. We'll be physically inserting ourselves into the path of the red cable in this diagram.

Initial setup

The first step is to get a dual-homed Linux box into the path. I used an Ubuntu 18.04 machine with the following netplan configuration:

 network:  
  version: 2  
  renderer: networkd  
  ethernets:  
   eth0:  
    dhcp4: no  
   eth1:  
    dhcp4: no  
  bridges:  
   br0:  
    addresses: [192.168.1.2/24]  
    gateway4: 192.168.1.1  
    interfaces:  
     - eth0  
     - eth1  


This configuration defines an internal software-based bridge for handling traffic between The Thing and the switch. Additionally, it creates an IP interface for the Linux host to communicate with neighbors attached to the bridge (everybody on 192.168.1.0/24.) The Thing's TCP connection to the server is uninterrupted, even with the MITM box cabled inline like this:

MITM box with software bridge deployment

Now traffic to and from The Thing flows through the Linux machine. Just... Like... Any other switch. Not super useful. Yet.
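
If you want to sanity-check the bridge before adding the interesting bits, something like this works on Ubuntu 18.04 (the bridge tool comes with iproute2; brctl would need the bridge-utils package):

 sudo netplan apply  
 bridge link show       # eth0 and eth1 should both show br0 as their master  
 ip addr show br0       # confirms 192.168.1.2/24 landed on the bridge  
 ping -c 3 192.168.1.1  # and that the gateway is reachable through it  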

We'll need some NAT rules:

 # The first rule is an ebtables (like iptables, but for bridged/L2 traffic)  
 # policy that rewrites frames containing northbound traffic. It's looking for:  
 #  frames arriving on eth0 (-i eth0)  
 #  frames containing IPv4 traffic (-p IPv4)  
 #  frames sourced from The Thing (-s <mac-of-The-Thing>)  
 #  frames containing packets destined for the server (--ip-destination <server-ip>)  
 #  
 #  Frames matching all of those requirements get their header rewritten  
 # (--to-destination <mac-of-br0>) for delivery to the local IP subsystem  
 # (this box) rather than to the gateway router.  

 ebtables -t nat -A PREROUTING -i eth0 -p IPv4 -s <mac-of-The-Thing> --ip-destination <server-ip> -j dnat --to-destination <mac-of-br0>  

 # The second rule is an iptables rule that rewrites northbound packets.  
 # It's looking for:  
 #  packets arriving on the br0 IP interface (due to previous rule's dMAC rewrite)  
 #  packets destined for the server's IP address  
 #  
 # Packets matching those requirements get their header rewritten so that they're  
 # directed to a local network service, rather than the intended server on the  
 # Internet.  

 iptables -t nat -A PREROUTING -i br0 -d <server-ip> -j DNAT --to-destination 192.168.1.2  

 # The final rule modifies southbound traffic created by the MITM system so that  
 # it appears to have come from the real server on the Internet. It's an iptables  
 # rule looking for:  
 #  packets leaving the br0 IP interface  
 #  packets destined for The Thing  
 #  
 # Packets matching those requirements get their header rewritten so that they  
 # appear to have been created by the real server on the Internet.  

 iptables -t nat -A POSTROUTING -o br0 -d 192.168.1.50 -j SNAT --to-source <server-ip>  


With the rules installed, the traffic situation looks like this:

 
NAT fools The Thing into talking (and listening) to the MITM


At this point, the NAT rules have broken the application because now, when the client tries to establish a connection to the server, it winds up talking to whatever's listening on the Linux box. Probably there's no listener there, so the client's [SYN] segment sent toward the server (and intercepted by the Linux box) provokes the MITM to respond with a [RST] segment.

We need to create a listener to accept connections from The Thing, a client to connect to the real server, and then stitch these two processes together to get the application running again.

If the client is sending HTTP traffic, we could use a proxy like burp/zap/fiddler to do that job. But what if it's not HTTP traffic? Or if we compulsively do things the hard way? The simplest proxy we can build here consists of back-to-back instances of netcat. For example, if the client is connecting to the server on TCP/443 we'd do:

 # Create a pipe for southbound data:  
   
 mkfifo /tmp/southbound-pipe  
   
 # Start two nc instances to perform MITM byte stream relay  
 # between The Thing and the real server:  
   
 nc -l 443 < /tmp/southbound-pipe | nc <server-ip> 443 > /tmp/southbound-pipe  

Here's how that CLI incantation works:

netcat and pipes and redirection, oh my!

So, rather than acting as an Ethernet bridge (layer 2), our MITM is now operating on the byte stream, somewhere around layer 5 (don't think too hard about this).

Can the client or server detect these shenanigans?
  • Both sides will believe they're talking to the usual IP address (client because of NAT trickery; server because all connections appear to come from gateway router).
  • The client will see impossibly fast TCP round-trip times, because the MITM is physically close. This will likely not be noticed.
  • Both sides will likely experience different incoming IP TTL values. Again, not likely noticed.
  • Finally, at the TCP layer it is likely that our MITM box will present different TCP behavior and options than the original client and server, but these will likely be interoperable and go unnoticed except via pcap analysis.

So, about that byte stream... What's in it anyway? Here's how to see inside:

 # save northbound and southbound data to files  
 nc -l 443 < /tmp/southbound-pipe | tee /tmp/client-data | nc <server-ip> 443 | tee /tmp/server-data > /tmp/southbound-pipe  
   
 # ...or...  
   
 # print northbound and southbound data to the terminal  
 nc -l 443 < /tmp/southbound-pipe | tee /dev/fd/2 | nc <server-ip> 443 | tee /dev/fd/2 > /tmp/southbound-pipe  


If the service is running on TCP/443 as in this example, we're probably going to be disappointed when we look at the intercepted data. Even though we've MITM'ed the TCP bytestream, the TLS session riding on it remains intact, so we're MITMing and relaying an encrypted byte stream.

We need to go deeper. If we have a certificate (and private key) trusted by the client device, we can do that by using openssl s_client and openssl s_server in place of nc:


 mkfifo /tmp/cleartext-pipe  
 openssl s_server -cert cert.pem -key key.pem -port 443 < /tmp/cleartext-pipe | tee /tmp/client-data | openssl s_client -connect <server-ip>:443 | tee /tmp/server-data > /tmp/cleartext-pipe  

Will the client or server notice now? Because we're terminating TLS, there's a whole new layer (keys, certificates, ciphers, etc...) where these shenanigans can be noticed and/or lead to problems.

Do you need to physically MITM things like this? Probably not. Launching an ARP poisoning attack would likely have led to the same result, but this approach is a little more reliable and definitely more interesting.


1 Subject to performance limitations of your Linux bridge, I guess. Don't go trying this on 100Gb/s links :)

Friday, February 15, 2019

SSH to all of the serial ports

This is just a quick-and-dirty script for logging into every serial port on an Opengear box, one in each tab of a MacOS terminal.

Used it just recently because I couldn't remember where a device console was connected.

Don't change mouse focus while it's running: It'll wind up dumping keystrokes into the wrong window.

for i in $(seq 48)
do
  port=$(expr 3000 + $i)
  sshcmd="ssh -p $port terminalserver"
  osascript \
    -e 'tell application "Terminal" to activate' \
    -e 'tell application "System Events" to tell process "Terminal" to keystroke "t" using command down' \
    -e "tell application \"System Events\" to tell process \"Terminal\" to keystroke \"$sshcmd\"" \
    -e "tell application \"System Events\" to tell process \"Terminal\" to key code 36"
done


Leaving it here in case somebody (probably me) finds it useful in the future.

Monday, January 21, 2019

Cannot connect the virtual device ... because no corresponding device is available on the host

Recently I've been building some VM templates on my MacBook and launching instances of them in VMware. Each time, it produced the following error:

Cannot connect the virtual device sata0:1 because no corresponding device is available on the host.


Either button caused the guest to boot up. The "No" button ensured that it booted without error on subsequent reboots, while choosing "Yes" allowed me to enjoy the error with each power-on of the guest.

Sata0 is, of course, a (virtual) disk controller, and device 1 is an optical drive. I knew that much, but the exact meaning of the error wasn't clear to me, and googling didn't lead to a great explanation.

I wasn't expecting there to be a "corresponding device ... available on the host" because the host has neither a SATA controller nor an optical drive, and no such hardware should be required for my use case, so, what did the error mean?

It turns out that I was producing the template (a .ova file) with the optical drive "connected" (VMware term) to ... something. The issue isn't related to the lack of a device on the host, but that there's no ISO file "inserted" into the virtual drive.

Here's the relevant stanza from the template's .ovf file:

      <Item>
        <rasd:AddressOnParent>1</rasd:AddressOnParent>
        <rasd:AutomaticAllocation>true</rasd:AutomaticAllocation>
        <rasd:Caption>cdrom1</rasd:Caption>
        <rasd:Description>CD-ROM Drive</rasd:Description>
        <rasd:ElementName>cdrom1</rasd:ElementName>
        <rasd:InstanceID>7</rasd:InstanceID>
        <rasd:Parent>5</rasd:Parent>
        <rasd:ResourceType>15</rasd:ResourceType>

      </Item>

The problem here is the AutomaticAllocation element. AutomaticAllocation is VMware's "connect at boot" feature, which really boils down to "insert a disk into this virtual drive". Without specifying a backend device and/or file, it can't be connected.

Deleting the element, or setting it to "false" fixes the problem.

AddressOnParent: 1 <- This makes it sata0:1 instead of sata0:0
InstanceID: 7 <- This device is the 8th hardware device defined for the guest
Parent: 5 <- This device is a child of the hardware device at instance id #5 (the sata0)
ResourceType: 15 <- This is an optical drive
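
To actually apply that fix by hand: an .ova is just a tar archive (with the .ovf required to be the first member), so a rough sketch looks like the following. The filenames are placeholders, and if the package carries a manifest (.mf) its digest of the .ovf has to be regenerated (SHA1 vs SHA256 depends on whatever built the template) or the import will complain:

 mkdir ova-work && tar -xvf template.ova -C ova-work && cd ova-work  
   
 # turn off connect-at-boot for the optical drive  
 sed -i.bak 's#<rasd:AutomaticAllocation>true</rasd:AutomaticAllocation>#<rasd:AutomaticAllocation>false</rasd:AutomaticAllocation>#' template.ovf  
   
 # recompute the .ovf digest and paste it into template.mf  
 shasum -a 256 template.ovf  
   
 # repack with the .ovf first  
 tar -cvf ../template-fixed.ova template.ovf template.mf template-disk1.vmdk  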

Saturday, November 11, 2017

Syslog relay with Scapy

I needed to point some syslog data at a new toy being evaluated by security folks.

Reconfiguring the logging sources to know about the new device would have been too much of a hassle for a quick test. Reconfiguring the Real Log Server (an rsyslog box) to relay the logs wasn't viable because the source IP in the syslog packets would have reflected the syslog box instead of the origin server.

A few lines of python running on the existing rsyslog box did the trick:

 #!/usr/bin/env python2.7  
   
 from scapy.all import *  
   
 def pkt_callback(pkt):  
   del pkt[Ether].src  
   del pkt[Ether].dst  
   del pkt[IP].chksum  
   del pkt[UDP].chksum  
   pkt[IP].dst = '192.168.100.100'  
   sendp(pkt)  
   
 sniff(iface='eth0', filter='udp port 514', prn=pkt_callback, store=0)  

This script has scapy collecting frames matching udp port 514 (libpcap filter) from interface eth0. Each matching packet is handed off to the pkt_callback function. It clears fields which need to be recalculated, changes the destination IP (to the address of the new Security Thing) and puts the packets back onto the wire.

The source IP on these forged packets is unchanged, so the Security Thing thinks it's getting the original logs from real servers/routers/switches/PDUs/weather stations/printers/etc... around the environment.
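
An easy way to confirm the relay is doing its job is to watch the copies leave the box (this assumes eth0 is also the egress interface; 192.168.100.100 is the new collector from the script above):

 sudo tcpdump -ni eth0 'udp port 514 and dst host 192.168.100.100'  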

I'd expected to need to filter out the packets that scapy is sending (don't listen to and re-send your own noise), but that doesn't seem to have been necessary.

Thursday, October 5, 2017

SSH HashKnownHosts File Format

The HashKnownHosts option to the OpenSSH client causes it to obfuscate the host field of the ~/.ssh/known_hosts file. Obfuscating this information makes it harder for threat actors (malware, border searches, etc...) to know which hosts you connect to via SSH.

Hashing defaults to off, but some platforms turn it on for you:

 chris:~$ grep Hash /etc/ssh/ssh_config   
   HashKnownHosts yes  
 chris:~$   

Here's an entry from my known_hosts file:

 |1|NWpzcOMkWUFWapbQ2ubC4NTpC9w=|ixkHdS+8OWezxVQvPLOHGi2Oawo= ecdsa-sha2-nistp256 AAAAE2Vj<...>ZHNLpyJsv  

There's one record per line, with the fields separated by spaces. The first field is the remote host (SSH server) identifier.

In this case, the leading characters |1| in the host identifier are the magic string (HASH_MAGIC). It tells us that the field is hashed, rather than a plaintext hostname (or address). The remaining characters in the field comprise two parts: a 160-bit salt (random string) and a 160-bit SHA1 hash result. Both values are base64 encoded.

The various OpenSSH binaries that use information in this file feed both the remote host's name (or address) and the salt to the hashing function in order to produce the hash result:


So, let's validate a host entry against this record the hard way. The entry above is for an IP address: 10.0.0.1.

 chris:~$ host="10.0.0.1"  
 chris:~$ salt_from_file="NWpzcOMkWUFWapbQ2ubC4NTpC9w="  
 chris:~$ salt_hexdump=$(echo $salt_from_file | base64 --decode | xxd -p)  
 chris:~$ echo -n $host | openssl sha1 -binary -mac HMAC -macopt hexkey:$salt_hexdump | base64  
 ixkHdS+8OWezxVQvPLOHGi2Oawo=  
 chris:~$   

The resulting string (ixkHdS+8OWezxVQvPLOHGi2Oawo=) is the base64 encoded hash result produced by inputting our host IP and the salt we found in the file. It's the same string that we saw in the known_hosts entry, so we know that this entry is for the host 10.0.0.1.

When adding a new record to known_hosts, the salt is a random value invented on the spot. The hash is calculated and the salt, hash and key details are written to the file.

When trying to find a record in an existing known_hosts file, the SSH program can't pick the right line directly. Instead it has to take the hostname (address) it's looking for, and compute the hash using the salt found on each line. When (if) it finds a match, then that's the line it was looking for. SHA1 happens pretty fast on modern hardware, but depending on your use case, this may be a bunch of wasted effort, particularly on systems where there's no point in obfuscating the list of SSH servers to which we connect.
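
Of course, ssh-keygen will do that search (and the hashing) for you. These two are handy when poking at a hashed known_hosts file:

 # find the line(s) matching a given host, hashed or not  
 ssh-keygen -F 10.0.0.1  
   
 # hash an existing plaintext known_hosts in place (the original is saved as known_hosts.old)  
 ssh-keygen -H  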



These folks drew the cocktail shaker.

Tuesday, September 26, 2017

Pluribus Networks... Wait, where are we again?


I was privileged to visit Pluribus Networks as a delegate at Network Field Day 16 a couple of weeks ago. Somebody else paid for the trip. Details here.

Much has changed at Pluribus, I hardly recognized the place!

I quite like Pluribus (their use of Solaris under their Netvisor switching OS got me right in the feels early on) so I'm happy to report that most of what's new looks like changes for the better.

When we arrived at Pluribus HQ we were greeted by some new faces, a new logo, color scheme... Even new accent lighting in the demo area!

Gone also are the Server Switches with their monstrous control planes (though still listed on the website, they weren't mentioned in the presentation), Solaris, and a partnership with Supermicro.

In their place we found:

  • The new logo and colors
  • New faces in management and marketing
  • Netvisor running on Linux
  • Whitebox and OCP-friendly switches
  • A partnership with D-Link
  • Some Netvisor improvements

Linux

This was probably inevitable, and likely matters little to Netvisor users. When Pluribus was first getting off the ground, I was waiting for an OpenSolaris release that never happened. That Pluribus stuck with Solaris for as long as they did while Oracle was dismantling the Solaris ecosystem is kind of incredible. Netvisor on Linux is fine, I'm sure.

Switch Hardware

One of Pluribus' claims to fame was their "server switches". These were normal switches using merchant switching silicon (from 2 or 3 different vendors, if I recall... I think Netvisor has a hardware abstraction layer which allows them to switch easily between Broadcom/Intel/Mellanox ASICs), but with enormous control planes sporting lots of cores, lots of RAM, lots of storage, dedicated network processors, etc...

The big switches opened the door to some interesting possibilities, but likely made a tough sell to customers that just wanted an IP network fabric. Which is probably most customers.

These days Pluribus is selling vanilla-looking Open Compute-friendly switches with ONIE, and supporting Netvisor on a handful of 3rd party whitebox platforms.

That D-Link Partnership

Okay, quit laughing. The D-Link switch in question is Trident II based, just like (almost) every other switch in the market. If D-Link helps Pluribus move product, then I'm delighted for all involved. The only thing I don't like about the DXS-5000-54S is that it lacks an RS-232 port. USB console? Ugh. I'll run my Netvisors on something with a proper management interface, thankyouverymuch.

Netvisor

Netvisor still looks pretty great! Some standout features:
  • Netvisor uses standard protocols to interact with neighboring devices, but you manage a Netvisor fabric as a single device.
  • It's still got fantastic telemetry and flow analytics capabilities, even without the monstrous control plane. Some slightly outrageous claims were made in this area toward the end of the presentation, but we didn't have time to dig in.
  • Individual nodes are managed in-band (via the front-panel interfaces, rather than the management LAN port). Incredibly this capability is not universal in this product space. Some platforms rely on the lone management Ethernet interface for fabric control purposes. This fact blows my mind. I'm similarly surprised that whitebox switches don't tend to come with redundant control plane paths. Maybe there's a single "eth0" port baked into the Trident chip for this purpose?
  • Routing is performed by an anycast gateway. That is, moving packets from one broadcast domain to another does not require them to be hauled to a certain point in the fabric. Any Netvisor switch (the nearest Netvisor switch) will do the job. This is a welcome change.
  • Members of a Netvisor fabric don't need to be cabled to one another. This opens the door to using Netvisor only at the leaf tier in a leaf/spine fabric... Or only at the spine... Or at both layers as a single large fabric... Or at both layers, but as two fabrics (one for leaf, one for spine)... Or as smaller deployment units in a huge fabric. Lots of possibilities here.

Saturday, September 23, 2017

KEMP Presented Some Interesting Features at NFD16

KEMP Technologies presented at Network Field Day 16, where I was privileged to be a delegate. Who paid for what? Answers here.


Three facets of the KEMP presentation stood out to me:


The KEMP Management UI Can Manage Non-KEMP Devices

KEMP's centralized management UI, the KEMP 360 Controller, can manage/monitor other load balancers (ahem, Application Delivery Controllers) including AWS ELB, HAProxy, NGINX and F5 BIG-IP.

This is pretty clever: If KEMP gets into an enterprise, perhaps because it's dipping a toe into the cloud at Azure, they may manage to worm their way deeper than would otherwise have been possible. Nice work, KEMPers.

VS Motion Can Streamline Manual Deployment Workflows

KEMP's VS Motion feature allows easy service migrations between KEMP instances by copying service definitions from one box to another. It's probably appropriate when replicating services between production instances and when promoting configurations between dev/test/prod. The mechanism is described in some detail here:


The interface is pretty straightforward. It looks just like the balance transfer UI at my bank: Select the From instance, the To instance, what you want transferred (which virtual service) and then hit the Move button. The interface also sports a Copy button, so in that regard it's better than the UI at the bank. I look forward to the bank allowing me to replicate funds between accounts in the future :)

I think it struck all of the Network Field Day delegates that this feature is primarily useful for manual workflows. An automated workflow wouldn't need an "easy button" like the VS Motion feature. Unfortunately there wasn't enough time to get into KEMP's Automation/API capabilities during the presentation, but Keith Miller was tuned into the live stream and reported that the API is a pleasure to use.

From a later update from Keith: it's disappointing to read that the API doesn't return structured data.

VS Motion does not, as I understand it, have the ability to copy TLS certificates around right now, but  the feature request is in.

That Strange License

Frankly, this topic from the NFD16 presentation doesn't make much sense to me.

When you're buying boxes, or even virtual capabilities that are licensed by a bandwidth cap, you're going to have paid-for-but-wasted capacity during off-peak times. KEMP has introduced a consumption-based model to work around that problem: Pay only for what you use!

It sounds great, especially with the popularity of virtual services. When talking about physical boxes, it makes sense that you'd have to pay for any overcapacity you may have provisioned: There's the box, 95% idle, waiting for that peak traffic day... Full of expensive processors and RAM... Oh, and there's the failover box, at 100% idle... You probably didn't expect to get the hardware for free, right?

The situation feels different when we're talking about virtual appliances: How much would you expect to pay for a virtual standby server? One which, if everything goes according to plan, will never see a request from a live client? You're already paying somebody else (the server vendor or IaaS provider) for the hardware, so paying KEMP based on usage seems ideal.

But they've created an altogether new problem: KEMP's consumption based license model finds the peak throughput (at 5 minute intervals) of each participating node, then adds them up to calculate the monthly bill.

Let's imagine that your organization has a rock-steady 1Gb/s flow rate through an active/standby pair of KEMP boxes, plus a DR facility somewhere.

Every month you pay for 1Gb/s of usage.

Then one day the active unit fails, load switches to the standby unit. Several hours later, you shift workload to the DR site while performing maintenance to restore the failed hardware in the main site.

Take the peak throughput from each KEMP unit: Active (failed), Standby (now active) and DR have each hit 1Gb/s. That month you'll pay for 3Gb/s, even though the workload never changed. You just moved it around.

It seems like anybody with any degree of workload mobility will be overpaying with this model, unless the per-bandwidth price is also quite low.

I'd be much more comfortable paying per byte, per TLS setup or per load-balanced request. The sum-of-peaks model seems too unpredictable to me.

Thursday, August 31, 2017

Using FQDN for DMVPN hubs

I've done some testing with specifying DMVPN hubs (NHRP servers, really) using their DNS name, rather than IP address.

This matters to me because of some goofy environments where spoke routers can't predict what network they'll be on (possibly something other than internet), and where I can't leverage multiple hubs per tunnel due to a control plane scaling issue.

The DNS-based configuration includes the following:

 interface Tunnel1  
  ip nhrp nhs dynamic nbma dmvpn-pool.fragmentationneeded.net  

There's no longer a requirement for any ip nhrp map or ip nhrp nhs x.x.x.x configuration when using this new capability.

My testing included some tunnels with very short ISAKMP and IPSec re-key intervals. I found that the routers performed the DNS resolution just once. They didn't go back to DNS again for as long as the hub was reachable.

Spoke routers which failed to establish a secure connection for whatever reason would re-resolve the hub address each time the DNS response expired its TTL. But once they succeeded in connecting, I observed no further DNS traffic for as long as the tunnel survived.

The record I published (dmvpn-pool.fragmentationneeded.net above) includes multiple A records. The DNS server randomizes the record order in its responses and spoke routers always connected to the first address on the list.
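
A couple of back-to-back queries show the effect. The addresses below are made-up documentation-range values; the point is that the order rotates from one response to the next, and each spoke simply takes the first one:

 $ dig +short dmvpn-pool.fragmentationneeded.net  
 192.0.2.11  
 192.0.2.12  
 192.0.2.13  
 $ dig +short dmvpn-pool.fragmentationneeded.net  
 192.0.2.13  
 192.0.2.11  
 192.0.2.12  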

The random-ordered DNS response makes for a kind of nifty load balancing and failover capability:

  1. The spokes will naturally balance across the population of hubs, depending on the whim of the DNS server
  2. I don't strictly need a smart (GSLB style) DNS server to effect failover, because spokes will eventually find their way to a working hub, even with bad records in the list.


With 3 hub routers, the following happens when one fails:

  • At T=0, 67% of the routers remain connected.
  • At T=<keepalive>s, 89% of routers are connected (2/3 of the orphans are back online. The others are trying the dead hub again).
  • At T=TTLx1, 96% of routers are connected (1/3 of the orphans from the previous interval tried the dead hub a second time)
  • At T=TTLx2, 99% of routers are back online
Things recover fairly quickly with short TTL intervals, even without a GSLB because the spokes keep trying, and only need to find a working record once. This DMVPN tunnel isn't the only path in my environment, so a couple of minutes outage is acceptable.


A 60 second TTL will result in ~40K queries/month for each spoke that can't connect (problems with firewall, overload NAT, credentials, etc...), so watch out for that if you're using a service that causes you to pay per query :)

Wednesday, August 30, 2017

Small Site Multihoming with DHCP and Direct Internet Access

Cisco recently (15.6.3M2) resolved CSCve61996, which makes it possible to fail internet access back and forth between two DHCP-managed interfaces in two different front-door VRFs attached to consumer-grade internet service.

Prior to the IOS fix there was a lot of weirdness with route configuration on DHCP interfaces assigned to VRFs.

I'm using a C891F-K9 for this example. The WAN interfaces are Gi8 and Fa0. They're in F-VRFs named ISP_A and ISP_B respectively:


First, create the F-VRFs and configure the interfaces:

 ip vrf ISP_A  
 ip vrf ISP_B  
   
 interface GigabitEthernet8  
  ip vrf forwarding ISP_A  
  ip dhcp client default-router distance 10  
  ip address dhcp  
 interface FastEthernet0  
  ip vrf forwarding ISP_B  
  ip dhcp client default-router distance 20  
  ip address dhcp  

The distance commands above assign the AD of the DHCP-assigned default route. Without these directives the distance would be 254 in each VRF. They're modified here because we'll be using the distance to select the preferred internet path when both ISPs are available.

Next, let's keep track of whether or not the internet is working via each provider. In this case I'm pinging 8.8.8.8 via both paths, but this health check can be whatever makes sense for your situation. So, a couple of IP SLA monitors and track objects are in order:

 ip sla 1  
  icmp-echo 8.8.8.8  
  vrf ISP_A  
  threshold 500  
  timeout 1000  
  frequency 1  
 ip sla schedule 1 life forever start-time now  
 track 1 ip sla 1  
   
 ip sla 2  
  icmp-echo 8.8.8.8  
  vrf ISP_B  
  threshold 500  
  timeout 1000  
  frequency 1  
 ip sla schedule 2 life forever start-time now  
 track 2 ip sla 2  

Ultimately we'll be withdrawing the default route from each VRF when we determine that the internet has failed. This introduces a problem: With the default route missing, the SLA target will be unreachable. The SLA (and track) will never recover, so the default route will never be restored. So first let's add a static route to our SLA target in each VRF. The default route will get withdrawn, but the host route for the SLA target will persist in each VRF.

 ip route vrf ISP_A 8.8.8.8 255.255.255.255 dhcp 50  
 ip route vrf ISP_B 8.8.8.8 255.255.255.255 dhcp 60  

We used the dhcp keyword as a stand-in for the next-hop IP address. We could have just specified the interface, but specifying a multiaccess interface without a neighbor ID is an ugly practice and assumes that proxy ARP is available from neighboring devices. Not a safe assumption.

Finally, we can set the default route to be withdrawn when the track object goes down:

 interface GigabitEthernet8  
  ip dhcp client route track 1  
   
 interface FastEthernet0  
  ip dhcp client route track 2  

At this point, when everything is healthy, the routing table for ISP_A looks something like this:

 S*  0.0.0.0/0 [10/0] via 192.168.1.126  
    8.0.0.0/32 is subnetted, 1 subnets  
 S    8.8.8.8 [50/0] via 192.168.1.126  
    192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks  
 C    192.168.1.64/26 is directly connected, GigabitEthernet8  
 L    192.168.1.67/32 is directly connected, GigabitEthernet8  

The table for ISP_B looks similar, but with different Administrative Distances. On failure of the SLA/track the default route gets withdrawn but the 8.8.8.8/32 route persists. That looks like this:

    8.0.0.0/32 is subnetted, 1 subnets  
 S    8.8.8.8 [50/0] via 192.168.1.126  
    192.168.1.0/24 is variably subnetted, 2 subnets, 2 masks  
 C    192.168.1.64/26 is directly connected, GigabitEthernet8  
 L    192.168.1.67/32 is directly connected, GigabitEthernet8  

When the ISP is healed, the 8.8.8.8/32 ensures that we'll notice, the SLA will recover, and the default route will be restored.

Okay, now it's time to think about leaking these ISP_A and ISP_B routes into the global routing table (GRT). First, we need an interface in the GRT for use by directly connected clients:

 interface Vlan10  
  ip address 10.10.10.1 255.255.255.0  

And now the leaking configuration:

 ip prefix-list PL_DEFAULT_ONLY permit 0.0.0.0/0  
   
 route-map RM_IMPORT_TO_GRT permit  
  match ip address prefix-list PL_DEFAULT_ONLY  
   
 global-address-family ipv4  
  route-replicate from vrf ISP_A unicast static route-map RM_IMPORT_TO_GRT  
  route-replicate from vrf ISP_B unicast static route-map RM_IMPORT_TO_GRT  

The configuration above leaks only the default route from each F-VRF. The GRT will be offered both routes and will make its selection based on the AD we configured earlier (values 10 and 20).

Here's the GRT with everything working:

 S* + 0.0.0.0/0 [10/0] via 192.168.1.126 (ISP_A)  
    10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks  
 C    10.10.10.0/24 is directly connected, Vlan10  
 L    10.10.10.1/32 is directly connected, Vlan10  

When the ISP_A path fails, the GRT fails over to the higher distance route via ISP_B:

 S* + 0.0.0.0/0 [20/0] via 192.168.1.62 (ISP_B)  
    10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks  
 C    10.10.10.0/24 is directly connected, Vlan10  
 L    10.10.10.1/32 is directly connected, Vlan10  

Strictly speaking, it's not necessary to have the SLA monitor, track object and conditional routing in VRF ISP_B. All of those things could be omitted and the GRT would still fail back and forth between the different F-VRFs based only on the tests in "A". But I like the symmetry.

Okay, so now that we've got the GRT's default route flopping back and forth between these two front-door VRFs, we'll need some NAT. First, enable NVI mode on each interface in the transit path:

 interface GigabitEthernet8   
  ip nat enable  
 interface FastEthernet0  
  ip nat enable  
 interface Vlan10  
  ip nat enable  

Next we'll spell out exactly what's going to get NATted. I like to use route-maps rather than ACLs because the templating is easier when we're matching interfaces rather than ip prefixes:

 route-map RM_NAT->ISP_A permit 10  
  match interface GigabitEthernet8  
   
 route-map RM_NAT->ISP_B permit 10  
  match interface FastEthernet0  
   
 ip nat source route-map RM_NAT->ISP_A interface GigabitEthernet8 overload  
 ip nat source route-map RM_NAT->ISP_B interface FastEthernet0 overload  

That's basically it. The last thing that might prove useful is to automate purging of NAT translation tables when switching between providers. TCP flows can't survive the ISP switchover, and clearing the NAT translations for active flows should make them fail faster than they might have otherwise.

Saturday, June 17, 2017

Serial Pinout for APC

This is just a quick note to remind me how to make serial cables for APC power strips. This cable works between an APC AP8941 and an Opengear terminal server with Cisco-friendly (-X2 in Opengear nomenclature) pinout.


Only pins 3,4 and 6 are populated on the 8P8C end. It probably doesn't matter whether the ground pin (black) lands on pin 4 or 5 because both should be ground on the Opengear end. The yellow wire is unused.

Tuesday, March 21, 2017

Cisco: Not Serious About Network Programmability

"You can't fool me, there ain't no sanity clause!"
Cisco isn't known for providing easy programmatic access to their device configurations, but has recently made some significant strides in this regard.

The REST API plugin for newer ASA hardware is an example of that. It works fairly well, supports a broad swath of device features, is beautifully documented and has an awesome interactive test/dev dashboard. The dashboard even has the ability to spit out example code (java, javascript, python) based on your point/click interaction with it.

It's really slick.

But I Can't Trust It

Here's the problem: It's an un-versioned REST API, and the maintainers don't hesitate to change its behavior between minor releases. Here's what's different between 1.3(2) and 1.3(2)-100:

New Features in ASA REST API 1.3(2)-100

Released: February 16, 2017
As a result of the fix for CSCvb21388, the response type of /api/certificate/details was changed from the CertificateDetails object to a list of CertificateDetails. Scripts utilizing this API will need to be modified accordingly.

So, any code based on earlier documentation is now broken when it calls /api/certificate/details.

This Shouldn't Happen

Don't take my word for it:



Remember that an API is a published contract between a Server and a Consumer. If you make changes to the Servers API and these changes break backwards compatibility, you will break things for your Consumer and they will resent you for it. 




It Gets Worse

Not only does the API fail to provide consistently formatted responses, it doesn't even provide a way to discover its version. Cisco advised me to scrape the 'show version' CLI output in order to divine the correct way to parse the API's responses. Whenever they decide to change things.

The irony of having to abandon the API for screen scraping in order to improve API compatibility is almost too much to bear. Let's assume for the moment that I'm willing to do it. Will the regex that finds the API version today still work on tomorrow's release? Do I even know how to parse the version numbers?

What's the version number of the current release anyway?

  • 1.3(2)-100 (according to the release notes above)
  • 1.3.2.100 (according to show version CLI output)
  • 1.3.2 (according to the 'release:' field on the download page)
This does not look like a road I'm going to enjoy traveling.

Would You Use This API?

When I inquired about version-to-version incompatibilities, Cisco's initial response was:
"This definitely shouldn't be happening."
Followed by:
"We are aware of the limitations resulting for not having versioned ASA REST API releases. And as of now there are no plans for us to fix this."
 Further followed by:
"we will update the documentation to reflect the correct behavior, once we post this fix to CCO."
So hey, no problem right? We might sneak breaking changes into the smallest of maintenance releases, but at least we'll document it! Have fun selling and supporting your application!

Clearly I am one of the angry and resentful customers predicted by the articles quoted above :)

Friday, March 17, 2017

Epoch Rollover: Coming Two Years Early To A Router Near You!

The 2038 Problem

Broken Time? -  Roeland van der Hoorn
Many computer systems and applications keep track of time by counting the seconds from "the epoch", an arbitrary date. Epoch for UNIX-based systems is the stroke of midnight in Greenwich on 1 January 1970.

Lots of application functions and system libraries keep track of the time using a 32-bit signed integer, which has a maximum value of around 2.1 billion. It's good for a bit more than 68 years worth of seconds.

Things are likely to get weird 2.1 billion seconds after the epoch on January 19th, 2038.

As the binary counter rolls over from 01111111111111111111111111111111 to 10000000000000000000000000000000, the sign bit gets flipped. The counter will have changed from its farthest reach after the epoch to its farthest reach before the epoch. Time will appear to have jumped from early 2038 to late 1901.

Things might even get weird within the next year (January 2018!) as systems begin to encounter freshly minted CA certificates with expirations after the epoch rollover (it's common for CA certificates to last for 20 years). These certificates may appear to have expired in late 1901, over a century prior to their creation.

NTP's 2036 Problem

NTP has a similar, but not-quite-the-same epoch problem. It keeps track of seconds in an unsigned 32-bit value, so it can count twice as high as the problematic UNIX counter (yay!) but NTP's epoch is set 70 years earlier: 1 January 1900 (boo!) The result is that NTP's counter will roll over about 2 years before the UNIX counter.

Practically speaking, NTP's going to be fine for reasons having to do with it being primarily concerned about small offsets in relative time, and it only having to be within 68 years of correct on startup in order to sync up with an authoritative time source.
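
If you want to see both rollover moments, GNU date will do the arithmetic (BSD/macOS date doesn't accept this -d syntax): add each counter's capacity to its epoch and you get the instant it wraps.

 $ date -u -d '1900-01-01 00:00:00 UTC + 4294967296 seconds'   # NTP: unsigned 32-bit  
 Thu Feb  7 06:28:16 UTC 2036  
 $ date -u -d '1970-01-01 00:00:00 UTC + 2147483648 seconds'   # UNIX: signed 32-bit  
 Tue Jan 19 03:14:08 UTC 2038  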

So What's Up With This Router?

Here's a weird thing I stumbled across recently. Time calculations with dates in 2036 are going wrong but they're unrelated to NTP:

 router#show crypto pki certificates test-1 
 CA Certificate  
  Status: Available  
  Certificate Serial Number (hex): 14  
  Certificate Usage: Signature  
  Issuer:   
   cn=test-2  
  Subject:   
   cn=test-2  
  Validity Date:   
   start date: 02:38:26 UTC Mar 17 2017  
   end  date: 00:00:00 UTC Jan 1 1900  
  Associated Trustpoints: test-1   

But this one looks okay:

 router#show crypto pki certificates test-2
 CA Certificate  
  Status: Available  
  Certificate Serial Number (hex): 12  
  Certificate Usage: Signature  
  Issuer:   
   cn=test-1  
  Subject:   
   cn=test-1  
  Validity Date:   
   start date: 02:37:31 UTC Mar 17 2017  
   end  date: 06:28:15 UTC Feb 7 2036  
  Associated Trustpoints: test-2   

The real expiration dates of these certificates are just one second apart:

$ openssl x509 -in test-1.crt -noout -enddate
notAfter=Feb 7 06:28:16 2036 GMT
$ openssl x509 -in test-2.crt -noout -enddate
notAfter=Feb 7 06:28:15 2036 GMT
So... That's unfortunate.

Here's the actual certificate data and import procedure used for this experiment in case you feel inclined to test:

 crypto pki trustpoint test-1  
  enrollment terminal  
 crypto pki authenticate test-1  
 -----BEGIN CERTIFICATE-----  
 MIIBeDCCASKgAwIBAgIBFDANBgkqhkiG9w0BAQUFADARMQ8wDQYDVQQDDAZ0ZXN0  
 LTIwIBcNMTcwMzE3MDIzODI2WhgPMjAzNjAyMDcwNjI4MTZaMBExDzANBgNVBAMM  
 BnRlc3QtMjBcMA0GCSqGSIb3DQEBAQUAA0sAMEgCQQDUjEccGNjjtv8lKNnvGpta  
 Z4x8LB82D2JJwTcvA5blUI2nr4vF41RqG0ifZ+Qtyqo+ntSD2QzDu3LKdSUw46if  
 AgMBAAGjYzBhMA4GA1UdDwEB/wQEAwIBBjAPBgNVHRMBAf8EBTADAQH/MB0GA1Ud  
 DgQWBBQ2NpEF0FG/g3ryNgU7Skjbm4IGHTAfBgNVHSMEGDAWgBQ2NpEF0FG/g3ry  
 NgU7Skjbm4IGHTANBgkqhkiG9w0BAQUFAANBAIVyT+iBimH7c/jtBrFGmKq+7YdM  
 eMwf9I/En/TAUqtte7QGLNRyTgBJvGgN/uc0KUjlZ5D6G/kxTwDtzse2Uow=  
 -----END CERTIFICATE-----  
 quit  
   
 crypto pki trustpoint test-2  
  enrollment terminal  
 crypto pki authenticate test-2  
 -----BEGIN CERTIFICATE-----  
 MIIBeDCCASKgAwIBAgIBEjANBgkqhkiG9w0BAQUFADARMQ8wDQYDVQQDDAZ0ZXN0  
 LTEwIBcNMTcwMzE3MDIzNzMxWhgPMjAzNjAyMDcwNjI4MTVaMBExDzANBgNVBAMM  
 BnRlc3QtMTBcMA0GCSqGSIb3DQEBAQUAA0sAMEgCQQDUjEccGNjjtv8lKNnvGpta  
 Z4x8LB82D2JJwTcvA5blUI2nr4vF41RqG0ifZ+Qtyqo+ntSD2QzDu3LKdSUw46if  
 AgMBAAGjYzBhMA4GA1UdDwEB/wQEAwIBBjAPBgNVHRMBAf8EBTADAQH/MB0GA1Ud  
 DgQWBBQ2NpEF0FG/g3ryNgU7Skjbm4IGHTAfBgNVHSMEGDAWgBQ2NpEF0FG/g3ry  
 NgU7Skjbm4IGHTANBgkqhkiG9w0BAQUFAANBAIjboo8wtehMpOReLw01tW8MLYzl  
 rtpwYVGoHCVVpXU+s7YQtfR1pt5ZVHZ8OVeP8SoTtoS+5k97aWgBZ+hu8/M=  
 -----END CERTIFICATE-----  
 quit  

Wednesday, February 1, 2017

Docker's namespaces - See them in CentOS

In the Docker Networking Cookbook (I got my copy directly from Packt Publishing), Jon Langemak explains why the iproute2 utilities can't see Docker's network namespaces: Docker creates its namespace objects in /var/run/docker/netns, but iproute2 expects to find them in /var/run/netns.

Creating a symlink from /var/run/docker/netns to /var/run/netns is the obvious solution:

 $ sudo ls -l /var/run/docker/netns  
 total 0  
 -r--r--r--. 1 root root 0 Feb 1 11:16 1-6ledhvw0x2  
 -r--r--r--. 1 root root 0 Feb 1 11:16 ingress_sbox  
 $ sudo ip netns list  
 $ sudo ln -s /var/run/docker/netns /var/run/netns  
 $ sudo ip netns list  
 1-6ledhvw0x2 (id: 0)  
 ingress_sbox (id: 1)  
 $  

But there's a problem. Look where this stuff is mounted:

 $ ls -l /var/run  
 lrwxrwxrwx. 1 root root 6 Jan 26 20:22 /var/run -> ../run  
 $ df -k /run  
 Filesystem   1K-blocks Used Available Use% Mounted on  
 tmpfs      16381984 16692 16365292  1% /run  
 $   

The symlink won't survive a reboot because it lives in a memory-backed filesystem. My first instinct was to have a boot script (say /etc/rc.d/rc.local) create the symlink, but there's a much better way.

Fine, I'm starting to like systemd

Systemd's tmpfiles.d is a really elegant way of handling touch files, symlinks, empty directories, device nodes, pipes and whatnot which live in volatile filesystems. The feature works from these directories:
  • /etc/tmpfiles.d
  • /run/tmpfiles.d
  • /usr/lib/tmpfiles.d
When the directives found in these directories contradict one another, the instance I've listed earlier wins. This allows an administrator to override package declarations in /usr/lib/tmpfiles.d by creating an entry in /etc/tmpfiles.d. Conflicts between files are resolved by the order of their appearance in a lexical sort.

So, what goes in these directories? Files named <whatever>.conf. Each line in these files controls creation of a file / folder / symlink / etc... There are switches and options to control ownership, permissions, overwrite condition, contents, and so forth.

Here's the file that causes systemd to create my symlink on every boot:

 $ cat /etc/tmpfiles.d/netns.conf   
 L /run/netns - - - - ./docker/netns  
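
No reboot needed to try it out; systemd-tmpfiles can be told to process that one config immediately:

 $ sudo systemd-tmpfiles --create /etc/tmpfiles.d/netns.conf  
 $ ls -l /run/netns        # should now be a symlink to ./docker/netns  
 $ sudo ip netns list      # and iproute2 can see Docker's namespaces again  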

I'm still not quite ready to forgive systemd for taking away the udev network interface naming persistency stuff and replacing it with something that's useless in virtual machines (this helps). But I'm getting there.

Lately I've been really liking each new facet of systemd as I've discovered it.