Tuesday, September 28, 2010

Layer 2 Traceroute

Cisco switches have a nifty but little-used diagnostic feature: the 'traceroute mac' command.

It does pretty much what you'd expect: it traces the L2 path between two endpoints.  Exactly how it accomplishes this feat is much less obvious.  Normal (Layer 3) traceroute sends probes with progressively larger TTL values, and uses the ICMP "time to live exceeded" errors from routers along the path to print the path between two nodes.  These mechanisms don't exist in a bridged environment.  So how does it work?

Consider the following topology:

Running traceroute mac on the rightmost switch produces the following result:
Cat2960#traceroute mac 000b.5f73.0491 000b.5f73.0480 vlan 11
Source 000b.5f73.0491 found on Cat2960
1 Cat2960 ( : Gi0/20 => Po1
2 Cat3550 ( : Po1 => Fa0/20
3 Cat2950 ( : Fa0/1 => Fa0/12
Destination 000b.5f73.0480 found on Cat2950
Layer 2 trace completed

The procedure for doing this work manually is straightforward.  We look for each MAC address in the VLAN 11 forwarding table, then check to see whether the egress port has a CDP neighbor.  If so, log into the next switch (using the management address reported by CDP).  Lather, rinse, repeat.
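The manual procedure amounts to a simple walk over two lookup tables.  Here's a toy Python model of that walk -- the tables are hypothetical data mirroring the example topology, not anything a real switch exports:

```python
# Per-switch MAC forwarding tables and CDP neighbor maps (made-up data
# matching the example: 2960 -- 3550 -- 2950, destination on the 2950).
fdb = {  # switch -> {mac: egress port}
    "Cat2960": {"000b.5f73.0480": "Po1"},
    "Cat3550": {"000b.5f73.0480": "Fa0/20"},
    "Cat2950": {"000b.5f73.0480": "Fa0/12"},
}
cdp = {  # (switch, egress port) -> CDP neighbor, if any
    ("Cat2960", "Po1"): "Cat3550",
    ("Cat3550", "Fa0/20"): "Cat2950",
}

def trace_mac(start_switch, mac):
    """Follow the forwarding table hop by hop until the egress port
    has no CDP neighbor (i.e. the station is directly attached)."""
    path, switch = [], start_switch
    while True:
        port = fdb[switch][mac]
        path.append((switch, port))
        nxt = cdp.get((switch, port))
        if nxt is None:          # no neighbor: end of the L2 path
            return path
        switch = nxt

print(trace_mac("Cat2960", "000b.5f73.0480"))
```

Lather, rinse, repeat -- the loop above is exactly the "look up the MAC, check the egress port for a CDP neighbor, hop to the next switch" cycle, minus the logging in.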

The manual procedure works on each MAC address independently.  The 'traceroute mac' command, however, requires you to specify both source and destination stations.  It's curious: L3 traceroute doesn't do that; it assumes you want to trace the path from here to somewhere else.  'traceroute mac', on the other hand, can trace from somewhere else to somewhere else.  Here we run a trace from a switch that isn't in the transit path between the two end stations.
Cat2950#traceroute mac  0800.870e.b6b1 000b.5f73.0491 vlan 11
Source 0800.870e.b6b1 found on Cat2960
1 Cat2960 ( : Gi0/10 => Gi0/20
Destination 000b.5f73.0491 found on Cat2960
Layer 2 trace completed
In that case, we were logged into the 2950, but traced the L2 path between two stations that were each connected directly to the 2960.  Two L2 hops away.  The operation of this tool is somewhat mysterious.  Here's what I can tell about it so far:
  • Both MAC addresses must appear in the forwarding table on each switch in the path.
  • Each switch uses CDP to figure out what next hop lies beyond its local egress port (and the IP address on which that next hop can be reached).
  • It's an L3 process.  The rightmost switch in the figure above is in a different management subnet than its neighbors.  These switches all forward traffic directly at L2, but they talk amongst themselves via the router-on-a-stick.  This is different from the operation of CDP (a link-layer protocol).
  • If more than one CDP neighbor appears on an interface in the path (several switches hanging from a hub), the process blows up because it's impossible to discern the next hop.  You can replicate this easily by changing the hostname of a switch.  The neighbors will see two entries until the old name times out.
  • The switch running the trace communicates directly with each device in the path.  The first example involved:
    • An exchange between the 2960 and the 3550
    • An exchange between the 2960 and the 2950
  • The process requires a service running on UDP 2228 on each switch (see it with 'show ip sockets')
  • If there's no CDP neighbor on a port (like when I switched CDP off on the 2950), then that's the end of the trace.
The wire format is undocumented as far as I can tell.  The Wireshark wiki page for CDP mentions the protocol, but doesn't have any information about it.  It's not CDP, but it's similar.  Each packet seems to have some fixed fields, plus some TLV sets.

Here's the payload breakdown of a query packet:
00-01: 02 01  (unknown)
02-03: Length (same as UDP payload length)
04-05: 05 02 (unknown - everything else looks like a TLV set)
07-12: Source MAC address

It includes the following TLV sets:
01 Destination MAC (always 8 bytes)
03 VLAN ID (always 4 bytes)
0E Originator CDP management IP (always 6 bytes)
10 CDP name info source (I learned about you from - variable length)
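To make the layout concrete, here's a speculative Python decoder for the query format described above.  The TLV structure (one type byte, one length byte counting the whole TLV, then the value) is my guess based on the observed "always N bytes" sizes -- treat this as a sketch, not a spec:

```python
import struct

def parse_query(payload: bytes):
    """Decode the fixed fields and TLVs of a query packet.  The TLV
    layout is guesswork inferred from the observed field sizes."""
    assert payload[0:2] == b"\x02\x01"              # unknown fixed field
    (length,) = struct.unpack("!H", payload[2:4])   # same as UDP payload length
    assert length == len(payload)
    src_mac = payload[7:13].hex(":")                # bytes 07-12
    tlvs, i = {}, 13
    while i < len(payload):
        t, l = payload[i], payload[i + 1]           # type, total TLV length
        tlvs[t] = payload[i + 2 : i + l]
        i += l
    return src_mac, tlvs

# Build a synthetic packet matching the breakdown above.  Byte 06 is
# unaccounted for in the breakdown, so it's zeroed here.
src = bytes.fromhex("000b5f730491")
dst = bytes.fromhex("000b5f730480")
body = b"\x01\x08" + dst                      # TLV 01: destination MAC, 8 bytes
body += b"\x03\x04" + struct.pack("!H", 11)   # TLV 03: VLAN ID, 4 bytes
pkt = b"\x02\x01" + struct.pack("!H", 13 + len(body)) + b"\x05\x02\x00" + src + body

src_mac, tlvs = parse_query(pkt)
print(src_mac, sorted(tlvs))
```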

A reply packet looks like this:
00-01: 04 01  (unknown)
02-03: Length (same as UDP payload length)
04   : 06  (unknown)

It includes the following TLV sets:
04 Originator CDP name (variable length)
05 Originator CDP Platform string (variable length)
06 Originator CDP management IP (always 6 bytes)
0F Unknown, 1 byte, seems to be related to "end of path" information
03 Next hop CDP IP (always 6 bytes)
10 Next hop CDP name string (variable length)
07 Egress interface name (variable length)
08 Ingress interface name (variable length)

It's very surprising to me that mapping out an L2 environment can be done using L3 (off-subnet) tools with seemingly no security.  I'm not much of a believer in security by obscurity (I generally run CDP on edge ports), but this level of network mapping without even requiring an SNMP read-only string seems like it could be a problem.  The only hint of a complicating factor here is that the name of the target switch is embedded in the request packet.  If that name is checked before a reply is sent, there's some small measure of security.  But all an attacker needs is the name of a single switch, since that first switch will give up the names of all of its neighbors.

It will be difficult to strike the balance between security and usability when writing ACLs for this service, since you need to protect every IP interface on an L3 switch, while still providing service to clients on every IP interface of every L2/L3 switch.
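A first cut at such an ACL might look like the following hypothetical sketch (the management subnet is made up), permitting the UDP 2228 service only from management stations:

```
! Hypothetical example -- substitute your own management subnet.
ip access-list extended PROTECT-L2TRACE
 permit udp 192.168.100.0 0.0.0.255 any eq 2228
 deny   udp any any eq 2228
 permit ip any any
```

The catch, as noted, is that this would need to be applied inbound on every IP interface of every switch in the domain.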

Saturday, September 18, 2010

OSPF: IOS and NX-OS interoperability

I got stung by a nasty OSPF interoperability problem last night.  IOS devices and NX-OS devices follow different rules when installing AS-external (type 5) LSAs into their routing tables.

Consider the following example.  Four routers, in two areas.  Two routers are ASBRs, one in each area.

Both ASBRs are advertising identical LSAs for an external route to the 10/8 network.  These LSAs are flooded to all four routers in the domain so that each router may make the best decision about how to reach 10/8.

IOS routers follow the rules laid out by RFC1583.  Both LSAs are external type 1, so the metric stamped on the LSA (5 in both cases) is added to the cost to reach the ASBR.  The path with the lowest cost will be installed in the routing table.

Router B has two choices:
  • The path through "A" starts with 5, and includes a cost of 64 to cross a T1 link.  Total cost: 69
  • The path through "D" starts with 5, and includes two Ethernet segments (10 each).  Total cost: 25
So, router B will send packets for 10/8 to Router C.

Router C has two choices:
  • The path through "A" starts with 5, includes a T1 (64) and an Ethernet (10).  Total cost: 79
  • The path through "D" starts with 5, and includes a single Ethernet segment (10).  Total cost: 15
Router C will send packets for 10/8 to router D.

The network is converged, life is good.  Everybody routes 10/8 towards "D".  Except for "A", which will use its non-OSPF route.
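The arithmetic is easy to check.  Here's a quick sketch of the cost-only comparison, using the link costs from the example (T1 = 64, Ethernet = 10, LSA metric = 5):

```python
# RFC1583-style selection for external type-1 routes: total cost is the
# LSA metric plus the OSPF cost to reach the ASBR; lowest total wins.
def best(paths):
    return min(paths, key=paths.get)

b_choices = {"via A": 5 + 64, "via D": 5 + 10 + 10}   # 69 vs 25
c_choices = {"via A": 5 + 64 + 10, "via D": 5 + 10}   # 79 vs 15
print(best(b_choices), best(c_choices))               # both prefer "via D"
```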

But what if Router C is a Nexus 7000?  Nexus 7000s running NX-OS follow(ish) a different set of rules when considering AS-external paths.  They're laid out by RFC2328:

o   Intra-area paths using non-backbone areas are always the most preferred.
For Router C, the path through "A" is an intra-area non-backbone path.  The path through "D" is a backbone path.  So, when following these new rules, "C" will forward 10/8 to "B" (toward A).  ...and "B" will forward to "C" (toward "D").  An OSPF routing loop.  Whee!
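The loop can be demonstrated with a toy model.  The costs and path types below are taken from the example; router B compares only cost (RFC1583 style) while router C always prefers intra-area non-backbone paths (RFC2328 style):

```python
# (next_hop, total cost, path type) for reaching 10/8.  Costs from the
# example: LSA metric 5, T1 = 64, each Ethernet segment = 10.
paths_b = [("A", 5 + 64, "nonbackbone"), ("C", 5 + 20, "backbone")]
paths_c = [("B", 5 + 74, "nonbackbone"), ("D", 5 + 10, "backbone")]

def rfc1583(paths):
    # Classic behavior: lowest total cost wins.
    return min(paths, key=lambda p: p[1])[0]

def rfc2328(paths):
    # New behavior: intra-area non-backbone paths are always preferred,
    # regardless of cost; fall back to cost among the preferred set.
    preferred = [p for p in paths if p[2] == "nonbackbone"]
    return min(preferred or paths, key=lambda p: p[1])[0]

b_next = rfc1583(paths_b)   # B picks the cheap path, toward C
c_next = rfc2328(paths_c)   # C picks the non-backbone path, toward B
print(b_next, c_next)       # B and C point at each other: a routing loop
```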

Fortunately, there's a switch to control this behavior.  RFC2178 introduced a switch called 'RFC1583Compatibility'.  Routers with compatibility mode enabled ignore the "always prefer intra-area non-backbone" path business, and just use the classic OSPF cost-based decision making.

On NX-OS you configure compatibility thusly:
N7K(config)# router ospf <tag>
N7K(config-router)# rfc1583compatibility 
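IOS exposes the inverse of this knob as 'compatible rfc1583', which is enabled by default.  If you'd rather standardize the domain on the newer RFC2328 behavior, disable it on the IOS side instead:

```
router ospf <process-id>
 no compatible rfc1583
```

Either way, the setting must match on every router in the domain.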
Unfortunately, the default setting for this switch is enabled on IOS devices, and disabled on NX-OS.  The RFC is clear that these switches need to match on all routers in a domain, and that the switch should (a lowercase "should" -- not SHOULD, not MUST) be enabled by default (section C.1).

Even worse, the compatibility mode is completely undocumented for NX-OS.
(update 11/7/2010:  the compatibility switch is now documented in the October 2010 command reference.  Not that you'd find it there -- the document is essentially a dictionary.  It is still missing from the October 2010 Unicast Routing Configuration Guide)

Gah.  This was really frustrating.  Choosing to defy both the RFC (which calls for compatibility by default), and interoperability with the rest of your product line, without a shred of documentation is, well...  An interesting choice.

Sunday, September 12, 2010

Amazon EC2 IPsec tunnel to Cisco IOS router

Amazon recently introduced a new EC2 "micro" instance:  613MB of memory and a burstable ~2GHz slice of a single processor core for as little as $0.007 per hour.  Cheap!

So, I spun one up and started playing with it.

The first thing I wanted to do was get an IPsec tunnel to my home edge router working.  It turned out to be trickier than I'd expected.  Here's the whole process:

Launch an Instance
  • Click the "Launch Instance" button
  • Choose "Basic Fedora Core 8"
  • Set "Micro" instance type
  • Download a new SSH key (or use an existing one)
  • Configure a security group (this is the firewall service) like this:

Configure OpenSwan on the EC2 Instance
  • Connect to the instance using the directions found here.
  • Install IPsec packages:
yum -y update
yum -y install openswan openswan-doc ipsec-tools bind
  • Set some variables that will be useful later.  The home-network values are placeholders -- substitute your own.
# The private IP address assigned to your EC2 instance.
EC2PRIVATE=`ifconfig eth0|grep Bcast|cut -d: -f 2|cut -d\  -f 1`

# The public IP address assigned to your EC2 instance.
EC2PUBLIC=`curl -s http://169.254.169.254/latest/meta-data/public-ipv4`

# The public IP address of the home router.
HOMEPUBLIC=<your home router's public IP>

# The private address space in use at home.
HOMEPRIVATENET=<your home network, e.g. 192.168.1.0/24>

# A secret key, created here using dns-keygen (part of the bind package installed above).
PSK=`dns-keygen`
  •  Configure the 'home' openswan connection.  The leading whitespace is important here.
echo "conn home" > /etc/ipsec.d/home.conf
echo "  left=%defaultroute" >> /etc/ipsec.d/home.conf
echo "  leftsubnet=$EC2PRIVATE/32" >> /etc/ipsec.d/home.conf
echo "  leftid=$EC2PUBLIC" >> /etc/ipsec.d/home.conf
echo "  right=$HOMEPUBLIC" >> /etc/ipsec.d/home.conf
echo "  rightid=$HOMEPUBLIC" >> /etc/ipsec.d/home.conf
echo "  rightsubnet=$HOMEPRIVATENET" >> /etc/ipsec.d/home.conf
echo "  authby=secret" >> /etc/ipsec.d/home.conf
echo "  ike=aes128-sha1-modp1024" >> /etc/ipsec.d/home.conf
echo "  esp=aes128-sha1" >> /etc/ipsec.d/home.conf
echo "  pfs=yes" >> /etc/ipsec.d/home.conf
echo "  forceencaps=yes" >> /etc/ipsec.d/home.conf
echo "  auto=start" >> /etc/ipsec.d/home.conf
chmod 600 /etc/ipsec.d/home.conf

  •  Configure the 'home' preshared key:
echo "$EC2PUBLIC $HOMEPUBLIC: PSK \"$PSK\"" > /etc/ipsec.d/home.secrets
chmod 600 /etc/ipsec.d/home.secrets
Configure the IOS end of the tunnel
The variables collected above appear as uppercase placeholders in the configuration below.  When you need to do some variable substitution in the IOS configuration, pop back into your Amazon shell window and echo the variable out.  Like this:
echo $PSK
Here's the IOS configuration I'm using:
crypto isakmp policy 20
 encr aes
 authentication pre-share
 group 2
 lifetime 86400
crypto isakmp key PSK address EC2PUBLIC no-xauth
crypto ipsec security-association lifetime seconds 1800
ip access-list extended AMAZON-CRYPTO-ACL
 permit ip any host EC2PRIVATE
crypto ipsec transform-set AMAZON-TRANSFORM-SET esp-aes esp-sha-hmac

crypto map INTERNET-CRYPTO 10 ipsec-isakmp
 description Amazon EC2 instance
 set peer EC2PUBLIC
 set transform-set AMAZON-TRANSFORM-SET
 set pfs group2
 match address AMAZON-CRYPTO-ACL

Start openswan on the EC2 instance
The following commands prepare the ipsec service boot scripts, and then manually start the service:
chkconfig ipsec on
service ipsec start
That's it!  Now I can ping the private ($EC2PRIVATE) address of the EC2 instance from one of my internal machines at home.  This works in my environment because the 10.x.x.x address assigned by Amazon happens to fall within the default route in use by my home gateway.  You may need to add a static route if you're pushing the 10/8 block elsewhere in your environment.

Being able to talk securely to the private address is preferable to using the public one because of applications (SIP, FTP) that embed IP address information into their application payload.  These don't NAT well, and now they don't have to.

If you want to be able to talk securely to the public address of an EC2 instance, that can probably be done with a dummy interface on the EC2 end.  I'll work on that later.

Monday, September 6, 2010

Link Aggregation, Load Balancing and Redundancy

Link Aggregation is a pretty easy to grasp technology:  Take many links, bundle them into a single logical link.  No mystery there.  When a frame is destined to traverse the aggregated link the switch needs to make a decision: Which member link of the aggregation should be used?

While it might seem convenient, the switch must not round-robin frames across the members of the aggregation, because that could lead to packets belonging to a particular flow arriving at their destination in a different order than they were sent by the originating station.  Out-of-order delivery can be a big problem for some flows, so that sort of behavior is explicitly forbidden.  Ordered frame delivery is one of the hard invariants of LAN bridging behavior, and is codified by ISO/IEC Standard 15802-1.

Instead, the switch employs a deterministic algorithm to ensure that every frame belonging to a given flow crosses the same link.  This way, frames that need to stay in order aren't racing each other down parallel paths.  Note that there is no requirement for the same link to be used for the flow's return traffic, nor is there a requirement to use the same link-selection algorithm on both ends of the aggregation.  ...In fact, there are cases where choosing the same selection algorithm is just flat wrong.  Consider the following example:

Four servers are each pushing 30Mb/s across a 4-way aggregation to a gigabit-attached router.  Let's assume Switch A selects the link for each frame with a modulus operation: src_mac % link_count.

Server A: 0x0C % 4 = 0
Server B: 0x1B % 4 = 3
Server C: 0x2A % 4 = 2
Server D: 0x39 % 4 = 1

Great!  The load will balance perfectly!  Every server's traffic will traverse a different link.  What about in the other direction?  The source MAC address on every frame will be 0000.0000.005D (the router).  Link 1 (0x5D % 4 = 1) will always be selected, no matter which server the router is talking to.

So, for this to load balance nicely in both directions, we want Switch A to balance according to source MAC addresses, and we want switch B to balance according to destination MAC addresses.
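A quick sketch makes the asymmetry concrete.  The single-byte "MAC addresses" below are just the low octets from the example -- a deliberate simplification of real hashing:

```python
# Low octets of the four server MACs, and of the router's MAC
# (0000.0000.005D), from the example.
servers = {"A": 0x0C, "B": 0x1B, "C": 0x2A, "D": 0x39}
ROUTER = 0x5D
LINKS = 4

# Switch A hashes on source MAC: each server lands on its own link.
upstream = {name: mac % LINKS for name, mac in servers.items()}

# If Switch B also hashed on source MAC, every router-sourced frame
# would pick the same link; hashing on destination MAC restores the spread.
downstream_by_src = {name: ROUTER % LINKS for name in servers}
downstream_by_dst = {name: mac % LINKS for name, mac in servers.items()}
print(upstream, downstream_by_src, downstream_by_dst)
```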

Failure Scenario
Back to looking at server-generated traffic.  We're still doing link selection by modulus of the source MAC address.  If one of our links fails, the switch changes the number used in the modulus operation.  Instead of taking the modulus by 4, it takes the modulus by 3 (the new link count).

Server A: 0x0C % 3 = 0
Server B: 0x1B % 3 = 0
Server C: 0x2A % 3 = 0
Server D: 0x39 % 3 = 0

Traffic distribution with 3 links

Now we're trying to push the aggregate of all four servers (120Mb/s) across a single 100Mb/s link.  Bummer.

Obviously we don't want this to happen.  We can add a fifth link so that a single failure won't drop us into the unlucky modulus-by-3 situation.  Let's see how things look with 5 links:

Server A: 0x0C % 5 = 2
Server B: 0x1B % 5 = 2
Server C: 0x2A % 5 = 2
Server D: 0x39 % 5 = 2

Traffic distribution with 5 links

Nuts.  This is not an improvement.  4 links balances beautifully, but the balance is completely upset if we use 3 or 5 links.  2 links isn't safe because that wouldn't provide any redundancy.  What to do?  Allow me to introduce the 'lacp max-bundle' command.  With 'lacp max-bundle 4', the switch will never bring more than 4 links into the aggregation.  The fifth link in the example above will be in a standby mode.  Should one of the 4 active links fail, the standby link will be brought into the aggregation so that you'll always have 4 (not 5!) links, even after a failure.
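The deck-stacking above is easy to verify in one loop, using the same simplified single-octet MACs as in the examples:

```python
# Spread the four example source-MAC low octets across 3, 4 and 5 links.
macs = [0x0C, 0x1B, 0x2A, 0x39]
for n in (3, 4, 5):
    links = [m % n for m in macs]
    spread = "balanced" if len(set(links)) == len(macs) else "collapsed"
    print(n, links, spread)   # only n=4 balances; 3 and 5 pile onto one link
```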

I made these examples simple, and I stacked the deck to make a point.  I concede that the hashing algorithm doesn't work exactly as I've described.  In reality, the frames are hashed into a fixed number of buckets, which are themselves distributed among the links.  Depending on how the chips fall, this may make the situation better or worse.  ...But it doesn't make it simpler :-)  I'm also aware that the selection algorithms can make use of more than just the MAC addresses (in fact, that ability is also codified in the ISO standard I cited -- the earlier standard would not have allowed information above layer 2 to be used for this purpose).

The general point is to consider your traffic carefully, and not use a one-size-fits-all approach to configurations.  Then test.  The problem outlined here (removing one link forces too much traffic onto a single member of the aggregation) is one I've encountered in a customer's network.