Wednesday, December 19, 2012

Quick-n-dirty 'network' statement generator

I recently had a requirement to feed a lab router a bunch of BGP prefixes for some testing. The test required a non-overlapping, random-looking population of prefixes from several different eBGP neighbors.

I decided to bang together a little script to generate the prefixes in a format suitable for quagga's bgpd daemon, but wound up adding features throughout the day as I discovered new requirements. The end result was something that might be useful to somebody else.

Here I'm telling it to give me 5 network statements from within The netmask will be 24 bits:
$ ./ -c 5 -n -m 24
Or I can have a range of bitmasks, this time the networks are packed much tighter and they'll be masked with 26 to 30 bits:
$ ./ -c 5 -n -m 26-30
Note however that the output above has a problem. falls within The -A flag (Avoid collisions) makes sure that doesn't happen:
$ ./ -c 5 -n -m 26-30 -A
In the cases where I'm adding network statements to some prior bgpd config file, I want to be able to avoid collisions with the existing configuration. Enter the -p flag, which reads in a prior bgpd configuration file and avoids collisions with any existing 'network' statements:
$ ./ -c 5 -n -m 26-30 -A -p /usr/local/etc/bgpd.conf
Finally, there's a flag to throw a route-map (or other text) onto the output:
$ ./ -c 5 -n -m 26-30 -A -s " route-map RM_MYMAP"
 network route-map RM_MYMAP
 network route-map RM_MYMAP
 network route-map RM_MYMAP
 network route-map RM_MYMAP
 network route-map RM_MYMAP
It's not very elegant, nor fast, but it got me through a problem and seemed worth sharing. Generating 1000 verified unique prefixes takes about 20 seconds on my laptop. The second 1000 prefixes take 100 seconds, and it gets slower from there. I'm sure I could do something speedier than using NetAddr::IP's within operator (over and over and over), but that will have to wait until I need to simulate a lot more unique addresses.
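For anyone who wants to play with the same idea outside of Perl, here is a minimal Python sketch of random prefix generation with collision avoidance. It is not the script from this post: the function names, the 10.0.0.0/16 supernet and the seed parameter are all assumptions made for illustration, and it uses the same kind of brute-force overlap check that makes the original slow.

```python
import ipaddress
import random


def random_prefixes(supernet, count, min_len, max_len, existing=(), seed=None):
    """Return `count` random, mutually non-overlapping IPv4 prefixes.

    Hypothetical helper: prefixes are drawn from `supernet`, masked with
    min_len..max_len bits, and never overlap each other or `existing`.
    """
    rng = random.Random(seed)
    parent = ipaddress.IPv4Network(supernet)
    if not parent.prefixlen <= min_len <= max_len <= 32:
        raise ValueError("need supernet length <= min_len <= max_len <= 32")
    taken = [ipaddress.IPv4Network(p) for p in existing]
    result = []
    attempts = 0
    while len(result) < count:
        attempts += 1
        if attempts > 1000 * count:
            raise RuntimeError("gave up: address space looks too crowded")
        plen = rng.randint(min_len, max_len)
        # Pick one of the 2^(plen - parent length) aligned blocks at random.
        index = rng.randrange(2 ** (plen - parent.prefixlen))
        base = int(parent.network_address) + index * 2 ** (32 - plen)
        candidate = ipaddress.IPv4Network((base, plen))
        # Brute-force collision check against everything accepted so far.
        if any(candidate.overlaps(other) for other in taken):
            continue
        taken.append(candidate)
        result.append(candidate)
    return result


def network_statements(prefixes, suffix=""):
    """Format prefixes as bgpd-style 'network' lines with an optional suffix."""
    return [f" network {p}{suffix}" for p in prefixes]


if __name__ == "__main__":
    nets = random_prefixes("10.0.0.0/16", 5, 26, 30, seed=1)
    print("\n".join(network_statements(nets, " route-map RM_MYMAP")))
```

Passing the prefixes parsed from a prior bgpd.conf as `existing` gives roughly the effect of the -p flag, though the parsing itself is left out of this sketch.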

The script is here.

Friday, July 27, 2012

Native VLAN - Some Surprising Results

I did some fiddling around with router-on-a-stick configurations recently and found some native VLAN behavior that took me by surprise.

The topology for these experiments is quite simple, just one router, one switch, and a single 802.1Q link interconnecting them:
Dead Simple Routers-On-A-Stick Configuration

The initial configuration of the switch looks like:

vlan 10,20,30
interface FastEthernet0/34
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 10,20,30
 switchport mode trunk
 spanning-tree portfast trunk
interface Vlan10
 ip address
interface Vlan20
 ip address
interface Vlan30
 ip address

And the initial configuration of the router looks like:

interface FastEthernet0/0
 no ip address
 duplex auto
 speed auto
interface FastEthernet0/0.10
 encapsulation dot1Q 10
 ip address
interface FastEthernet0/0.20
 encapsulation dot1Q 20
 ip address
interface FastEthernet0/0.30
 encapsulation dot1Q 30
 ip address

So, nothing too interesting going on here. The devices can ping each other on each of their three IP interfaces.

We can switch VLAN tagging off for one of those VLANs, by making it the native VLAN for the link. On the switch we'll do:

S1(config-if)#switchport trunk native vlan 20

And on the router we have two choices:

1) Eliminate the FastEthernet0/0.20 subinterface altogether, and apply the IP to the physical interface:

no interface FastEthernet 0/0.20
interface FastEthernet 0/0
  ip address

2) Use the native keyword on the subinterface encapsulation configuration:

interface FastEthernet0/0.20
  encapsulation dot1Q 20 native

Both configurations leave us with full connectivity on all three subnets, with the network's packets running untagged on the link. So, what's the difference between them?

The Cisco 360 training materials have this to say about the encapsulation of 802.1Q router subinterfaces:
you can assign any VLAN to the native VLAN on the router and it will make no difference
But the truth is a little more nuanced than that.

First of all, option 1 makes it impossible to administratively disable just the interface. Shutting this down will knock all of the subinterfaces offline as well (thanks, mellowd!)

To see the next difference between these router configuration options, let's introduce a misconfiguration by omitting switchport trunk native vlan 20 from the switch interface.

If we've configured option 1 on the router, pretty much nothing works. Frames sent by the switch with a VLAN 20 tag are ignored when they arrive at the router, and untagged frames sent by the router are similarly ignored by the switch.

On the other hand, if we've configured option 2 on the router, things kind of start to work with the mismatched trunk configuration. A ping from the switch doesn't succeed, but elicits the following response from 'debug ip packet' on the router:

IP: tableid=0, s= (FastEthernet0/0.20), d= (FastEthernet0/0.20), routed via RIB
IP: s= (FastEthernet0/0.20), d= (FastEthernet0/0.20), len 100, rcvd 3
IP: tableid=0, s= (local), d= (FastEthernet0/0.20), routed via FIB
IP: s= (local), d= (FastEthernet0/0.20), len 100, sending

The mistakenly tagged packet was accepted by subinterface Fa0/0.20! So, router configuration option #2 does something pretty interesting: It causes the subinterface to send untagged frames, just like the 'native' function suggests, but it allows the subinterface to receive either untagged frames, or frames tagged with the number specified in the subinterface encapsulation directive. Clearly the number we use makes a difference after all, and I found the behavior pretty interesting.

Catalyst switches have a similarly interesting feature: vlan dot1q tag native

When this command is issued in global configuration mode, it causes the switch to tag all frames egressing a trunk interface, even those belonging to the native VLAN. The curious thing about this command is that it causes the interface to accept incoming frames into the native VLAN whether they're tagged, or untagged. It's kind of the reverse of the curious router behavior, and leads to an interesting interoperability mode in which:
  • The router generates untagged frames, but will accept either tagged or untagged frames using:
    • interface FastEthernet0/0.20
    •   encapsulation dot1Q 20 native
  • The switch generates tagged frames, but will accept either tagged or untagged frames using:
    • vlan dot1q tag native
    • int FastEthernet 0/34
    •   switchport trunk native vlan 20
And the result is that we have full interoperability between a couple of devices that can't agree on whether an interface should be tagged or not, because they both have liberal policies regarding the type of traffic they'll accept.
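The send/accept rules described above can be captured in a toy model. This is only a sketch of the behavior I observed, not of how IOS implements anything; the class, its field names and the port definitions are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vlan20Port:
    """Hypothetical model of how one end of the link treats VLAN 20."""
    name: str
    sends_tagged: bool
    accepts_tagged: bool
    accepts_untagged: bool
    vlan: int = 20

    def transmit(self) -> Optional[int]:
        # None means the frame leaves untagged.
        return self.vlan if self.sends_tagged else None

    def accepts(self, tag: Optional[int]) -> bool:
        if tag is None:
            return self.accepts_untagged
        return self.accepts_tagged and tag == self.vlan


def bidirectional(a: Vlan20Port, b: Vlan20Port) -> bool:
    """True when each side accepts what the other side sends."""
    return b.accepts(a.transmit()) and a.accepts(b.transmit())


# Router option 1: IP address on the physical interface.
router_opt1 = Vlan20Port("router, IP on Fa0/0", False, False, True)
# Router option 2: encapsulation dot1Q 20 native.
router_opt2 = Vlan20Port("router, dot1Q 20 native", False, True, True)
# Switch with "switchport trunk native vlan 20".
switch_native = Vlan20Port("switch, native vlan 20", False, False, True)
# Switch missing the native VLAN statement, so VLAN 20 is tagged.
switch_mismatch = Vlan20Port("switch, vlan 20 tagged", True, True, False)
# Switch with native vlan 20 plus "vlan dot1q tag native".
switch_tag_native = Vlan20Port("switch, tag native", True, True, True)

if __name__ == "__main__":
    for router in (router_opt1, router_opt2):
        for switch in (switch_native, switch_mismatch, switch_tag_native):
            print(f"{router.name} <-> {switch.name}: {bidirectional(router, switch)}")
```

In this model, option 2 against the mismatched switch is the half-working case: the router accepts the tagged request, but the switch ignores the untagged reply.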

Here's the result (three pings) according to my sniffer:

02:04:05.284217 00:0b:fd:67:bf:00 > 00:07:50:80:80:81, ethertype 802.1Q (0x8100), length 118: vlan 20, p 0, ethertype IPv4, > ICMP echo request, id 36, seq 0, length 80
02:04:05.287826 00:07:50:80:80:81 > 00:0b:fd:67:bf:00, ethertype IPv4 (0x0800), length 114: > ICMP echo reply, id 36, seq 0, length 80
02:04:05.288391 00:0b:fd:67:bf:00 > 00:07:50:80:80:81, ethertype 802.1Q (0x8100), length 118: vlan 20, p 0, ethertype IPv4, > ICMP echo request, id 36, seq 1, length 80
02:04:05.292322 00:07:50:80:80:81 > 00:0b:fd:67:bf:00, ethertype IPv4 (0x0800), length 114: > ICMP echo reply, id 36, seq 1, length 80
02:04:05.292904 00:0b:fd:67:bf:00 > 00:07:50:80:80:81, ethertype 802.1Q (0x8100), length 118: vlan 20, p 0, ethertype IPv4, > ICMP echo request, id 36, seq 2, length 80
02:04:05.296703 00:07:50:80:80:81 > 00:0b:fd:67:bf:00, ethertype IPv4 (0x0800), length 114: > ICMP echo reply, id 36, seq 2, length 80

The pings succeeded, so we know that we have bidirectional connectivity. Nifty.

Friday, July 20, 2012

New 10Gb/s Interconnect Options

Followers of this blog, or folks who've heard me on the Packet Pushers Podcast may have noticed that I obsessively look for less expensive ways to interconnect data center devices.

That's because the modules are so expensive! A loaded Nexus 7010 is intimidating enough with its $473,000 list price, but that's without any optic modules...

If we want to link those interfaces to something with 10GBASE SR modules then triple the budget because the optics cost twice as much as the switch:

$1495 / module * 8 blades * 48 links / blade * 2 modules / link = $1,148,160

By comparison, purchasing the same number of links in 5m Twinax form comes in under $100,000.

So, there tend to be a lot of TwinAx and FET modules in my designs, and the equipment is located carefully to ensure that ~$200 TwinAx links can be used rather than ~$3,000 fiber links.
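The arithmetic is easy to rerun with your own numbers. A quick sketch, using the list price and link counts quoted above; the flat $200 per Twinax link is an assumption taken from the "~$200" figure, and the variable names are just for illustration.

```python
SR_MODULE_LIST_USD = 1495   # per 10GBASE-SR module, as quoted above
BLADES = 8
LINKS_PER_BLADE = 48
MODULES_PER_LINK = 2        # one optic at each end of the link
TWINAX_PER_LINK_USD = 200   # assumption: the "~$200" figure, applied flat

links = BLADES * LINKS_PER_BLADE
optics_cost = SR_MODULE_LIST_USD * links * MODULES_PER_LINK
twinax_cost = TWINAX_PER_LINK_USD * links

print(f"{links} links")
print(f"SR optics: ${optics_cost:,}")
print(f"Twinax:    ${twinax_cost:,}")
```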

The tradeoff comes when you put a lot of 5m Twinax cables into one place and quickly find they're not much fun to work with because they're not bendy and because they're thick. That's why I'm so interested in something I noticed yesterday on Cisco's 10GBASE SFP+ Modules Data Sheet:

No word on pricing yet, but "cost-effective" sounds promising since they're positioning this cable directly alongside the Twinax option. Also no word on support by various devices. My hope is that systems which currently support only passive TwinAx cables will be able to use the longer 7m and 10m AOCs.

Also new to the data sheet are some 1.5m, 2m and 2.5m TwinAx cables. Nothing groundbreaking here, but these will be nice options to have. It's annoying to deal with a 3m cable when 1m won't quite reach from one rack to the next.

Monday, June 18, 2012

Nexus 7004 - WTF?

Where's The Fabric?

Nexus 7004 photo by Ethan Banks
On display at Cisco Live in San Diego was a new Nexus 7000 chassis, the 4-slot Nexus 7004.

The first thing that I noticed about this chassis is that it doesn't take the same power supplies as its 9, 10, and 18 slot siblings. According to the guy at the Cisco booth, the 7004 uses 3KW power supplies that will live in the four small slots at the bottom of the chassis.

The next thing I looked for were the fabric modules. Imagine my surprise to learn that the 7004 doesn't have fabric modules!

No fabric modules? This is pretty interesting, but before we get into the specifics, a quick recap of how the larger Nexus 7K chassis work is probably in order...

The Nexus 7000 uses a 3 stage fabric where frames/packets moving from one line card to another pass through three distinct fabric layers:
  1. Fabric on the ingress line card - The fabric here connects the various replication engine or switch-on-chip (SoC) ASICs together. Traffic that ingresses and egresses on the same line card doesn't need to go any further than this fabric layer. Traffic destined for a different card or the supervisor progresses to:
  2. Chassis fabric - There are up to five "fabric modules" installed in the chassis. These modules each provide either 46Gb/s or 110Gb/s to the line cards, depending on which card we're talking about.
  3. Fabric on the egress line card - This is the same story as the ingress card's fabric, but our data is hitting this fabric from the chassis fabric, rather than from a front panel port.
Proxy-routed packets in an F1/M1 chassis take an even more complicated path, because they have the privilege of hitting the fabric twice, as they traverse up to 3 different line cards. 5 fabric hops in total.

The interconnections between line cards and fabric modules look something like the following:
Nexus 7009/7010/7018 with fabric modules

Note that I'm not completely sure about the speed of the interconnection between supervisor and fabric (black lines), but it doesn't really matter because control plane stuff is pretty low bandwidth and aggressively throttled (CoPP) anyway.

The important takeaways from the drawing above are:

  1. Fabric 2 modules can run at M1/F1 speeds (46Gb/s per slot), and at F2 speeds (110 Gb/s per slot) simultaneously. They provide the right amount of bandwidth to each slot.
  2. While there are 5 fabric modules per chassis, each card (at least the M1/F1/F2 - not sure about supervisors) has 10 fabric connections, with 2 connections to each fabric module.
It's been commonly explained that the 7004 chassis is able to do away with the fabric modules by interconnecting its pair of payload slots (two slots are dedicated for supervisors) back-to-back, and eliminating the chassis fabric stage of the 3-stage switching fabric.

...Okay... But what about control plane traffic? On the earlier chassis, the control plane traffic takes the following path:
  1. Ingress port
  2. Ingress line card fabric
  3. Chassis fabric
  4. Supervisor
With no chassis fabric, it appears that there's no way for control plane traffic to get from a line card to the supervisor. Well, it turns out that the 7004 doesn't dedicate all of the line card's fabric to a back-to-back connection.

I think that the following diagram explains how it works, but I haven't seen anything official: 8 of the fabric channels on each line card connect to the peer line card, and one channel is dedicated to each supervisor. Something like this:

Nexus 7004 - no fabric modules
Cool, now we have a card <-> supervisor path, but we don't have a full line-rate fabric connection between the two line cards in the 7004. Only 8 fabric channels are available to the data plane because two channels have been dedicated for communication with the supervisors.

F2 cards clearly become oversubscribed, because they've got 480 Gb/s of front panel bandwidth, adequate forwarding horsepower, but only 440Gb/s of fabric.

I believe that F1 cards would have the same problem, with only 184 Gb/s (eight 23 Gb/s fabric channels), but now we're talking about an F1-only chassis with no L3 capability. I'm not sure whether that is even a supported configuration.

M1 cards would not have a problem, because their relatively modest forwarding capability would not be compromised by an 8 channel fabric.

Having said all that, the oversubscription on the F2 module probably doesn't matter: Hitting 440Gb/s on the back-to-back connection would require that 44 front panel ports on a single module are forwarding only to the other module. Just a little bit of card-local switching traffic would be enough to ensure that the backplane is not oversubscribed.
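Here is the back-of-the-envelope math behind those numbers as a short sketch. It assumes my guess about the 7004 is right (eight of ten fabric channels used back-to-back) and that each channel carries half of the per-fabric-module figure; the names are invented for illustration.

```python
FABRIC_MODULES = 5
CHANNELS_PER_MODULE = 2       # each card has 2 connections to each fabric module
SUPERVISOR_CHANNELS = 2       # assumption: one channel to each supervisor
GBPS_PER_MODULE = {"F1": 46, "F2": 110}   # per fabric module, per slot


def back_to_back_gbps(card):
    """Fabric bandwidth left for the peer line card in a 7004 (assumed layout)."""
    channels = FABRIC_MODULES * CHANNELS_PER_MODULE - SUPERVISOR_CHANNELS
    per_channel = GBPS_PER_MODULE[card] // CHANNELS_PER_MODULE
    return channels * per_channel


f2_front_panel = 48 * 10      # F2 front panel bandwidth in Gb/s
f2_fabric = back_to_back_gbps("F2")

print("F2 fabric:", f2_fabric, "Gb/s vs front panel:", f2_front_panel, "Gb/s")
print("F1 fabric:", back_to_back_gbps("F1"), "Gb/s")
print("10G ports needed to fill the F2 fabric:", f2_fabric // 10)
```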

Brocade Fabric Symposium

Network Fabric?
Photo and weave by Travis Meinolf
The first network vendor I met on my recent trip to San Jose (disclaimer about who paid for what here) was Brocade. Their presentations are what finally got the wheels turning in my head: MLAG is cool, but it's not going to be enough. Ethernet fabrics will be relevant to my non-technical enterprise customers.

The Cliffs Notes version of the Brocade presentation is that they make data center network switches with Ethernet ports, and because of Brocade's storied history with SAN fabrics, doing multipath bridged Ethernet fabrics is second nature for them.

Three things have stood out about the Brocade presentations that I've seen:

  1. Brocade is the only vendor I've seen who makes a point of talking about power consumed per unit of bandwidth. I presume that the numbers must be compelling, or else they wouldn't bring it up, but I have not done a comparison on this point.
  2. Per-packet load balancing on aggregate links. This is really cool, see below
  3. MLAG attachment of non-fabric switches to arbitrary fabric nodes. Also really cool, maybe I'll get around to covering this one day.
Per-packet Load Balancing on Aggregate Links
We all know that the Link Selection Algorithms (LSA) used by aggregate links (LACP, EtherChannel, bonded interfaces... Some vendors even call them trunks) choose the egress link by hashing the frame/packet header.

LSAs work this way in order to maintain ordered delivery of frames: Putting every frame belonging to a particular flow onto the same egress interface ensures that the frames won't get mixed up on their way to their destination. Ordered delivery is critical, but strict flow -> link mapping means that loads don't get balanced evenly most of the time. It also means that:
  • You may have to play funny games with the number of links you're aggregating.
  • Each flow is limited to the maximum bandwidth of a single link member.
  • Fragments of too-big IP packets might get mis-ordered if your LSA uses protocol header data
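To make the flow -> link mapping concrete, here is a minimal sketch of a hash-based link selector. Real switches use their own (often simpler) hash functions over whichever header fields are configured; CRC32 over a MAC pair and the function name are stand-ins chosen for illustration.

```python
import zlib
from collections import Counter


def select_link(src_mac, dst_mac, link_count):
    """Pick an egress member link from a hash of the frame's addresses."""
    key = f"{src_mac}>{dst_mac}".encode("ascii")
    return zlib.crc32(key) % link_count


# Six made-up flows toward one destination across a two-link aggregate.
flows = [("00:00:00:00:00:%02x" % i, "00:00:00:00:01:00") for i in range(1, 7)]
usage = Counter(select_link(src, dst, 2) for src, dst in flows)

if __name__ == "__main__":
    for link in range(2):
        print(f"link {link}: {usage[link]} flows")
```

Every frame of a given flow lands on the same link, which preserves ordering, but nothing forces the flows (or their bandwidth) to split evenly.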

Brocade's got a completely different take on this problem, and it kind of blew my mind: They do per-packet load balancing!

The following animation illustrates why per-packet load balancing is helpful:

Pay particular attention to the two frames belonging to the green flow. Don't pay attention to the aggregation's oval icon moving around and alternating transparency. I'm a better network admin than I am an animator :)

When the first green flow frame arrives at the left switch, only one link is free, so it is forwarded on the lower link because it's available.

When the 2nd green frame arrives at the left switch, both transmit interfaces are busy, so it sits in a buffer for a short time, until the upper link finishes transmitting the blue frame. As soon as the upper link becomes available, the 2nd green frame makes use of it, even though the earlier green frame used the lower link.

This is way cool stuff. It allows for better utilization of resources, and lower congestion-induced jitter than you'd get with other vendors' implementations.

In the example above, there's little possibility that frames belonging to any given flow will get mis-ordered because the packets are queued in order, and the various links in the aggregation have equal latency.

But what if the latency on the links isn't equal?

Now the two links in our aggregation are of different lengths, so they exhibit different latencies.

Just like in the previous example, the first green frame uses the lower link (now with extra latency) and the second green frame is queued due to congestion. But when the blue frame clears the upper interface, the green frame doesn't follow directly on its heels. Instead, the green frame sits in queue a bit longer (note the long gap between blue and green on the upper link) to ensure that it doesn't arrive at the right switch until after the earlier green frame has completely arrived.


Now, does the Brocade implementation really work this way? I have no idea :) Heck, I don't even know if these are cut-through switches, but that's how I've drawn them. Even if this isn't exactly how it works, the per-packet load balancing scheme and the extra delay to compensate for mismatched latency are real, and they're really cool.
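Here's a toy model of the scheme as I've drawn it, with the same caveat: it is a sketch of my animation, not of Brocade's implementation. Each frame takes the first link to become free, and is held back just long enough that it can't start arriving before the previous frame of its flow has completely arrived. The function name, time units and frame sizes are all invented.

```python
def schedule(frames, latencies):
    """Plan transmissions for frames given as (arrival, flow, tx_time).

    Returns a list of (flow, link, start, finish) tuples, where finish is
    the time the frame has completely arrived at the far switch.
    """
    free_at = [0] * len(latencies)   # when each link can next transmit
    last_finish = {}                 # far-end completion time per flow
    plan = []
    for arrival, flow, tx_time in frames:
        # Per-packet balancing: take whichever link frees up first.
        link = min(range(len(latencies)), key=lambda i: free_at[i])
        earliest = max(arrival, free_at[link])
        # Extra hold so this frame can't overtake its flow's previous frame.
        hold_until = last_finish.get(flow, 0) - latencies[link]
        start = max(earliest, hold_until)
        finish = start + latencies[link] + tx_time
        free_at[link] = start + tx_time
        last_finish[flow] = finish
        plan.append((flow, link, start, finish))
    return plan


if __name__ == "__main__":
    frames = [(0, "blue", 4), (0, "green", 4), (1, "green", 4)]
    print("equal latency:  ", schedule(frames, [1, 1]))
    print("unequal latency:", schedule(frames, [1, 5]))
```

With equal latencies the second green frame goes out on link 0 as soon as blue clears it (start time 4). With the longer link 1 carrying the first green frame, the second one is held until time 8, which is the long gap in the animation.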

The gotcha about this stuff? All member links need to terminate on a single ASIC within each switch. You're not going to be spreading these aggregate links across line cards, so these sort of aggregations are strictly for scaling bandwidth, and not useful for guarding against failure scenarios where a whole ASIC is at risk.

Monday, April 9, 2012

Ethernet Fabric - The Bulb Glows Dimly

People talking about "Ethernet Fabrics" are usually describing a scheme in which many switches are interconnected and all links are available for forwarding Ethernet frames.

Rather than allowing STP to block links until it forms a loop-free topology, Fabrics include an L2 multipath scheme which forwards frames along the "best" path between any two endpoints.

Brandon Carroll outlined the basics of an Ethernet fabric here, and his description leaves me with the same question that I've had since I first heard about this technology: What problem can I solve with it?

The lightbulb over my head began to glow during one of Brocade's presentations (pop quiz: what switch is the STP root in figure 1 of the linked document?) at the Gestalt IT Fabric Symposium a couple of weeks ago. In that session, Chip Copper suggested that a traditional data center topology with many blocked links and sub-optimal paths like this one:

Three-tier architecture riddled with downsides

might be rearranged to look like this:
Flat topology. All links are available to forward traffic. It's all fabricy and stuff.

The advantages of the "Fabric" topology are obvious:
  • Better path selection: It's only a single hop between any two Access switches, where the previous design required as many as four hops.
  • Fewer devices: We're down from 11 network devices to 6
  • Fewer links: We're down from 19 infrastructure links to 15
  • More bandwidth: Aggregate bandwidth available between access devices is up from 120Gb/s to 300Gb/s (assuming 10 Gb/s links)
If I were building a network to support a specialized, self-contained compute cluster, then this sort of design is an obvious choice.

But that's not what my customers are building. The networks in my customers' data centers need to support modular scaling (full mesh designs like I've pictured here don't scale at all, let alone modularly) and they need any-vlan-anywhere support from the physical network.
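The scaling problem is just the full mesh formula, n(n-1)/2 links for n switches. A quick sketch (the function names are mine) shows how fast it gets out of hand:

```python
def full_mesh_links(switches):
    """Number of links needed to connect every switch to every other switch."""
    return switches * (switches - 1) // 2


def ports_per_switch(switches):
    """Infrastructure ports each switch burns on the mesh."""
    return switches - 1


if __name__ == "__main__":
    for n in (6, 20, 100):
        print(f"{n} switches: {full_mesh_links(n)} links, "
              f"{ports_per_switch(n)} mesh ports per switch")
```

Six switches need the 15 links shown above; a hundred would need 4950.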

So how does a fabric help a typical enterprise?
The scale of the 3-tier diagram I presented earlier is way off, and that's why fully meshing the Top of Rack (ToR) devices looks like a viable option. A more realistic topology in a large enterprise data center might have 10-20 pairs of aggregation devices and hundreds of Top of Rack devices living in the server cabinets.

Obviously, we can't fully mesh hundreds of ToR devices, but we can mesh the aggregation layer and eliminate the core! The small compute cluster fabric topology isn't very useful or interesting to me, but eliminating the core from a typical enterprise data center is really nifty. The following picture shows a full mesh of aggregation switches with fabric-enabled access switches connected around the perimeter:
Two-tier fabric design
Advantages of this design:
  • Access switches are never more than 3 hops from each other.
  • Hop count can be lowered by running a cable
  • No choke point at the network core.
  • Scaling: The most densely populated switch shown here only uses 13 links. This can grow big.
  • Scaling: Monitoring shows a link running hot? Turn up a parallel link.
Why didn't I see this before?
Honestly, I'm not sure why it took so long to pound this fabric use case through my skull. I think there are a number of factors:
  • Marketing materials for fabrics tend to focus on the simple full mesh case, and go out of their way to bash the three-tier design. A two-tier design fabric doesn't sound different enough.
  • Fabric folks also talk a lot about what Josh O'Brien calls "monkeymesh" - the idea that we can build links all willy-nilly and have things work. One vendor reportedly has a commercial with children cabling the network however they see fit, and everything works fine. This is not a useful philosophy. Structure is good!
  • The proposed topology represents a rip-and-replace of the network core. This probably hasn't been done too many times yet :-)

Wednesday, April 4, 2012

Tech Field Day

I'm privileged to have been invited to attend Gestalt IT's Network Field Day 3 event held in and around San Jose last week. These Field Day events are rare opportunities for social-media-enabled IT folks like me (Gestalt IT calls us delegates) to get together with the people behind the amazing products we use in our jobs. Then we chase all of the sales and marketing people out of the room :-)

Full Disclosure
Gestalt IT covered the cost of my event-related travel, hotel room, and meals. The vendors we met (who ultimately are the ones footing the bill) didn't have any say about the list of delegates, and don't know what we're going to say about their products. Generally speaking, they make good products and are hoping that the content of our blogs and tweets will indicate that. There were high and low points, I'll cover interesting examples of both in future posts. Oh, I also came home with vendor-supplied T shirts (two), coffee mugs (three) and a handful of USB flash drives.

What is Tech Field Day?
Different people get different things out of TFD events. For me, the highlight of NFD3 was the opportunity to meet an array of interesting people, some of whom are my high tech heroes. Among the list of my co-delegates are people who, without knowing it, have been influencing my career for years (one of them for over a decade). I'd only met three of them before and didn't really know any of them prior to last week. My co-delegates (table swiped from the NFD3 page) for this event were:

Ethan Banks Packet Pushers @ECBanks
Tony Bourke The Data Center Overlords @TBourke
Brandon Carroll Brandon Carroll
Brad Casemore Twilight in the Valley of the Nerds @BradCasemore
Greg Ferro EtherealMind, Packet Pushers
Jeremy L. Gaddis Evil Routers @JLGaddis
Tom Hollingsworth The Networking Nerd @NetworkingNerd
Josh O’Brien StaticNAT @JoshOBrien77
Marko Milivojevic IPExpert
Ivan Pepelnjak @IOSHints
Derick Winkworth Cloud Toad @CloudToad
Mrs. Y. Packet Pushers @MrsYisWhy

I also got to hang out with the Gestalt IT folks who make it all happen: Stephen Foskett and Matt Simmons, a couple of amazing guys that I'm proud to know.

In addition to meeting my esteemed colleagues (giggling like a schoolgirl because I got to ride around in a limo sitting next to Ivan Pepelnjak for a couple of days), I saw some awesome technology and presentations. Stuff I'm excited about and hope to get to use at work some day. ...And that brings me to what the vendor sponsors get out of these events: Nerdy bloggers like me get exposed to their best new offerings and just might write about them or tell their friends.

Does it work? Well, I'm going to be talking about it. Heck, it's almost inevitable: Anyone who has squeezed out more than a few blog posts will tell you that having material for a dozen posts dropped in your lap is awesome. Several sponsors represent repeat business for Gestalt IT, so the exposure they get from Field Day events must make their participation worthwhile.

The sponsors I met, in the order I met them were:

Chip Copper, Brocade solutioneer (I want this title!) presented at a related Gestalt IT event, the "Fabric Symposium" held the day before NFD3 kicked off. Chip told us all about Brocade's Shortest Path Bridging (SPB) capability, including some nifty special sauce that sets Brocade apart. Brocade management take note: Chip was an awesome presenter, he really knows how to talk to nerds, and made a compelling case for your products. I've seen Brocade sales presentations before and was underwhelmed. Chip made all the difference. I'll be posting about it soon.

Mav Turner and Joel Dolisy from Solarwinds gave us the rundown on the latest in network management. Most of my work is project-oriented consulting, rather than the long-term care and feeding of networks, so I don't work with this sort of product on a daily basis and I won't have much to say about it as far as what's new and exciting. But the room was full of passionate Solarwinds users, so I'm sure the blogosphere will be abuzz about Solarwinds in the coming weeks.

Don Clark and Samrat Ganguly told us about NEC's OpenFlow offering. This was the only OpenFlow based product we saw, and I'm not sure that I really "get" OpenFlow: Sure, it's cool tech, and it presents a couple of large advantages over the traditional way of doing things (stay tuned for more), but it's just so different. Those that can really take advantage of it probably are already all over OpenFlow. My customer base, on the other hand, is mostly non-tech companies for whom the network is a means to an end. I think it will take quite a while before the perceived risks (different is scary!) will be overcome in that market.

Doug Gourlay and Andy Bechtolsheim (!) talked about Arista products and product philosophy in general, their new FX series switches in particular, and their view on the direction of the industry. I'd long known that Arista made compelling products, but I can't remember the last time (before now) that I was actually excited about a switching platform.

Infineta makes data-center-sized WAN optimizers that work entirely differently from the branch office boxes most of us are accustomed to using. Making the point about just how different they are required Infineta to bring an unprecedented level of nerdy to their TFD presentation. I think there was only a single delegate who managed to internalize most of the math they threw at us. Short version: these are exciting products that do things which will likely never be possible to do with server-based WAN optimizers.

Cisco talked to us about several new offerings in their data center, access layer, network management, security and virtual switching portfolios. We were interested in every topic, and asked lots of questions. Unfortunately, there wasn't enough time, nor technical enough Cisco folks to address these topics at the depth we would have liked. Future TFD sponsors take note: Pick a topic and be ready to dig deep on it. Don't plan to fill the allotted time because the nerds you're inviting will find a way to take the discussion into the weeds. The presentation wasn't bad, it was just too wide and not deep enough. Several topics got cut short. I still came away with material for a few blog posts, so stay tuned.

Most of the delegates weren't familiar with Spirent, so the final presentation of NFD3 was a real eye-opener for them. I've had exposure to Spirent's Test Center device on two different occasions. Both were pre-deployment data center design validation exercises, and in both cases the Spirent test tools exposed a problem so that we could solve it before the system went live. Spirent's test boxes are spectacularly capable, powerful and precise, and support a wide spectrum of test types. The boxes are expensive to purchase, but can be rented along with a Spirent PS guy (Hi Glen!) for the sorts of short duration validation tests that most enterprises might be interested in running. They also showed off their iTest product. Everyone was excited about this, and I think you'll see lots of blogging about this product - especially since they gave us test licenses to play with.

In Summary
Gestalt IT puts together an amazing event. I sure hope I get invited to another one. If you're a potential delegate, you owe it to yourself to throw your hat into the ring. Heck, start a blog and get active on Twitter if you're not blogging and tweeting already. The community is amazing and will find surprising ways to pay back your efforts. I'm sure glad that I've done it.

If you're a vendor marketing person, have a look at the sort of exposure that TFD events get for their sponsors. I have no idea about the value proposition that Gestalt IT offers (my brain doesn't work that way), but having the nerdiest bloggers with the biggest audiences excited about your product has to be good, right? Lots of big names are repeat customers, so I can only assume that it works.

Monday, February 13, 2012

10Gb/s Server Access Layer - Use The FEX!

Several people who have read the four-part 10Gb/s pricing series reported that the central thesis wasn't clear enough. So, here it is again:

10Gb/s Servers? Rack your Nexus 5500 with the core switches. Connect servers to Nexus 2232s.

I know of several networks that look something like this:
Top Of Rack Nexus 5500

I think that this might be a better option:
Centralized Nexus 5500

We save lots of money on optics by moving the Nexus 5500 out of the server rack and into the vicinity of the Nexus 7000 core. Then we spend that savings on Nexus 2232s, FETs and TwinAx. These two deployments cost almost exactly the same amount.

The pricing is pretty much a wash, but we end with the following advantages:
  • The ability to support 10GBASE-T servers - I expect this to be a major gotcha for some shops in the next few months.
  • Inexpensive (this is a relative term) 1Gb/s ports at top of rack for low speed connections
  • Greater flexibility for oversubscription (these servers are unlikely to need line rate connections)
  • Greater flexibility for equipment placement (drop an inexpensive FEX wherever you need it)
  • Look at all those free ports! 5K usage has dropped from 24 ports to 8 ports each! Think of how inexpensive the next batch of 10Gig racks will be if we only have to buy 2232s. And the next. And the next...
It's not immediately apparent, but oversubscription is an advantage of this design. With a top-of-rack 5500, you can't oversubscribe the thing; you must dedicate a 10Gb/s port to every server whether that's sensible or not. With FEXes you get to choose: oversubscribe them, or don't.

The catches with this setup are:
  • The core has to be able to support TwinAx cables: The first generation 32-port line cards must use the long "active" cables and the M108 cards will require OneX converters which list for $200 each. And check your NX-OS version.
  • You need to manage the oversubscription.
Inter-pod (through the core) oversubscription is identical at 2.5:1 in both examples. Intra-pod oversubscription rises from 1:1 to 2.5:1 with the addition of the FEX. Will it matter? Maybe. Do you deploy applications carefully so that server traffic tends to stay in-pod or in-rack, or do your servers get installed without regard to physical location ("any port / any server" mentality), with VMware DRS moving workload around?

We can cut oversubscription in this example down to 1.25:1 for just $4000 in FETs and 16 fiber strands by adding links between the 5500 and the 2232. This is a six-figure deployment, so that should be a drop in the bucket. You wouldn't factor in the cost of the 5500 interfaces in this cost comparison because we're still using fewer of them than in the first example.
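
The oversubscription figures are just server-facing bandwidth divided by uplink bandwidth. Here's a minimal sketch of that arithmetic. The port counts (10 servers at 10Gb/s behind a FEX, with 4 and then 8 uplinks) are assumptions picked to reproduce the 2.5:1 and 1.25:1 ratios; they aren't read off the diagrams.

```python
def oversubscription(server_ports: int, uplinks: int,
                     server_gbps: float = 10.0,
                     uplink_gbps: float = 10.0) -> float:
    """Ratio of bandwidth offered to servers vs. bandwidth available upstream."""
    if uplinks <= 0:
        raise ValueError("need at least one uplink")
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)


# Hypothetical counts: 10 servers per FEX, 4 uplinks per FEX.
before = oversubscription(server_ports=10, uplinks=4)
# Doubling the FEX uplinks halves the ratio.
after = oversubscription(server_ports=10, uplinks=8)
print(f"{before}:1 -> {after}:1")
```

Plug in your own server and uplink counts to see where a given pod would land before buying optics.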

I recognize that this topology isn't perfect for everybody, but I believe it's a better option for many networks. It Depends. But it's worth thinking about, because it might cost a lot less and be a lot more flexible in the long run.

Friday, February 10, 2012

Linux vSwitches, 802.1Q and link aggregation - putting it all together

In the process of migrating my home virtualization lab from Xen with an OpenSolaris Dom0 to a Debian GNU/Linux Dom0, I've had to figure out how to do all the usual network things in an environment I'm less familiar with.

The "usual things" for a virtualization host include:
  • An aggregate link for throughput and redundancy (NIC teaming for you server folks)
  • 802.1Q encapsulation to support multiple VLANs on the aggregate link
  • Several virtual switches, or a VLAN-aware virtual switch

In this example, I'm starting with 3 VLANs:
  • VLAN 99 is a dead-end VLAN that lives only inside this virtual server. You'd use a VLAN like this to interconnect two virtual machines (so long as they'll always run on the same server), or to connect virtual machines only to the Dom0 in the case of a routed / NATed setup
  • VLAN 101 is where I manage the Dom0 system.
  • VLAN 102 is where virtual machines talk to the external network (a non-routed / non-NATed configuration)
Here's the end result:

Aggregation, Trunking and Virtual Switch Configuration Example

VLAN 101 and 102 are carried from the physical switch across a 2x1Gb/s aggregate link. Communication between the Dom0 on VLAN 101 and the DomUs on VLAN 102 must go through a router in the physical network, so that traffic can be filtered / inspected / whathaveyou.

I didn't strictly need to create logical interface bond0.99 in my Dom0 because the external network doesn't get to see VLAN 99, and the Dom0 doesn't care to see it either. I created it here (without an IP address) because it made it simple to do things the "Debian Way" with configuration scripts, etc... I drew it with dashed lines because I believe that it's optional.

Similarly, I didn't need to create the virtual switch vlan101, but there's no harm in having it there, and I might wind up with a "management" VM (say, a RADIUS server?) that's appropriate to put on this VLAN.

Here's the contents of my /etc/network/interfaces file that created this setup:

auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 50
        bond-xmit_hash_policy layer3+4
        bond-lacp_rate fast
        bond-updelay 500
        bond-downdelay 100

# Vlan 101 is where we'll access this server.  Also, we'll
# create a bridge "vlan101" that can be attached to xen VMs.
auto bond0.101
iface bond0.101 inet manual
auto vlan101
iface vlan101 inet static
        pre-up /sbin/ip link set bond0.101 down
        pre-up /usr/sbin/brctl addbr vlan101
        pre-up /usr/sbin/brctl addif vlan101 bond0.101
        pre-up /sbin/ip link set bond0.101 up
        pre-up /sbin/ip link set vlan101 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.101/disable_ipv6
        post-up echo 0 > /proc/sys/net/ipv6/conf/vlan101/autoconf
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan101/autoconf
        post-down /sbin/ip link set vlan101 down
        post-down /usr/sbin/brctl delbr vlan101

# vlan 102 is a bridge-only vlan.  The dom0 doesn't appear on
# vlan 102, but xen VMs can be attached to it. It's attached
# to the real network.
auto bond0.102
iface bond0.102 inet manual
auto vlan102
iface vlan102 inet manual
        pre-up /sbin/ip link set bond0.102 down
        pre-up /usr/sbin/brctl addbr vlan102
        pre-up /usr/sbin/brctl addif vlan102 bond0.102
        pre-up /sbin/ip link set bond0.102 up
        pre-up /sbin/ip link set vlan102 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.102/disable_ipv6
        post-up echo 0 > /proc/sys/net/ipv6/conf/vlan102/autoconf
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan102/autoconf
        post-down /sbin/ip link set vlan102 down
        post-down /usr/sbin/brctl delbr vlan102

# vlan 99 is a bridge-only vlan.  The dom0 doesn't appear on
# vlan 99, but xen VMs can be attached to it. It goes nowhere.
auto bond0.99
iface bond0.99 inet manual
auto vlan99
iface vlan99 inet manual
        pre-up /sbin/ip link set bond0.99 down
        pre-up /usr/sbin/brctl addbr vlan99
        pre-up /usr/sbin/brctl addif vlan99 bond0.99
        pre-up /sbin/ip link set bond0.99 up
        pre-up /sbin/ip link set vlan99 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.99/disable_ipv6
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan99/disable_ipv6
        post-down /sbin/ip link set vlan99 down
        post-down /usr/sbin/brctl delbr vlan99

I know, I know... I should be ashamed of myself for turning IPv6 off on my home network! It's off on some interfaces on purpose -- I don't want to expose the Dom0 on VLAN 102, for example. Autoconfiguration would do that if I didn't intervene. The good news is that figuring out exactly what knobs to turn and in what order (the order of this file is important) was the hard part. Once I have a good handle on exactly what ports/services this Dom0 is running, I'll re-enable v6 on the interfaces where it's appropriate. The network is v6 enabled, but v6 security at home is a constant worry for me. Sure, NAT isn't a security mechanism, but it did allow me to be lazy in some regards.
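
Since the bridge-only stanzas are nearly identical, they lend themselves to templating. Here's a rough sketch that renders one in the same shape as the vlan99 block above. The function name and the VLAN 103 used in the example are made up for illustration; the commands themselves are copied from the file above.

```python
def bridge_only_stanza(vlan_id: int, bond: str = "bond0") -> str:
    """Render a bridge-only VLAN stanza in the style of the vlan99 block."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("802.1Q VLAN IDs run from 1 to 4094")
    sub = f"{bond}.{vlan_id}"   # 802.1Q subinterface, e.g. bond0.99
    br = f"vlan{vlan_id}"       # bridge (vSwitch) name, e.g. vlan99
    indent = " " * 8
    body = [
        f"pre-up /sbin/ip link set {sub} down",
        f"pre-up /usr/sbin/brctl addbr {br}",
        f"pre-up /usr/sbin/brctl addif {br} {sub}",
        f"pre-up /sbin/ip link set {sub} up",
        f"pre-up /sbin/ip link set {br} up",
        f"post-up echo 1 > /proc/sys/net/ipv6/conf/{sub}/disable_ipv6",
        f"post-up echo 1 > /proc/sys/net/ipv6/conf/{br}/disable_ipv6",
        f"post-down /sbin/ip link set {br} down",
        f"post-down /usr/sbin/brctl delbr {br}",
    ]
    header = [
        f"auto {sub}",
        f"iface {sub} inet manual",
        f"auto {br}",
        f"iface {br} inet manual",
    ]
    return "\n".join(header + [indent + line for line in body]) + "\n"


# Hypothetical new dead-end VLAN.
print(bridge_only_stanza(103))
```

The ordering inside the stanza is preserved exactly, since the order of this file matters.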

The switch configuration that goes with this setup is pretty straightforward. It's an EtherChannel running dot1q encapsulation and only allowing VLANs 101 and 102:

interface GigabitEthernet0/1
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 channel-group 1 mode active
 spanning-tree portfast trunk
interface GigabitEthernet0/2
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 channel-group 1 mode active
 spanning-tree portfast trunk
interface Port-channel1
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

Note that I'm using portfast trunk on the pSwitch. The vSwitches could be running STP, but I've disabled that feature. The VMs here are all mine, and I know that none of them will bridge two interfaces, nor will they originate any BPDUs. For an enterprise or multitenant deployment, I'd probably be inclined to run the pSwitch ports in normal mode and enable STP on the vSwitches to protect the physical network from curious sysadmins. Are you listening VMware?

Monday, February 6, 2012

NIC Surgery

I'm building a new server for use at the house, and have a requirement for lots and lots of network interfaces. The motherboard has some PCIe-x1 connectors (really short), and I have some dual-port PCIe-x4 NICs that I'd like to use, but they don't fit.

The card in question is an HP NC380T. The spec sheet says it's compatible with PCIe-x1 slots, but it doesn't physically fit. Well, it didn't anyway. I've done a bit of surgery, and now the card fits the x1 slot just fine:

Card with nibbler and kitty. I made that square notch.

Comparison with an unmolested card

Another comparison
I've since given the second card the same treatment. Both cards work fine.

I read somewhere that a 1x PCIe 1.0 slot provides up to 250MB/s. These are two-port cards that I'll be linking up at 100Mb/s, so I'm only using 20% of the bus bandwidth. The single lane bus would be a bottleneck if I ran the cards at gigabit speeds, but I expect to be fine at this speed.
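
Here's a back-of-the-envelope check of that estimate, as a sketch. 250MB/s works out to 2000Mb/s, and a PCIe lane carries that much in each direction. Whether two 100Mb/s ports come out at 10% or 20% depends on whether both directions of traffic are lumped together and compared against one direction of the bus; the `both_directions` switch below is just a framing for this sketch.

```python
PCIE1_X1_MBYTES_PER_SEC = 250                   # figure quoted above
BUS_MBITS = PCIE1_X1_MBYTES_PER_SEC * 8         # 2000 Mb/s


def bus_share(ports: int, port_mbits: int, both_directions: bool = False) -> float:
    """Fraction of the 2000 Mb/s figure consumed by the NIC ports."""
    load = ports * port_mbits * (2 if both_directions else 1)
    return load / BUS_MBITS


print(bus_share(2, 100))                        # two 100Mb/s ports, one direction
print(bus_share(2, 100, both_directions=True))  # tx and rx lumped together
print(bus_share(2, 1000))                       # two gigabit ports, one direction
```

Either way, the conclusion holds: two ports at 100Mb/s leave plenty of headroom, while two ports at gigabit speed would consume the whole lane before any protocol overhead.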

Thursday, January 26, 2012

Building Nexus vPC Keepalive Links

There's some contradictory and unhelpful information out there on vPC peer keepalive configuration. This post is a bit of a how-to, loaded with my opinions about what makes sense.

What Is It?
While often referred to as a link, vPC peer keepalive is really an application data flow between two switches. It's the mechanism by which the switches keep track of each other and coordinate their actions in a failure scenario.

Configuration can be as simple as a one-liner in vpc domain context:
vpc domain <domain-id>
  peer-keepalive destination 
Cisco's documentation recommends that you use a separate VRF for peer keepalive flows, but this isn't strictly necessary. What's important is that the keepalive traffic does not traverse the vPC peer-link nor use any vPC VLANs.

The traffic can be a simple L2 interconnect directly between the switches, or it can traverse a large routed infrastructure. The only requirement is that the switches have IP connectivity to one another via non-vPC infrastructure. There may also be a latency requirement - vPC keepalive traffic maintains a pretty tightly wound schedule. Because the switches in a vPC pair are generally quite near to one another I've never encountered any concerns in this regard.

What If It Fails?
This isn't a huge deal. A vPC switch pair will continue to operate correctly if the vPC keepalive traffic is interrupted. You'll want to get it fixed because an interruption to the vPC peer-link without vPC keepalive data would be a split-brain disaster.

Bringing a vPC domain up without the keepalive flow is complicated. This is the main reason I worry about redundancy in the keepalive traffic path. Early software releases wouldn't come up at all. In later releases, configuration levers were added (and renamed!?) to control the behavior. See Matt's comments here.

The best bet is to minimize the probability of an interruption by planning carefully, thinking about the impact of a power outage, and testing the solution. Running the vPC keepalive over gear that takes 10 minutes to boot up might not be the best idea. Try booting up the environment with the keepalive path down. Then try booting up just half of the environment.

vPC Keepalive on L2 Nexus 5xxx
The L2 Nexus 5000 and 5500 series boxes don't give you much flexibility. Basically, there are two options:
  1. Use the single mgmt0 interface in the 'management' VRF. If you use a crossover cable between chassis, then you'll never have true out-of-band IP access to the device, because all other IP interfaces exist only in the default VRF, and you've just burned up the only 'management' interface. Conversely, if you run the mgmt0 interface to a management switch, you need to weigh failure scenarios and boot-up times of your management network. Both of these options SPoF the keepalive traffic because you've only got a single mgmt0 interface to work with.
  2. Use an SVI and VLAN. If I've got 10Gb/s interfaces to burn, this is my preferred configuration: Run two twinax cables between the switches (parallel to the vPC peer-link), EtherChannel them, and allow only non-vPC VLANs onto this link. Then configure an SVI for keepalive traffic in one of those VLANs.
vPC Keepalive on L3 Nexus 55xx
A Nexus 5500 with the L3 card allows more flexibility. VRFs can be created, and interfaces assigned to them, allowing you to put keepalive traffic on a redundant point to point link while keeping it in a dedicated VRF like Cisco recommends.

vPC Keepalive on Nexus 7000
The N7K allows the greatest flexibility: use management or transit interfaces, create VRFs, etc... The key thing to know about the N7K is that if you choose to use the mgmt0 interfaces, you must connect them through an L2 switch. This is because there's an mgmt0 interface on each supervisor, but only one of them is active at any moment. The only way to ensure that both mgmt0 interfaces on switch "A" can talk to both mgmt0 interfaces on switch "B" is to connect them all to an L2 topology.

The two mgmt0 interfaces don't back each other up. It's not a "teaming" scheme. Rather, the active interface is the one on the active supervisor.

IP Addressing
Lots of options here, and it probably doesn't matter what you do. I like to configure my vPC keepalive interfaces at 169.254.<domain-id>.1 and 169.254.<domain-id>.2 with a 16-bit netmask.

My rationale here is:
  • The vPC keepalive traffic is between two systems only, and I configure them to share a subnet. Nothing else in the network needs to know how to reach these interfaces, so why use a slice of routable address space?
  • 169.254.0.0/16 is defined by RFC 3330 as the "link local" block, and that's how I'm using it. By definition, this block is not routable, and may be re-used on many broadcast domains. You've probably seen these numbers when there was a problem reaching a DHCP server. The switches won't be using RFC 3927-style autoconfiguration, but that's fine.
  • vPC domain-IDs are required to be unique, so by embedding the domain ID in the keepalive interface address, I ensure that any mistakes (cabling, etc...) won't cause unrelated switches to mistakenly identify each other as vPC peers, have overlapping IP addresses, etc...
The result looks something like this:
vpc domain 25
  peer-keepalive destination source vrf default
vlan 2
  name vPC_peer_keepalive_169.254.25.0/16
interface Vlan2
  description vPC Peer Keepalive to 5548-25-B
  no shutdown
  ip address
interface port-channel1
  description vPC Peer Link to 5548-25-B
  switchport mode trunk
  switchport trunk allowed vlan except 1-2
  vpc peer-link
  spanning-tree port type network
  spanning-tree guard loop
interface port-channel2
  description vPC keepalive link to 5548-25-B
  switchport mode trunk
  switchport trunk allowed vlan 2
  spanning-tree port type network
  spanning-tree guard loop
interface Ethernet1/2
  description 5548-25-B:1/2
  switchport mode trunk
  switchport trunk allowed vlan 2
  channel-group 2 mode active
interface Ethernet1/10
  description 5548-25-B:1/10
  switchport mode trunk
  switchport trunk allowed vlan 2
  channel-group 2 mode active
The configuration here is for switch "A" in the 25th pair of Nexus 5548s. Port-channel 1 on all switch pairs is the vPC peer link, and port-channel 2 (shown here) carries the peer keepalive traffic on VLAN 2.
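
The addressing scheme is mechanical enough to express in a few lines. This is a sketch only: the function name is invented, and handing .1 to switch "A" and .2 to switch "B" is an assumption. It also assumes the domain ID fits in a single octet, which the scheme needs in order to work at all.

```python
import ipaddress


def keepalive_interfaces(domain_id: int):
    """Return (switch A, switch B) keepalive interface addresses for a vPC domain."""
    if not 1 <= domain_id <= 255:
        raise ValueError("this scheme needs a domain ID that fits in one octet")
    a = ipaddress.ip_interface(f"169.254.{domain_id}.1/16")
    b = ipaddress.ip_interface(f"169.254.{domain_id}.2/16")
    # Sanity checks: both ends are link-local and share one subnet.
    assert a.ip.is_link_local and b.ip.is_link_local
    assert a.network == b.network
    return a, b


a, b = keepalive_interfaces(25)
print(a, b)
```

Because the domain ID is baked into the third octet, two unrelated switch pairs that get cross-cabled by mistake won't land on each other's keepalive addresses.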

Wednesday, January 25, 2012

Hot Hot Hot FEX Fix!

Cisco Nexus 2xxx units run hot in a typical server cabinet because their short depth causes them to ingest hot air produced by servers. Exhaustive (get it?) detail here.

Until now, the best fix has been the Panduit CDE2 inlet duct for Nexus 2000 and Catalyst 4948E:

The CDE2 works great, but has some downsides:
  • Street price according to a google shopping search is around US $400.
  • It doubles the space required for installing a FEX.
  • Post-deployment installation of the CDE2 will be disruptive - it can't be retrofitted.
Today I learned that Cisco has released their own fix for the N2K's airflow woes. The NXA-AIRFLOW-SLV= "Nexus Airflow Extension Sleeve" is currently orderable, and it lists for only $150!

I've never seen one of these buggers, but I've heard that it's only 1RU, which is nice. I don't have any other detail about it.

I hope that it will be simple to retrofit onto currently installed Fabric Extenders.

UPDATE 2/1/2012 I have some additional information about the NXA-AIRFLOW-SLV.

It's orderable, but the lead times tool indicates that it's on New Product Hold. Cisco tells me that the sleeve will have full documentation and will make an appearance in the dynamic configuration tool (as an N2K option) within a couple of weeks.

In the meantime, there's this: Temperature-Solutions-note.pdf

That document includes the following pictures of the implementation:
Duct with airflow sketch

Installation drawing

FEX with duct installed

Installing this bugger onto an existing FEX (especially one with servers mounted immediately above and below) will be an interesting exercise in problem solving, but looks possible. Power supply cables will need to be threaded through the duct before it's put into place. I wonder if the 2m power cords will be able to reach from the FEX, around the cold-side rack rail, and then all the way to the PDU in the hot aisle?

Also covered in the document is an interesting inlet duct (more of a hat) for reverse airflow FEXen (those with intake on the port end):
Inlet hat for reverse airflow FEX

This guy makes sense if the FEX is mounted flush with the rack rails (as shown above) and has no equipment installed in the space directly above it. It'd probably be easier to mount the FEX so that the intake vent protrudes beyond the mounting rail like this:
FEX standing proud of the rack mounting rail
...But this sort of mounting is usually only possible on the hot side of a cabinet. The cold side is usually pretty close to the cabinet door, and wouldn't tolerate 2" of FEX plus a couple of inches of cable protrusion. This accessory (the hat) doesn't seem to be orderable yet.

Monday, January 23, 2012

Nexus vPC Orphan Ports

"Orphan Port" is an important concept when working with a Cisco Nexus vPC configuration. Misunderstanding this aspect of vPC operation can lead to unnecessary downtime because of some of the funny behavior associated with orphan ports.

Before we can define an orphan port, it's important to cover a few vPC concepts. We'll use the following topology.

Here we have a couple of Nexus 5xxx switches with four servers attached. The switches are interconnected by a vPC peer link so that they can offer vPC (multi-chassis link aggregation) connections to servers. The switches also exchange vPC peer-keepalive traffic over an out-of-band connection.

Let's consider the traffic path between some of these servers:
This traffic takes a single hop from "A" to its destination via S1.
The path of this traffic depends on which link the server's hashing algorithm chooses. Traffic might go only through S1, or it might take the suboptimal path through S2 and S1 (over the peer link).
The path of this traffic is unpredictable, but always optimal. These servers might talk to each other through S1 or through S2, but their traffic will never hit the peer link under normal circumstances.
This traffic always crosses the peer link because A and D are active on different switches.

vPC Primary / Secondary - In a vPC topology (technically a vPC domain), one switch is elected primary, and the other secondary according to configurable priorities and MAC address-based entropy. The priority and role are important in cases where the topology is brought up and down, because they control how each switch will behave in these exceptional circumstances.

vPC peer link - This link is a special interconnection between two Nexus switches which allows them to collaborate in the offering of multi-chassis EtherChannel connections to downstream devices. The switches use the peer link to "get their stories straight" and unify their presentation of the LACP and STP topologies.

The switches also use the peer link to synchronize the tables they use for filtering/forwarding unicast and multicast frames.

The peer link is the centerpiece of the most important thing to know about traffic forwarding in a vPC environment: A packet which ingresses via the peer link is not allowed to egress a vPC interface under normal circumstances.

This means that a broadcast frame from server A will be flooded to B, C and S2 by S1. When the frame gets to S2, it will only be forwarded to D. S2 will not flood the frame to B and C.
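
The flooding behavior in that example can be modeled in a few lines. This is a toy sketch, not how the switch implements it: the `Port` class and `flood` function are invented for illustration, and D's standby link to S1 is left out to keep the model small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    kind: str  # "vpc", "orphan" or "peer-link"


def flood(ports, ingress):
    """Names of the ports a flooded frame leaves on, given its ingress port."""
    egress = []
    for port in ports:
        if port == ingress:
            continue  # never send a frame back out the port it arrived on
        if ingress.kind == "peer-link" and port.kind == "vpc":
            continue  # the vPC rule: peer-link ingress may not egress a vPC
        egress.append(port.name)
    return egress


PEER = Port("peer-link", "peer-link")
S1 = [Port("A", "orphan"), Port("B", "vpc"), Port("C", "vpc"), PEER]
S2 = [Port("B", "vpc"), Port("C", "vpc"), Port("D", "orphan"), PEER]

print(flood(S1, S1[0]))  # broadcast from A arrives on S1
print(flood(S2, PEER))   # the same frame arrives on S2 over the peer link
```

The first call lists B, C and the peer link; the second lists only D, which is the behavior described above.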

vPC peer keepalive - This is an IP traffic flow configured between the two switches. It must not ride over the peer link. It may be a direct connection between the two switches, or it can traverse some other network infrastructure. The peer keepalive traffic is used to resolve the dual-active scenario that might arise from loss of the peer link.

vPC VLAN - Any VLAN which is allowed onto the vPC peer link is a vPC VLAN.

Orphan Port - Any port not configured as a vPC, but which carries a vPC VLAN. The link to "A" and both links to "D" are orphan ports.
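
Those two definitions reduce to a simple set test: a port is an orphan if it isn't a vPC member and it carries at least one VLAN that's allowed on the peer link. A minimal sketch follows; the port names and VLAN numbers are invented for the example.

```python
def orphan_ports(ports, peer_link_vlans):
    """ports maps port name -> (is_vpc_member, set of VLANs carried)."""
    vpc_vlans = set(peer_link_vlans)  # any VLAN on the peer link is a vPC VLAN
    return sorted(
        name
        for name, (is_vpc, vlans) in ports.items()
        if not is_vpc and vpc_vlans & set(vlans)
    )


# Hypothetical port table for switch S1.
s1_ports = {
    "to-A": (False, {10}),       # single-attached server on a vPC VLAN
    "to-B": (True, {10}),        # vPC member
    "to-C": (True, {10}),        # vPC member
    "to-D": (False, {10}),       # active/standby NIC, not a vPC
    "mgmt-sw": (False, {999}),   # non-vPC VLAN only, so not an orphan
}
print(orphan_ports(s1_ports, peer_link_vlans={10, 20}))
```

Note that the management port is not an orphan: it isn't a vPC, but it carries no vPC VLAN either.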

So why do orphan ports matter?
Latency: Traffic destined for orphan ports has a 50/50 chance of winding up on the wrong switch, so it will have to traverse the peer link to get to its destination. Sure, it's only a single extra L2 hop, but it's ugly.

Bandwidth: The vPC peer link ordinarily does not need to handle any unicast user traffic. It's not part of the switching fabric, and it's commonly configured as a 20Gb/s link even if the environment has much higher uplinks and downlinks. Frames crossing the peer link will incur an extra header (this is how S2 knows not to flood the broadcast to B and C in the previous example) and possibly overwhelm the link. I've only ever seen this happen in a test environment, but it was ugly.

Shutdown: This is the big one. If the peer link is lost, bad things happen. The vPC secondary switch (probably the switch that rebooted last, not necessarily the one you intend) will disable all of his vPC interfaces, including the link up to the distribution or core layers. In this case, server D will be left high-and-dry, unable to talk to anybody. Will server D flip over to his alternate NIC? Most teaming schemes decide to fail over based on loss of link. D's primary link will not go down.

If the switches are layer-3 capable, the SVIs for vPC VLANs will go down too, leaving orphan ports unable to route their way out of the VLAN as well.
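
To make the failure mode concrete, here's a toy model of link state after the peer link is lost. It assumes the keepalive flow is still up, so the secondary knows the primary is alive. The function and port names are invented, and real switches have more states than up and down.

```python
def link_state_after_peer_link_loss(ports, role):
    """ports maps port name -> True if it is a vPC member. Returns name -> link up?"""
    if role == "secondary":
        # The secondary shuts its vPC member ports; everything else stays up.
        return {name: not is_vpc for name, is_vpc in ports.items()}
    return {name: True for name in ports}


# Hypothetical port table for S2, acting as vPC secondary.
s2 = {"to-B": True, "to-C": True, "uplink-to-core": True, "to-D": False}
state = link_state_after_peer_link_loss(s2, "secondary")
print(state)
```

In this model the port to D is the only one left up on S2, so D's NIC still sees link and has no reason to fail over, even though there's nowhere for its traffic to go.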

No Excuse
There are configuration levers that allow us to work around these failure scenarios, but I find it easier to just avoid the problem in the first place by deploying everything dual-attached with LACP. Just don't create orphan ports.

We're talking about the latest and greatest in Cisco data center switching. It's expensive stuff, even on a per-port basis. Almost everything in the data center can run LACP these days (Solaris 8 and VMware being notable exceptions), so why not build LACP links?