Monday, February 13, 2012

10Gb/s Server Access Layer - Use The FEX!

Several people who read the four-part 10Gb/s pricing series reported that the central thesis wasn't clear enough. So, here it is again:

10Gb/s Servers? Rack your Nexus 5500 with the core switches. Connect servers to Nexus 2232s.

I know of several networks that look something like this:
Top Of Rack Nexus 5500

I think that this might be a better option:
Centralized Nexus 5500

We save lots of money on optics by moving the Nexus 5500 out of the server rack and into the vicinity of the Nexus 7000 core. Then we spend that savings on Nexus 2232s, FETs and TwinAx. These two deployments cost almost exactly the same amount.

The pricing is pretty much a wash, but we end up with the following advantages:
  • The ability to support 10GBASE-T servers - I expect this to be a major gotcha for some shops in the next few months.
  • Inexpensive (this is a relative term) 1Gb/s ports at top of rack for low speed connections
  • Greater flexibility for oversubscription (these servers are unlikely to need line rate connections)
  • Greater flexibility for equipment placement (drop an inexpensive FEX wherever you need it)
  • Look at all those free ports! 5K usage has dropped from 24 ports to 8 ports each! Think of how inexpensive the next batch of 10Gig racks will be if we only have to buy 2232s. And the next. And the next...
It's not immediately apparent, but oversubscription is an advantage of this design. With top-of-rack 5500, you can't oversubscribe the thing; you must dedicate a 10Gb/s port to every server whether that's sensible or not. With FEXes you get to choose: oversubscribe them, or don't.

The catches with this setup are:
  • The core has to be able to support TwinAx cables: the first-generation 32-port line cards must use the long "active" cables, and the M108 cards will require OneX converters, which list for $200 each. And check your NX-OS version.
  • You need to manage the oversubscription.
Inter-pod (through the core) oversubscription is identical at 2.5:1 in both examples. Intra-pod oversubscription rises from 1:1 to 2.5:1 with the addition of the FEX. Will it matter? Maybe. Do you deploy applications carefully so that server traffic tends to stay in-pod or in-rack, or do your servers get installed without regard to physical location ("any port / any server" mentality), with VMware DRS moving workload around?

We can cut oversubscription in this example down to 1.25:1 for just $4000 in FETs and 16 fiber strands by adding links between the 5500 and the 2232. This is a six-figure deployment, so that should be a drop in the bucket. You wouldn't factor the cost of the 5500 interfaces into this comparison because we're still using fewer of them than in the first example.
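The arithmetic behind those ratios is simple enough to sketch. The per-FEX server count below is my assumption (20 attached 10Gb/s servers per 2232, chosen because it matches the 2.5:1 figure); the point is that doubling the uplinks halves the ratio:

```python
# Oversubscription ratio: server-facing capacity vs. uplink capacity.
# Assumed topology: 20 x 10Gb/s servers on a Nexus 2232, with either
# 8 or 16 10Gb/s uplinks to the 5500.

def oversub(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Return the intra-pod oversubscription ratio (N means N:1)."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)

print(oversub(20, 10, 8, 10))   # 2.5  -> 2.5:1 with 8 uplinks
print(oversub(20, 10, 16, 10))  # 1.25 -> 1.25:1 with 16 uplinks
```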

I recognize that this topology isn't perfect for everybody, but I believe it's a better option for many networks. It Depends. But it's worth thinking about, because it might cost a lot less and be a lot more flexible in the long run.

Friday, February 10, 2012

Linux vSwitches, 802.1Q and link aggregation - putting it all together

In the process of migrating my home virtualization lab from Xen with an OpenSolaris Dom0 to a Debian GNU/Linux Dom0, I've had to figure out how to do all the usual network things in an environment I'm less familiar with.

The "usual things" for a virtualization host include:
  • An aggregate link for throughput and redundancy (NIC teaming for you server folks)
  • 802.1Q encapsulation to support multiple VLANs on the aggregate link
  • Several virtual switches, or a VLAN-aware virtual switch

In this example, I'm starting with 3 VLANs:
  • VLAN 99 is a dead-end VLAN that lives only inside this virtual server. You'd use a VLAN like this to interconnect two virtual machines (so long as they'll always run on the same server), or to connect virtual machines only to the Dom0 in the case of a routed / NATed setup.
  • VLAN 101 is where I manage the Dom0 system.
  • VLAN 102 is where virtual machines talk to the external network (a non-routed / non-NATed configuration).
Here's the end result:

Aggregation, Trunking and Virtual Switch Configuration Example

VLAN 101 and 102 are carried from the physical switch across a 2x1Gb/s aggregate link. Communication between the Dom0 on VLAN 101 and the DomUs on VLAN 102 must go through a router in the physical network, so that traffic can be filtered / inspected / whathaveyou.

I didn't strictly need to create logical interface bond0.99 in my Dom0 because the external network doesn't get to see VLAN 99, and the Dom0 doesn't care to see it either. I created it here (without an IP address) because it made it simple to do things the "Debian Way" with configuration scripts, etc. I drew it with dashed lines because I believe that it's optional.

Similarly, I didn't need to create the virtual switch vlan101, but there's no harm in having it there, and I might wind up with a "management" VM (say, a RADIUS server?) that's appropriate to put on this VLAN.

Here's the contents of my /etc/network/interfaces file that created this setup:

auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 50
        bond-xmit_hash_policy layer3+4
        bond-lacp_rate fast
        bond-updelay 500
        bond-downdelay 100

# Vlan 101 is where we'll access this server.  Also, we'll
# create a bridge "vlan101" that can be attached to xen VMs.
auto bond0.101
iface bond0.101 inet manual
auto vlan101
iface vlan101 inet static
# example management address -- substitute your own
        address 192.0.2.10
        netmask 255.255.255.0
        pre-up /sbin/ip link set bond0.101 down
        pre-up /usr/sbin/brctl addbr vlan101
        pre-up /usr/sbin/brctl addif vlan101 bond0.101
        pre-up /sbin/ip link set bond0.101 up
        pre-up /sbin/ip link set vlan101 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.101/disable_ipv6
        post-up echo 0 > /proc/sys/net/ipv6/conf/vlan101/autoconf
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan101/disable_ipv6
        post-down /sbin/ip link set vlan101 down
        post-down /usr/sbin/brctl delbr vlan101

# vlan 102 is a bridge-only vlan.  The dom0 doesn't appear on
# vlan 102, but xen VMs can be attached to it. It's attached
# to the real network.
auto bond0.102
iface bond0.102 inet manual
auto vlan102
iface vlan102 inet manual
        pre-up /sbin/ip link set bond0.102 down
        pre-up /usr/sbin/brctl addbr vlan102
        pre-up /usr/sbin/brctl addif vlan102 bond0.102
        pre-up /sbin/ip link set bond0.102 up
        pre-up /sbin/ip link set vlan102 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.102/disable_ipv6
        post-up echo 0 > /proc/sys/net/ipv6/conf/vlan102/autoconf
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan102/disable_ipv6
        post-down /sbin/ip link set vlan102 down
        post-down /usr/sbin/brctl delbr vlan102

# vlan 99 is a bridge-only vlan.  The dom0 doesn't appear on
# vlan 99, but xen VMs can be attached to it. It goes nowhere.
auto bond0.99
iface bond0.99 inet manual
auto vlan99
iface vlan99 inet manual
        pre-up /sbin/ip link set bond0.99 down
        pre-up /usr/sbin/brctl addbr vlan99
        pre-up /usr/sbin/brctl addif vlan99 bond0.99
        pre-up /sbin/ip link set bond0.99 up
        pre-up /sbin/ip link set vlan99 up
        post-up echo 1 > /proc/sys/net/ipv6/conf/bond0.99/disable_ipv6
        post-up echo 1 > /proc/sys/net/ipv6/conf/vlan99/disable_ipv6
        post-down /sbin/ip link set vlan99 down
        post-down /usr/sbin/brctl delbr vlan99

I know, I know... I should be ashamed of myself for turning IPv6 off on my home network! It's off on some interfaces on purpose -- I don't want to expose the Dom0 on VLAN 102, for example, and autoconfiguration would do exactly that if I didn't intervene. The good news is that figuring out exactly which knobs to turn, and in what order (the order of this file is important), was the hard part. Once I have a good handle on exactly what ports/services this Dom0 is running, I'll re-enable v6 on the interfaces where it's appropriate. The network is v6-enabled, but v6 security at home is a constant worry for me. Sure, NAT isn't a security mechanism, but it did allow me to be lazy in some regards.

The switch configuration that goes with this setup is pretty straightforward. It's an EtherChannel running dot1q encapsulation and only allowing VLANs 101 and 102:

interface GigabitEthernet0/1
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 channel-group 1 mode active
 spanning-tree portfast trunk
interface GigabitEthernet0/2
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 channel-group 1 mode active
 spanning-tree portfast trunk
interface Port-channel1
 switchport trunk allowed vlan 101,102
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

Note that I'm using portfast trunk on the pSwitch. The vSwitches could be running STP, but I've disabled that feature. The VMs here are all mine, and I know that none of them will bridge two interfaces, nor will they originate any BPDUs. For an enterprise or multitenant deployment, I'd probably be inclined to run the pSwitch ports in normal mode and enable STP on the vSwitches to protect the physical network from curious sysadmins. Are you listening, VMware?
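If I ever do want STP on a vSwitch, bridge-utils exposes it as "brctl stp <bridge> on". In the Debian Way, that's one more pre-up line in the bridge stanza -- sketched here for vlan102; the stp line is the only change from the file above:

```
auto vlan102
iface vlan102 inet manual
        pre-up /sbin/ip link set bond0.102 down
        pre-up /usr/sbin/brctl addbr vlan102
        pre-up /usr/sbin/brctl stp vlan102 on
        pre-up /usr/sbin/brctl addif vlan102 bond0.102
        ...
```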

Monday, February 6, 2012

NIC Surgery

I'm building a new server for use at the house, and have a requirement for lots and lots of network interfaces. The motherboard has some PCIe-x1 connectors (really short), and I have some dual-port PCIe-x4 NICs that I'd like to use, but they don't fit.

The card in question is an HP NC380T. The spec sheet says it's compatible with PCIe-x1 slots, but it doesn't physically fit. Well, it didn't, anyway. I've done a bit of surgery, and now the card fits the x1 slot just fine:

Card with nibbler and kitty. I made that square notch.

Comparison with an unmolested card

Another comparison
I've since given the second card the same treatment. Both cards work fine.

I read somewhere that a 1x PCIe 1.0 slot provides up to 250MB/s. These are two-port cards that I'll be linking up at 100Mb/s, so I'm only using 20% of the bus bandwidth. The single lane bus would be a bottleneck if I ran the cards at gigabit speeds, but I expect to be fine at this speed.
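That back-of-envelope figure checks out if you count both ports, in both directions, against the slot's 250MB/s:

```python
# Rough check of the PCIe x1 bandwidth claim. Assumption: a PCIe 1.0
# x1 slot offers about 250MB/s, and we total NIC traffic across both
# ports and both directions, as the estimate above does.

PCIE_X1_MB_S = 250  # approx. PCIe 1.0 x1 bandwidth

def utilization(ports, port_mbit_s, duplex=2):
    """Fraction of the slot's bandwidth consumed by NIC traffic."""
    total_mbit = ports * port_mbit_s * duplex
    return (total_mbit / 8) / PCIE_X1_MB_S  # 8 bits per byte

print(utilization(2, 100))   # 0.2 -> 20% at 100Mb/s
print(utilization(2, 1000))  # 2.0 -> gigabit would swamp the x1 slot
```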