Wednesday, February 2, 2011

OSPF Going Clockwise

OSPF path selection depends on a router finding the lowest-cost path between itself and the destination network.  Path cost is the cumulative cost of the interfaces in the path, and interface cost is generally derived from interface bandwidth.
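As a reminder of where those interface costs come from, here's a quick sketch of the usual IOS formula (reference bandwidth divided by interface bandwidth, floored at 1), assuming the default 100Mb/s reference bandwidth:

```python
# Sketch of how IOS derives OSPF interface cost from bandwidth,
# assuming the default auto-cost reference bandwidth of 100 Mb/s.
def ospf_cost(bandwidth_bps, reference_bps=100_000_000):
    """Cost = reference bandwidth / interface bandwidth, floored at 1."""
    return max(1, reference_bps // bandwidth_bps)

print(ospf_cost(100_000_000))  # FastEthernet -> 1
print(ospf_cost(10_000_000))   # Ethernet     -> 10
```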

Okay, nothing new here.  But I skipped a critical detail:  Which interfaces are we talking about exactly?

What happens when neighboring routers disagree about the cost of a transit network?


The different interface types (100Mb/s and 10Mb/s) in this topology are going to cause these routers to disagree about the cost of moving a packet across the 10.0.1.0/31 network.  R1's Fast Ethernet interface cost is 1, while R2's Ethernet interface cost is 10.

In OSPF, all routers have a complete picture of the network topology, so each router knows the other's idea of the cost:


R1#show ip ospf database router 10.0.0.2

OSPF Router with ID (10.0.0.1) (Process ID 1)

Router Link States (Area 0)

LS age: 1233
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 10.0.0.2
Advertising Router: 10.0.0.2
LS Seq Number: 80000005
Checksum: 0x11A5
Length: 60
Number of Links: 3

Link connected to: a Stub Network
(Link ID) Network/subnet number: 10.0.0.2
(Link Data) Network Mask: 255.255.255.255
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: another Router (point-to-point)
(Link ID) Neighboring Router ID: 10.0.0.1
(Link Data) Router Interface address: 10.0.1.1
Number of TOS metrics: 0
TOS 0 Metrics: 10

Link connected to: a Stub Network
(Link ID) Network/subnet number: 10.0.1.0
(Link Data) Network Mask: 255.255.255.254
Number of TOS metrics: 0
TOS 0 Metrics: 10

R2#show ip ospf database router 10.0.0.1

OSPF Router with ID (10.0.0.2) (Process ID 1)

Router Link States (Area 0)

LS age: 1269
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 10.0.0.1
Advertising Router: 10.0.0.1
LS Seq Number: 80000005
Checksum: 0x8843
Length: 60
Number of Links: 3

Link connected to: a Stub Network
(Link ID) Network/subnet number: 10.0.0.1
(Link Data) Network Mask: 255.255.255.255
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: another Router (point-to-point)
(Link ID) Neighboring Router ID: 10.0.0.2
(Link Data) Router Interface address: 10.0.1.0
Number of TOS metrics: 0
TOS 0 Metrics: 1

Link connected to: a Stub Network
(Link ID) Network/subnet number: 10.0.1.0
(Link Data) Network Mask: 255.255.255.254
Number of TOS metrics: 0
TOS 0 Metrics: 1


So what's the cost of this link?  As it turns out, that's asking the wrong question.  Links don't have costs in OSPF.  OSPF deals in terms of interface transmit cost only, and gives no consideration to the cost associated with receiving a packet on an interface.

Both routers in the topology will agree that an Eastbound packet on this link accrues a cost of 1, while a Westbound packet accrues a cost of 10.



This can lead to some seriously bad decision making on OSPF's part.  Assume that OSPF's auto cost reference bandwidth has been set to 1Gb/s here:


R1 will evenly balance northbound traffic across R2 and R3 here, in spite of R3's reduced capacity, because R1's cost to transmit onto this multiaccess network is the same regardless of the next hop router's interface bandwidth.
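To make the tie concrete, here's a sketch of R1's path math under an assumed version of this topology (gigabit segment and uplinks, 1Gb/s reference bandwidth, R3's segment interface downspeeded to 100Mb/s).  The link names and costs are illustrative, not taken from real output:

```python
# Hypothetical topology: R1 sits on a gigabit multiaccess segment with
# R2 and R3; both have identical gigabit uplinks northbound.  With a
# 1 Gb/s reference bandwidth every transmit cost below is 1, except
# that R3's segment interface has downspeeded to 100 Mb/s (cost 10).
# That cost is only paid when R3 *transmits* onto the segment, so it
# never shows up in R1's northbound path calculation.
transmit_cost = {
    ("R1", "segment"): 1,   # R1's gig interface onto the shared segment
    ("R2", "uplink"):  1,   # R2's gig uplink toward the core
    ("R3", "uplink"):  1,   # R3's gig uplink toward the core
    ("R3", "segment"): 10,  # R3's downspeeded interface: R1's receive side
}

via_r2 = transmit_cost[("R1", "segment")] + transmit_cost[("R2", "uplink")]
via_r3 = transmit_cost[("R1", "segment")] + transmit_cost[("R3", "uplink")]
print(via_r2, via_r3)  # equal costs -> R1 installs both paths (ECMP)
```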

Now, I'm a staunch supporter of Ethernet autonegotiation (It works!  Use it!  Don't force things!), except in cases where reducing the link speed to 100Mb/s will hurt.  This is one of them.  Issuing 'speed 1000' on either end of this link is enough to cause the R3 link to fail rather than downspeed to 100Mb/s when it has a problem on pin 4, 5, 7 or 8.



Here's another fun example.

Every OSPF route in this network goes clockwise.  The path from R1 to R4's loopback includes 3 hops through 10Mb/s interfaces.  It's clearly preferable for R1 to use 192.168.41.4 as the next hop for R4's loopback, but OSPF will never choose it.

R1#show ip route ospf
O 192.168.23.0/24 [110/2] via 192.168.12.2, 00:14:19, FastEthernet0/0
O 192.168.34.0/24 [110/3] via 192.168.12.2, 00:14:19, FastEthernet0/0
192.168.0.0/32 is subnetted, 4 subnets
O 192.168.0.2 [110/2] via 192.168.12.2, 00:14:19, FastEthernet0/0
O 192.168.0.3 [110/3] via 192.168.12.2, 00:14:19, FastEthernet0/0
O 192.168.0.4 [110/4] via 192.168.12.2, 00:14:19, FastEthernet0/0
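A plain Dijkstra run over an assumed version of this ring (clockwise transmit interfaces at cost 1, counterclockwise at cost 10; the topology details are my guess, not pulled from the output above) reproduces the clockwise-only result:

```python
import heapq

# Assumed ring topology R1-R2-R3-R4-R1: every clockwise transmit
# interface is FastEthernet (cost 1), every counterclockwise one is
# 10 Mb/s Ethernet (cost 10).  Each directed edge carries the
# *transmit* cost only, just as OSPF sees it.
edges = {
    ("R1", "R2"): 1, ("R2", "R3"): 1, ("R3", "R4"): 1, ("R4", "R1"): 1,
    ("R2", "R1"): 10, ("R3", "R2"): 10, ("R4", "R3"): 10, ("R1", "R4"): 10,
}

def spf(source):
    """Plain Dijkstra over the directed transmit costs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for (a, b), cost in edges.items():
            if a == node and d + cost < dist.get(b, float("inf")):
                dist[b] = d + cost
                heapq.heappush(heap, (d + cost, b))
    return dist

# R1 reaches R4 clockwise (1+1+1 = 3) rather than directly (10);
# add the loopback's stub cost of 1 and you get the [110/4] metric.
print(spf("R1")["R4"] + 1)
```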


OSPF considers transmit cost only. 

Tuesday, February 1, 2011

Load Balance Until It Hurts

A few months ago, Ethan wrote a great article detailing a basic network design consisting of layer-3 distribution switches and layer-2 access switches.

Included in that design is the common strategy of splitting STP root bridge duty between the two distribution switches in order to balance traffic across access layer uplinks:  Distribution switch A is configured as STP root for odd-numbered VLANs, and STP root for even-numbered VLANs goes to distribution switch B.

In the comments to Ethan's article I said that, while the strategy is perfectly valid and commonly implemented, it's definitely not a one-size-fits-all scenario.  About half of my customers choose to not balance traffic in this manner because they:
  • Can easily pay for twice the bandwidth they truly require
  • Expect required bandwidth to be available at all times, even during failures
  • Don't want to be surprised by a sudden reduction in available bandwidth

Balancing VLANs across uplinks will result in reduced transit capacity during a failure.  Some environments would rather take advantage of extra capacity when it's available (most of the time), while others demand consistent network behavior.

Ethan didn't mention it, but the FHRP mechanism usually gets balanced in lockstep with the STP root.  The minimally-articulated result looks something like this:



When discussing this design, lots of network folks will tell you something like:
You have to put your HSRP primary and STP root on the same box!
My ears perk right up when a design detail makes the transition from "It's a good idea to X because of Y" into "You have to Z."

"Have to?"  There's no shortage of engineers who believe this is an absolute requirement.  Sure, it's nice, but is it required?  Meh.  This particular bit of dogma has perplexed me for a while.  I think the goal here relates to an attempt to avoid the extra east/west bridging hop between distribution switches.  It's a worthy goal, but it:
  1. Only impacts outbound traffic.  Inbound traffic has a 50/50 chance of being routed in by the "wrong" distribution switch, and crossing the east/west link anyway.  Given the inbound-heavy traffic patterns on a typical desktop LAN, it seems like misplaced focus.
  2. Guarantees the appearance of the nasty asymmetric-routing / unknown-unicast-flooding problem:  Where will the "wrong" distribution switch bridge your traffic if he never hears from you?  The fix is easy, but rarely implemented.  This bugger can be a much bigger problem than abusing the cross-link, especially in a mixed-speed environment with shared ASICs and buffers.
I'm not saying "don't balance your traffic", nor am I saying "don't align STP and FHRP."  I'm saying: "Know your network, know your traffic, and question authority."

Okay, having covered these facets of a popular design, it's time to get to the point.  When does load balancing start to hurt?  It hurts when we decide to use GLBP instead of HSRP or VRRP.  GLBP is sexy because it allows multiple active gateway routers.  In a small office with just two routers and two WAN links this is great:  Outbound user traffic can get proportionally balanced across WAN links very simply.  But what about in our ECMP campus?
  • There's no load sharing advantage because that workload typically gets balanced by distributing HSRP priority among VLANs.
  • There's no Distribution->Core advantage because CEF on the (already balanced) distribution switches will balance upstream traffic.
There are real L2/L3 enterprise networks like this out there.  They have carefully-groomed STP root bridges paired with carefully-prioritized GLBP:  the GLBP priorities are tweaked to align with the preferred STP root, and GLBP preemption is enabled to make darn sure that the intended switch is the live Active Virtual Gateway.  Like this:




Distribution A:

interface vlan 11
 glbp 0 preempt
interface vlan 12
 glbp 0 priority 90
interface vlan 13
 glbp 0 preempt
interface vlan 14
 glbp 0 priority 90

Distribution B:

interface vlan 11
 glbp 0 priority 90
interface vlan 12
 glbp 0 preempt
interface vlan 13
 glbp 0 priority 90
interface vlan 14
 glbp 0 preempt

This hurts here because it runs directly counter to the load-balancing, hop-avoiding philosophy underpinning this design.  The only thing being tuned here is the location of the GLBP AVG (the box that answers ARP queries).  Forwarding workload still gets split between the distribution switches, so 50% of outbound traffic (the only thing we can easily control) is now forced to make the extra hop between them.  The priority and preemption tuning just amounts to extra typing:  it really doesn't matter which switch answers ARP queries in this scenario, and the design guarantees higher latency and link utilization than the more common HSRP configuration.
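To illustrate the effect, here's a toy simulation of GLBP's default round-robin load balancing on a VLAN whose STP root is distribution switch A.  The host names and switch labels are made up:

```python
from itertools import cycle

# Hypothetical sketch of GLBP's default round-robin behavior: the AVG
# alternates AVF virtual MACs in its ARP replies, with no regard for
# where the VLAN's STP root lives.
avf_macs = cycle(["dist-A", "dist-B"])  # one virtual MAC per forwarder

hosts = [f"host{n}" for n in range(100)]
gateway_of = {host: next(avf_macs) for host in hosts}  # ARP replies in order

# Hosts handed dist-B's virtual MAC still bridge up toward A (the STP
# root for this VLAN), then cross the east/west link to reach their
# forwarder on B -- the extra hop the design was trying to avoid.
extra_hop = sum(1 for g in gateway_of.values() if g == "dist-B")
print(extra_hop)  # half the hosts take the extra inter-switch hop
```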

Sure, it works fine.  But it's a lot of extra typing, and results in ever-so-slightly worse performance.