Tuesday, December 20, 2011

Pricing and Trading Networks: Down is Up, Left is Right

My introduction to enterprise networking was a little backward. I started out supporting trading floors, backend pricing systems, low-latency algorithmic trading systems, etc... I got there because I'd been responsible for UNIX systems producing and consuming multicast data at several large financial firms.

Inevitably, the firm's network admin folks weren't up to speed on matters of performance tuning, multicast configuration and QoS, so that's where I focused my attention. One of these firms offered me a job with the word "network" in the title, and I was off to the races.

It amazes me how little I knew in those days. I was doing PIM and MSDP designs before the phrases "link state" and "distance vector" were in my vocabulary! I had no idea what was populating the unicast routing table of my switches, but I knew that the table was populated, and I knew what PIM was going to do with that data.

More incredible is how my ignorance of "normal" ways of doing things (AVVID, SONA, Cisco Enterprise Architecture, multi-tier designs, etc...) gave me an advantage over folks who had been properly indoctrinated. My designs worked well for these applications, but looked crazy to the rest of the network staff (whose underperforming traditional designs I was replacing).

The trading floor is a weird place, with funny requirements. In this post I'm going to go over some of the things that make trading floor networking... Interesting.

Redundant Application Flows
The first thing to know about pricing systems is that you generally have two copies of any pricing data flowing through the environment at any time.  Ideally, these two sets originate from different head-end systems, get transit from different wide area service providers, ride different physical infrastructure into opposite sides of your data center, and terminate on different NICs in the receiving servers.

If you're getting data directly from an exchange, that data will probably be arriving as multicast flows. Redundant multicast flows. The same data arrives at your edge from two different sources, using two different multicast groups.

If you're buying data from a value-add aggregator (Reuters, Bloomberg, etc...), then it probably arrives via TCP from at least two different sources. The data may be duplicate copies (redundancy), or be distributed among the flows with an N+1 load-sharing scheme.
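
To make the A/B arrangement concrete, here's a minimal sketch of a feed handler that joins both groups on separate NICs and arbitrates between them by sequence number. The group addresses, port, interface IPs, header layout, and process_update() are all assumptions for illustration, not any vendor's actual API.

    import select
    import socket
    import struct

    def open_feed(group, port, local_ip):
        # Join one multicast group via a specific local interface (one NIC per feed).
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", port))
        mreq = socket.inet_aton(group) + socket.inet_aton(local_ip)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        return s

    def process_update(payload):
        pass  # hand the message to the pricing cache (not shown here)

    feed_a = open_feed("239.1.1.1", 30001, "10.1.1.10")  # NIC facing the "A" side
    feed_b = open_feed("239.2.1.1", 30001, "10.2.1.10")  # NIC facing the "B" side

    highest_seen = 0
    while True:
        ready, _, _ = select.select([feed_a, feed_b], [], [])
        for sock in ready:
            data, _addr = sock.recvfrom(65535)
            (seq,) = struct.unpack_from("!Q", data, 0)  # assumed: 8-byte sequence number first
            if seq <= highest_seen:
                continue  # already handled via the other feed; drop the duplicate
            highest_seen = seq
            process_update(data)

Real arbitrators also track per-feed gaps and line quality, but the core idea is the same: take whichever copy of each message arrives first and quietly drop the other.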

Losing One Packet Is Bad
Most application flows have no problem with packet loss. High performance trading systems are not in this category.

Think of the state of the pricing data like a spreadsheet. Each row represents a security -- something that traders buy and sell. The columns represent attributes of that security: bid price, ask price, daily high and low, last trade price, last trade exchange, etc...

Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million cells. Every message that rolls in from a multicast feed updates one of those cells. You just lost a packet. Which cell is wrong? Easy answer: All of them. If a trader can't trust his data, he can't trade.
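
A tiny sketch of that "spreadsheet" shows why one lost packet is so poisonous: every update is a delta to a single cell, keyed here by an assumed (symbol, field) pair and a feed sequence number, so a gap means you can no longer say which cell is stale. The field names and message shape are invented for illustration.

    class PriceCache:
        def __init__(self):
            self.cells = {}          # (symbol, field) -> value, e.g. ("CSCO", "bid") -> 18.02
            self.expected_seq = 1
            self.trustworthy = True

        def apply(self, seq, symbol, field, value):
            if seq != self.expected_seq:
                # A message was lost somewhere. Any of the ~20 million cells could
                # now be wrong, so the whole cache is suspect until it's repaired.
                self.trustworthy = False
            self.expected_seq = seq + 1
            self.cells[(symbol, field)] = value

    cache = PriceCache()
    cache.apply(1, "CSCO", "bid", 18.02)
    cache.apply(3, "IBM", "last", 183.57)  # sequence 2 never arrived...
    print(cache.trustworthy)               # ...so nothing in the cache can be trusted: False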

Reconvergence Is Bad
Because we've got two copies of the data coming in, there's no reason to fix a single failure. If something breaks, you can let it stay broken until the end of the day.

What's that? You think it's worth fixing things with a dynamic routing protocol? Okay cool, route around the problem. Just so long as you can guarantee that "flow A" and "flow B" never traverse the same core router. Why am I paying for two copies of this data if you're going to push it through a single device? You just told me that the device is so fragile that you feel compelled to route around failures!

Don't Cluster the Firewalls
The same reason we don't let routing reconverge applies here. If there are two pricing firewalls, don't tell them about each other. Run them as standalone units. Put them in separate rooms, even. We can afford to lose half of a redundant feed. We cannot afford to lose both feeds, even for the few milliseconds required for the standby firewall to take over. Two clusters (four firewalls) would be okay, just keep the "A" and "B" feeds separate!

Don't team the server NICs
The flow-splitting logic applies all the way down to the servers. If they've got two NICs available for incoming pricing data, these NICs should be dedicated per-flow. Even if there are NICs-a-plenty, the teaming schemes are all bad news because, like flows, application components are also disposable. It's okay to lose one. Getting one back? That's sometimes worse. Keep reading...

Recovery Can Kill You
Most of these pricing systems include a mechanism for data receivers to request retransmission of lost data, but the recovery can be a problem. With few exceptions, the network applications in use on the trading floor don't do any sort of flow control. It's like they're trying to hurt you.

Imagine a university lecture where a sleeping student wakes up, asks the lecturer to repeat the last 30 minutes, and the lecturer complies. That's kind of how these systems work.

Except that the lecturer complies at wire speed, and the whole lecture hall full of students is compelled to continue taking notes. Why should every other receiver be penalized because one system screwed up? I've got trades to clear!

The following snapshot is from the Cisco CVD for trading systems. It shows how aggressive these systems can be. A nominal 5Mb/s trading application regularly hits wire speed (100Mb/s) in this case.

The graph shows a small network when things are working right. A big trading backend at a large financial services firm can easily push that green line into the multi-gigabit range. Make things interesting by breaking stuff and you'll over-run even your best 10Gb/s switch buffers (6716 cards have 90MB per port) easily.

Slow Servers Are Good
Lots of networks run with clients deliberately connected at slower speeds than their server. Maybe you have 10/100 ports in the wiring closet and gigabit-attached servers. Pricing networks require exactly the opposite. The lecturer in my analogy isn't just a single lecturer. It's a team of lecturers. They all go into wire-speed mode when the sleeping student wakes up.

How will you deliver multiple simultaneous gigabit-ish multicast streams to your access ports? You can't. I've fixed more than one trading system by setting server interfaces down to 100Mb/s or even 10Mb/s. Fast clients, slow servers is where you want to be.

Slowing down the servers can turn N*1Gb/s worth of data into N*100Mb/s -- something we can actually handle.

Bad Apple Syndrome
The sleeping student example is actually pretty common. It's amazing to see the impact that can arise from things like:
  • a clock update on a workstation
  • ripping a CD with iTunes
  • briefly closing the lid on a laptop
The trading floor is usually a population of Windows machines with users sitting behind them. Keeping these things from killing each other is a daunting task. One bad apple will truly spoil the bunch.

How Fast Is It?
System performance is usually measured in terms of stuff per interval. That's meaningless on the trading floor. The opening bell at NYSE is like turning on a fire hose. The only metric that matters is the answer to this question: Did you spill even one drop of water?

How close were you to the limit? Will you make it through tomorrow's trading day too?

I read on twitter that Ben Bernanke got a bad piece of fish for dinner. How confident are you now? Performance of these systems is binary. You either survived or you did not. There is no "system is running slow" in this world.

Routing Is Upside Down
While not unique to trading floors, we do lots of multicast here. Multicast is funny because it relies on routing traffic away from the source, rather than routing it toward the destination. Getting into and staying in this mindset can be a challenge. I started out with no idea how routing worked, so had no problem getting into the multicast mindset :-)
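
For the curious, here's a rough sketch of the Reverse Path Forwarding (RPF) check that makes multicast feel upside down: a router accepts a multicast packet only if it arrived on the interface the unicast table points back toward the source, and then forwards it away from that source. The routing table and interface names are made up for the example.

    unicast_routes = {            # destination prefix -> interface toward that prefix
        "10.1.0.0/16": "Te1/1",   # path back toward the pricing sources
        "10.2.0.0/16": "Te1/2",
    }

    def rpf_interface(source_ip):
        # Stand-in for a real longest-prefix lookup: match the /16 only.
        prefix = ".".join(source_ip.split(".")[:2]) + ".0.0/16"
        return unicast_routes.get(prefix)

    def forward_multicast(source_ip, arrived_on, candidate_oifs):
        if arrived_on != rpf_interface(source_ip):
            return []             # RPF failure: drop it rather than risk a loop
        # Accepted: flood away from the source, out every interface with receivers.
        return [oif for oif in candidate_oifs if oif != arrived_on]

    print(forward_multicast("10.1.5.9", "Te1/1", ["Te2/1", "Te2/2"]))  # ['Te2/1', 'Te2/2']
    print(forward_multicast("10.1.5.9", "Te1/2", ["Te2/1", "Te2/2"]))  # [] (RPF fail)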

NACK not ACK
Almost every network protocol relies on data receivers ACKnowledging their receipt of data. But not here. Pricing systems only speak up when something goes missing.
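
Here's a minimal sketch of that behavior, assuming sequence-numbered messages and an invented NAK format and retransmit-request group: the receiver transmits nothing while data is flowing, and only speaks up to name the range it missed.

    import socket
    import struct

    NAK_GROUP, NAK_PORT = "239.9.9.9", 40001     # invented retransmit-request group
    nak_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def deliver(payload):
        pass  # hand the message to the application (not shown)

    expected = 1
    def on_message(seq, payload):
        global expected
        if seq > expected:
            # Gap detected: ask for a replay of [expected, seq - 1].
            # This is the only time the receiver ever transmits anything.
            nak_sock.sendto(struct.pack("!QQ", expected, seq - 1), (NAK_GROUP, NAK_PORT))
        expected = max(expected, seq + 1)
        deliver(payload)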

QoS Isn't The Answer
QoS might seem like the answer to make sure that we get through the day smoothly, but it's not. In fact, it can be counterproductive.

QoS is about managed un-fairness... Choosing which packets to drop. But pricing systems are usually deployed on dedicated systems with dedicated switches. Every packet is critical, and there are probably more of them than we can handle. There's nothing we can drop.

Making matters worse, enabling QoS on many switching platforms reduces the buffers available to our critical pricing flows, because the buffers necessarily get carved up so that they can be allocated to different classes of traffic. It's counterintuitive, but 'no mls qos' is sometimes the right thing to do.

Load Balancing Ain't All It's Cracked Up To Be
By default, CEF doesn't load balance multicast flows. CEF load balancing of multicast can be enabled and enhanced, but doesn't happen out of the box.

We can get screwed on EtherChannel links too: sometimes these quirky applications intermingle unicast data with the multicast stream. Perhaps a latecomer to the trading floor wants to start watching Cisco's stock price. Before he can begin, he needs all 100 cells associated with CSCO. This is sometimes called the "Initial Image." He ignores updates for CSCO until he's got that starting point loaded up.

CSCO has updated 9000 times today, so the server unicasts the initial image: "Here are all 100 cells for CSCO as of update #9000: blah blah blah...". Then the price changes, and the server multicasts update #9001 to all receivers.

If there's a load balanced path (either CEF or an aggregate link) between the server and client, then our new client could get update 9001 (multicast) before the initial image (unicast) shows up. The client will discard update 9001 because he's expecting a full record, not an update to a single cell.

Next, the initial image shows up, and the client knows he's got everything through update #9000. Then update #9002 arrives. Hey, what happened to #9001?
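
Here's a small sketch of the client-side logic described above, with invented message shapes. Because the client discards multicast updates that arrive before the unicast initial image, a reordered path hands you exactly this missing-#9001 problem.

    class Subscriber:
        def __init__(self):
            self.image = None      # the full 100-cell record, delivered by unicast
            self.last_seq = None

        def on_multicast_update(self, seq, cell, value):
            if self.image is None:
                return             # no initial image yet: discard (this is how #9001 vanishes)
            if seq != self.last_seq + 1:
                print(f"gap: expected {self.last_seq + 1}, got {seq} -- time to NAK")
            self.image[cell] = value
            self.last_seq = seq

        def on_initial_image(self, seq, cells):
            self.image = dict(cells)
            self.last_seq = seq    # "complete through update #9000"

    sub = Subscriber()
    sub.on_multicast_update(9001, "bid", 18.03)   # beat the unicast image across the fabric: dropped
    sub.on_initial_image(9000, {"bid": 18.02})    # the initial image finally lands
    sub.on_multicast_update(9002, "ask", 18.05)   # prints: gap: expected 9001, got 9002 -- time to NAK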

Post-mortem analysis of these kinds of incidents will boil down to the software folks saying:
We put the messages on the wire in the correct order. They were delivered by the network in the wrong order.

ARP Times Out
NACK-based applications sit quietly until there's a problem. So quietly that they might forget the hardware address associated with their gateway or with a neighbor.

No problem, right? ARP will figure it out... Eventually. Because these are generally UDP-based applications without flow control, the system doesn't fire off a single packet and then sit and wait like it might when talking TCP. No, these systems can suddenly kick off a whole bunch of UDP datagrams destined for a system they haven't talked to in hours.

The lower layers in the IP stack need to hold onto these packets until the ARP resolution process is complete. But the packets keep rolling down the stack! The outstanding ARP queue is only 1 packet deep in many implementations. The queue overflows and data is lost. It's not strictly a network problem, but don't worry. Your phone will ring.
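
A toy model of that failure, assuming a one-packet ARP hold queue: when a burst of UDP datagrams hits an unresolved next hop, everything past the first packet is silently discarded before it ever reaches the wire.

    from collections import deque

    ARP_HOLD_QUEUE_DEPTH = 1      # many stacks hold only one packet per unresolved next hop

    def send_burst_while_arp_pending(datagrams):
        hold_queue = deque()
        dropped = 0
        for dgram in datagrams:
            if len(hold_queue) >= ARP_HOLD_QUEUE_DEPTH:
                dropped += 1      # silently discarded; the application never hears about it
            else:
                hold_queue.append(dgram)  # will be sent once the ARP reply arrives
        return dropped

    burst = [f"price-update-{i}" for i in range(50)]
    print(send_burst_while_arp_pending(burst))    # 49: only one datagram survives to be queued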

Losing Data Causes You to Lose Data
There's a nasty failure mode underlying the NACK-based scheme. Lost data will be retransmitted. If you couldn't handle the data flow the first time around, why expect to handle wire speed retransmission of that data on top of the data that's coming in the next instant?

If the data loss was caused by a Bad Apple receiver, then all his peers suffer the consequences. You may have many bad apples in a moment. One Bad Apple will spoil the bunch.

If the data loss was caused by an overloaded network component, then you're rewarded by compounding increases in packet rate. The exchanges don't stop trading, and the data sources have a large queue of data to re-send.

TCP applications slow down in the face of congestion. Pricing applications speed up.

Packet Decodes Aren't Available
Some of the wire formats you'll be dealing with are closed-source secrets. Others are published standards for which no WireShark decodes are publicly available. Either way, you're pretty much on your own when it comes to analysis.

Updates
Responding to Will's question about data sources: The streams come from the various exchanges (NASDAQ, NYSE, FTSE, etc...) Because each of these exchanges uses its own data format, there are usually some layers of processing required to get them into a common format for application consumption. This processing can happen at a value-add data distributor (Reuters, Bloomberg, Activ), or it can be done in-house by the end user. Local processing has the advantage of lower latency because you don't have to have the data shipped from the exchange to a middleman before you see it.

Other streams come from application components within the company. There are usually some layers of processing (between 2 and 12) between a pricing update first hitting your equipment, and when that update is consumed by a trader. The processing can include format changes, addition of custom fields, delay engines (delayed data can be given away for free), vendor-switch systems (I don't trust data vendor "A", switch me to "B"), etc...

Most of those layers are going to be multicast, and they're going to be the really dangerous ones, because the sources can clobber you with LAN speeds, rather than WAN speeds.

As far as getting the data goes, you can move your servers into the exchange's facility for low-latency access (some exchanges actually provision the same length of fiber to each colocated customer, so that nobody can claim a latency disadvantage), you can provision your own point-to-point circuit for data access, you can buy a fat local loop from a financial network provider like BT/Radianz (probably MPLS on the back end so that one local loop can get you to all your pricing and clearing partners), or you can buy the data from a value-add aggregator like Reuters or Bloomberg.

Responding to Will's question about SSM:  I've never seen an SSM pricing component. They may be out there, but they might not be a super good fit. Here's why: Everything in these setups is redundant, all the way down to software components. It's redundant in ways we're not used to seeing in enterprises. No load-balancer required here. The software components collaborate and share workload dynamically. If one ticker plant fails, his partner knows what update was successfully transmitted by the dead peer, and takes over from that point. Consuming systems don't know who the servers are, and don't care. A server could be replaced at any moment.

In fact, it's not just downstream pricing data that's multicast. Many of these systems use a model where the clients don't know who the data sources are. Instead of sending requests to a server, they multicast their requests for data, and the servers multicast the replies back. Instead of:
<handshake> hello server, nice to meet you. I'd like such-and-such.
it's actually:
hello? servers? I'd like such-and-such! I'm ready, so go ahead and send it whenever...
Not knowing who your server is kind of runs counter to the SSM ideal. It could be done with a pool of servers, I've just never seen it.
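
For flavor, here's a sketch of that request pattern, with an invented group address, port, and message format: the client multicasts its request and never learns which server answers.

    import socket

    REQUEST_GROUP, REQUEST_PORT = "239.10.10.10", 45000   # invented request group

    def request_initial_image(symbol):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
        # "Hello? Servers?" -- the client never learns which ticker plant answers.
        s.sendto(f"INITIAL_IMAGE {symbol}".encode(), (REQUEST_GROUP, REQUEST_PORT))

    request_initial_image("CSCO")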

The exchanges are particularly slow-moving when it comes to changing things. Modern exchange feeds, particularly ones like the "touch tone" example I cited, are literally ticker-tape punch signals wrapped up in an IP multicast header.

The old school scheme was to have a ticker tape machine hooked to a "line" from the exchange.  Maybe you'd have two of them (A and B again). There would be a third one for retransmit. Ticker machine run out of paper? Call the exchange, and here's more-or-less what happens:

  • Somebody at the exchange cuts the section of paper containing the updates you missed out of their spool of ticker tape.  Actual scissors are involved here.
  • Next, they grab a bit of header tape that says: "this is retransmit data for XYZ Bank".
  • They tape these two pieces of paper together, and feed them through a reader that's attached to the "retransmit line."
  • Every bank in New York will get the retransmits, but they'll know to ignore them because of the header.
  • XYZ Bank clips the retransmit data out of the retransmit ticker machine, and pastes it into place on the end where the machine ran out of paper.
These terms ("tick," "line," "retransmit," etc...) all still apply with modern IP-based systems. I've read the developer guides for these systems (to write Wireshark decodes), and it's like a trip back in time. Some of these systems are still so closely coupled to the paper-punch system that you get chads all over the floor and paper cuts all over your hands just from reading the API guide :-)

23 comments:

  1. It is my favorite kind of networking :)

  2. Chris,

    Thanks a lot. That was a great post. I've only heard that trade networking is a whole new world. I've never had a decent explanation as to why.

    Where do these streams come from? WAN circuits? Point to point circuits? Internet circuits? Local to the datacenter?

    Does the trade floor use PIM-SM or have they upgraded to SSM? I'd expect multicast to be an afterthought if you were using SSM.

  3. One of the coolest things I've read in awhile.

  4. Great post, many thanks for the insight into this "strange" world!

  5. I've never encountered a PGM-based pricing application. I'd guess that they're out there, especially since someone from Tibco co-authored RFC 3208. But I've never seen one.

  6. Chris,

    Your post provides great insight into pricing and trading networks. Your point about QoS as the choice of which packets to drop is very appropriate. QoS is not a magical bandwidth fairy.

    Nice job. I'll be subscribing to your RSS feed.

    Jeff L.

  7. Hey Jeff,

    Thanks for the compliment.

  8. What about FIX protocol?

  9. SSM would be a nice move forwards for some trading applications, as the tree is built permanently from the subscriber and isn't timer based. We've seen very complicated issues with different (S,G) timeout values between 6500s and ASRs, inactive sources and MSDP meshes. (When an inactive source goes active again it's possible some of the PIM tree still exists and some of it doesn't, so the entire state doesn't get rebuilt until PIM join refreshes get sent.)

    Yes, you do end up needing to build a "software RP" for source discovery, and this software RP would need to time out inactive sources. But this timer exists only in one place, rather than on each and every forwarding device, which to my mind makes the implementation more robust.

  10. "What about FIX protocol?"

    FIX is a TCP-based protocol used for executing trades. Totally different animal.

    The "pricing" stuff includes the whole back-and-forth between all buyers and sellers in the market. Bids and offers can change all day without resulting in a single actual trade. And you have to watch all of that back-and-forth.

    FIX, on the other hand, only includes the actual trades communicated between you and your broker (or whoever).

    So, while it's important, and one of the areas where you'd want to stamp out latency, it's not hard or interesting to support FIX. It's just a TCP flow.

  11. @Sam Stickland

    SSM would be nice, but requires either:
    - an SSM enabled application (I've not encountered one in the finance world)
    - mapping of ASM -> SSM in the routers. This is possible, but requires router configuration to stay up-to-date with application changes. This can be a challenge.

    Most big financial shops are not interested in adding additional layers (SSM mapping) where they might make a mistake.

    I've run 6500s (sup720) and 4500s (supV-10GE) with many thousands of (S,G) entries without any problem related to PIM or TCAM programming.

  12. @chris Ack, I've never seen an SSM application in the finance world either. That doesn't mean that SSM isn't a better protocol ;)

    We have an average of 4,000 (S,G) entries here and it's generally all OK so long as the sources transmit regularly. But when they don't and the PIM tree starts to expire strange things will happen.

    A 6500 only checks something like 25% of the flows every minute (I forget the exact numbers), so different 6500s will expire the (S,G) entries at different times. When an inactive source starts up again, some 6500s will still have (S,G) state but some won't. Consider if a 6500 ('A') sends a PIM join to a 6500 ('B') that already has a valid (S,G) entry. 'B' _won't_ forward the PIM join onwards, even though the next switch ('C') in the line may have already expired its (S,G) entry, so the tree is now incomplete. Eventually 'B' will send the periodic PIM refresh to 'C' and the tree will build correctly.

    You don't normally notice this problem because the forwarding will still happen on the (*,G) tree until the (S,G) tree rebuilds. However, if you have multiple anycast RPs you don't have a fallback (*,G) tree between the RPs. The second RP (on receiving the SA) has to build an (S,G) tree back and if it can't build it then traffic isn't forwarded.

    We encountered this problem when the apps guys started doing multicast transmission back from desktop trading apps. If the app was closed for a short period and then restarted, there could be traffic loss for the multicast traffic it was sending (depending on the state of the tree between the RPs).

    Lots of different solutions of course (like increasing the (S,G) expiry time everywhere), but SSM doesn't have any of this sort of nonsense (using multiple independent timers to maintain forwarding state). The tree is just permanently nailed up from the receiver towards the source.

  13. Hey Sam,

    Yes, SSM is better.

    You're doing ASM->SSM mapping for desktop sources? Keeping track of desktops is tough for people who don't care about them (me). I guess you have robust processes :-)

    I haven't noticed the intermittently quiet source problem you describe. Probably because the vast majority of sources around me have been Tibco/RV, Reuters/RICH or the exchange feeds.

    Those sources will never stop: RV and RICH have a periodic no-op "hello" type message, and the exchange servers tend to start talking well before the opening bell, and then not stop.

    Funny story - the RV "hello" took down a large trading floor once. Tibco implemented the hello message in such a way that it calculated how many hellos should have been sent based on the daemon uptime. Whenever the count fell below the target (due to the passage of time), a hello got sent.

    Guess what happens if you move the system clock forward by a month? Suddenly, we're a month behind on sending out the 60 second periodic hello. A sniffer happened to be running for this hello-fest. It was awesome.

  14. Hey Chris,

    We're not doing ASM->SSM mapping. In the end a background daemon got added to the desktops to keep churning out a periodic hello even if the app wasn't running ;)

    That RV hello story sounds like a riot! We got bit recently by RV's hardcoded TTL of 16. 16 hops should be enough for anyone, right? lol.

  15. Very interesting - thanks for sharing.

  16. Excellent article. As a fellow financial networker, I can attest to the delicacies of this type of data, as well as the fun parts like the fire hose at the market open and close. These moments are eclipsed, however, by Fed announcements, and we benchmark all of our capacity around the Fed announcements. Do you ever find these moments presenting special challenges to trading floor environments that are not normally encountered during your run of the mill market open/close?

    Replies
    1. Other than the open, close, and Fed announcements you've noticed, the other large spike I've observed is the second open between 9:45 and 10:00.

  17. Hi Brandon, thank you for the compliment.

    The large environment I used to run was purpose-built, and generally performed acceptably. I didn't sweat the open/close/announcement craziness. I spent most of my time looking into failures like the MoldUDP nonsense I wrote up yesterday.

    Your point is well taken, though. The calm before the storm in the moments leading up to a Fed announcement is downright eerie. And the storm can be brutal.

  18. Excellent article, and a very interesting peek into the mcast/netadmin world of the trading markets.

  19. Hello!

    Many thanks for your excellent post!

    Obviously I have a question about it: do you know of any tool that can extract and graph throughput data on a per sub-second average, on Cisco hardware? I don't think any kind of SNMP based plotter will do it as I seem to recall that the SNMP RFC says that the counters should only be updated every 5 seconds or so.

    Thanks!

  20. Pere,

    The best tools here are sniffers with hardware-assisted capture (like an endace card) and Ethernet taps.

    You'll get lots of data, but it's possible to manage it. Using that data, you can make whatever graphs you want, with whatever level of granularity.

    If using lots of sniffers, look into getting an external timesource with PPS output. Cable all the sniffers together and time sync them with the PPS source for way-sub-millisecond timing precision.

    Plus, when things come off the rails, you'll be able to figure out exactly what went wrong. Packets from the wire represent *truth*.

    I don't put too much faith in port mirroring (SPAN) when device performance is in question, and especially when it comes to multicast data. Too many times I've seen the SPAN port make an unreliable claim about packet delivery.

  21. Very interesting stuff! I've never worked with trading networks, but you never know ;-)
