Tuesday, January 3, 2012

Dispatches From The Trading Floor - MoldUDP

My pricing networks post has gotten a lot of feedback. Because of its popularity, I've decided to write up a case study detailing one of the interesting problems I was asked to solve.

The Incident
One morning around 10:00, a pricing support guy cornered me in the hallway: "Hey, did something happen at 9:34 this morning? We lost some data on the NASDAQ ITCH feed... Did you notice anything?"

When I got back to my desk, I found that pricing support had left some feed handler logs in my inbox. The logs explained that three consecutive pricing updates had been lost, and attributed the problem to "network loss" or somesuch. An incident had been opened, and I needed to get to the bottom of it.

Background
At that time, the NASDAQ ITCH data feed was delivered as a stream of IP multicast packets containing UDP datagrams. Inside those UDP datagrams was a protocol known as MoldUDP.

MoldUDP is a simple encapsulation protocol for small messages which are intended to be delivered in sequential order. It assigns a sequence number to each message, prefaces each message with a two-byte message length field, and then packs the messages into a MoldUDP packet using an algorithm that balances latency (dispatch the message NOW!) with efficiency (gee, there's still room in this packet, any more messages coming?). There were usually between 1 and 5 messages per packet in this environment.

The MoldUDP packet header includes:
  • The sequence number of the first message in this packet.
  • The count of messages in this packet.
  • A "session ID" which allows receiving systems to distinguish multiple MoldUDP flows from one another.
The resulting packet looks something like this:
Downstream MoldUDP packet format
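
For readers who want to poke at a capture themselves, here is a minimal Python sketch of unpacking a downstream packet. It assumes the field layout published for MoldUDP64 (10-byte session, 8-byte sequence number, 2-byte message count, all big-endian, followed by length-prefixed messages). The older MoldUDP discussed in this post may use different field widths or byte order, so treat the `HEADER` format string as an assumption. `parse_packet` and `build_packet` are illustrative names, not part of any library.

```python
import struct
from typing import List, NamedTuple

# Assumed MoldUDP64 layout: session (10 bytes), sequence (8), count (2), big-endian.
HEADER = struct.Struct(">10sQH")
LENGTH = struct.Struct(">H")   # two-byte length prefix in front of every message
END_OF_SESSION = 0xFFFF        # assumed MoldUDP64 convention: no messages follow


class MoldPacket(NamedTuple):
    session: bytes
    first_seq: int   # sequence number of the first message in the packet
    count: int       # number of messages in the packet
    messages: List[bytes]


def parse_packet(payload: bytes) -> MoldPacket:
    """Split one UDP payload into its header fields and message blocks."""
    session, first_seq, count = HEADER.unpack_from(payload, 0)
    messages: List[bytes] = []
    if count == END_OF_SESSION:
        return MoldPacket(session, first_seq, count, messages)
    offset = HEADER.size
    for _ in range(count):
        (length,) = LENGTH.unpack_from(payload, offset)
        offset += LENGTH.size
        if offset + length > len(payload):
            raise ValueError("message runs past the end of the packet")
        messages.append(payload[offset:offset + length])
        offset += length
    return MoldPacket(session, first_seq, count, messages)


def build_packet(session: bytes, first_seq: int, messages: List[bytes]) -> bytes:
    """Inverse of parse_packet, handy for generating test data."""
    body = b"".join(LENGTH.pack(len(m)) + m for m in messages)
    return HEADER.pack(session.ljust(10), first_seq, len(messages)) + body


if __name__ == "__main__":
    raw = build_packet(b"SESSION01", 100, [b"abc", b"de"])
    print(parse_packet(raw))
```

The last message in a packet carries sequence number `first_seq + count - 1`, which is the detail that matters later in this story.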

The key things to remember about MoldUDP for this story are:
  • Each message is assigned a unique number
  • Multiple messages can appear in a single packet
Every morning, the MoldUDP sequence starts at zero. As the day goes by, the message numbers go up. I like to visualize the message stream like this:
MoldUDP message stream derived from sniffer capture
The plot shows the MoldUDP sequence numbers received vs. time. The slope of the graph indicates the message rate. In this case, we got about 17,000 messages in about 2.5 seconds: roughly 6,800 messages per second overall, with a couple of little blips and dips in the rate (slope).
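
The slope arithmetic is simple enough to sketch in Python. This assumes the capture has already been reduced to `(time, sequence)` samples; the function names and the sample values are made up for illustration and are not taken from the original capture.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (capture time in seconds, message sequence number)


def overall_rate(samples: List[Sample]) -> float:
    """Messages per second between the first and last sample."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    return (s1 - s0) / (t1 - t0)


def windowed_rates(samples: List[Sample], window: float) -> List[Tuple[float, float]]:
    """Slope over consecutive stretches of at least `window` seconds.

    Returns (start_time, messages_per_second) pairs, which is what the
    blips and dips in the plot correspond to.
    """
    rates = []
    start_t, start_s = samples[0]
    for t, s in samples[1:]:
        if t - start_t >= window:
            rates.append((start_t, (s - start_s) / (t - start_t)))
            start_t, start_s = t, s
    return rates


if __name__ == "__main__":
    samples = [(0.0, 0), (1.0, 6000), (2.5, 17000)]
    print(overall_rate(samples))        # 17000 messages / 2.5 s = 6800.0
    print(windowed_rates(samples, 1.0))
```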

I created these plots from sniffer data using the Perl Net::PcapUtils and NetPacket::UDP libraries, along with some MoldUDP-fu of my own. The data pictured here is not the data from the incident (I don't have it anymore), but it illustrates the same problem.


Diversity
As I explained in the previous post, pricing networks don't just have redundancy, they have diversity.

Accordingly, the ITCH feed is delivered to consuming systems twice. The two copies of the data come from different NASDAQ systems, on different multicast groups, over different transit infrastructure, to different NICs on the receiving systems.

For the following image, I've "zoomed in" on the same data. Now we can clearly see that there are actually two message streams: one plotted in white, and one plotted in blue:
Redundant MoldUDP message streams
It may look like the blue stream is "below" the white one, but it's not. The blue stream is "after" the white stream. Blue represents a redundant copy of the data that took a longer path from NASDAQ to the servers. Shift that blue stream to the left (earlier) by a few milliseconds, and the streams should overlap perfectly.

Batching
Taking an even closer look, we can see that each line is actually composed of discrete elements:
Individual Packets
Each white mark in that picture represents a single MoldUDP packet containing several messages. The length and positioning of the packet along the "message number" axis indicate exactly which messages are in the packet. Long marks indicate packets containing many messages. Short marks represent packets containing fewer messages.

Handling Data Loss
MoldUDP includes a retransmission capability so that receiving systems are allowed to request that lost data be resent. Rather than requesting the data from the source server, the receivers are configured to use a set of dedicated retransmission servers. It's not generally expected that the retransmission capability should be used because:
  • The latency might be unacceptable.
  • Retransmissions are unicast - sending retransmissions for dozens of unhappy receivers puts WAN links at risk.
  • Everything is redundant and diverse anyway -- we shouldn't have this problem.
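
For reference, a retransmission request is tiny. The sketch below assumes the MoldUDP64 request layout (session, first sequence number wanted, number of messages wanted, using the same 10/8/2-byte big-endian fields as the downstream header). The older MoldUDP may differ, and `build_rerequest` is a made-up name.

```python
import struct

# Assumed MoldUDP64 request layout: session (10), sequence (8), count (2), big-endian.
REQUEST = struct.Struct(">10sQH")


def build_rerequest(session: bytes, first_missing: int, count: int) -> bytes:
    """Build the payload asking for `count` messages starting at `first_missing`."""
    return REQUEST.pack(session.ljust(10), first_missing, count)


if __name__ == "__main__":
    payload = build_rerequest(b"SESSION01", 31884700, 3)
    print(len(payload), payload.hex())
```

The payload would be sent as a unicast UDP datagram to one of the retransmission servers, which is exactly why a crowd of receivers all asking at once is a worry for the WAN links.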


Stream Arbitration
The receiving systems know the highest sequence number they've seen, and they're always looking for the next number in the sequence (highest + 1).

Realities of geography mean that the data stream from NASDAQ's "A" site should always arrive at the feed handler NIC before the copy of the same data comes from NASDAQ's "B" site, but the receivers don't know about the geography. They have no expectation about which stream will deliver the next interesting packet. They just inhale the multicast stream arriving at each NIC (diversity!) and with each packet's arrival the messages either get processed because they're new, or trashed because the messages have been seen already.

The receivers trash a lot of data, half of it, in fact. Every message delivered in blue packets that we've seen so far in these diagrams would be trashed because it is a duplicate of a previously-seen-and-processed message which arrived in a white packet.
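
The process-or-trash idea can be sketched in a few lines of Python. This is an illustration of the concept as described above, deciding message by message, and not the code of any real feed handler. The `Arbiter` class and its gap handling (note the gap and skip ahead) are assumptions made for the example.

```python
from typing import List, Tuple


class Arbiter:
    """Merges packets from both feeds into one in-order message stream."""

    def __init__(self, next_seq: int = 0) -> None:
        self.next_seq = next_seq          # highest sequence seen + 1
        self.processed = 0
        self.trashed = 0
        self.gaps: List[Tuple[int, int]] = []

    def on_packet(self, first_seq: int, count: int) -> range:
        """Return the sequence numbers in this packet that are new."""
        end = first_seq + count           # one past the last message in the packet
        if first_seq > self.next_seq:
            # Messages next_seq..first_seq-1 never arrived on either feed.
            self.gaps.append((self.next_seq, first_seq - 1))
            self.next_seq = first_seq
        fresh = range(max(first_seq, self.next_seq), max(end, self.next_seq))
        self.processed += len(fresh)
        self.trashed += count - len(fresh)
        self.next_seq = max(self.next_seq, end)
        return fresh


if __name__ == "__main__":
    arb = Arbiter()
    # The same ten messages arrive on "A" first, then on "B", batched differently.
    arrivals = [("A", 0, 3), ("B", 0, 2), ("A", 3, 2), ("B", 2, 3),
                ("A", 5, 5), ("B", 5, 4), ("B", 9, 1)]
    for feed, first_seq, count in arrivals:
        arb.on_packet(first_seq, count)
    print(arb.processed, arb.trashed, arb.gaps)
```

With both feeds healthy, `processed` and `trashed` come out equal: every message is accepted once and discarded once.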

The Symptom
The feed handler error logs indicated that the whole population of servers in one data center didn't receive three specific consecutive MoldUDP messages. Both streams were functional, and the many (many!) drop counters in the path did not indicate that there'd been any loss.

Servers in the other data center had no problems. Servers in the test environment also had no problems.

Analysis
I pulled a couple of minutes of data from each of four sniffer points:
  • The problem site's "A" feed handoff
  • The problem site's "B" feed handoff
  • The good site's "A" feed handoff (the "B" feed here was down because of a circuit failure)
  • The test site's "B" feed (no "A" feed in test because of budget constraints)
Picking through the captures I was able to identify the "missing" data at each of the four sniffer points. Not only had the data been delivered to the site which logged the errors, it was delivered to that site twice. Both feeds had delivered the data intact.

Interestingly, the missing data had been batched differently (this is not uncommon) by the two head-end servers:
  • The "A" server put these 3 messages into the end of a large MoldUDP packet, along with earlier messages that had been received correctly.
  • The "B" server batched these 3 messages into two different packets: one contained only the first missing message, the other contained the remaining two messages, plus a third message that had not been a problem.

Bad NASDAQ Server
The head-end servers responsible for this data feed had a nasty habit. Every now and then, one of them would just stop transmitting data. After 100 or 200ms or so the server would start back up.

When this freezing happens, no data gets lost. All the data gets delivered, but it gets delivered fast as the service "catches up" with real time. In 100ms we'd usually get hundreds of packets containing thousands of messages. When the blue server locks up there is no problem. His data was going to be trashed anyway. When the white server locks up, funny things happen. Here's what that looks like:
100ms of silence from the primary site
Remember that slope represents message rate. The slope of the white line hits around 200,000 messages per second (up from 6800) during the catch-up phase. Yikes.

I know that the problem here is on the NASDAQ end, and not in the transit path, because of the message batching. Usually we get between 1 and 5 messages per packet. Message batching during the catch-up interval was closer to 70 messages/packet. Only the source server (NASDAQ) could have done this. Network equipment in the transit path can't re-pack messages into fewer packets.
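
The batching check lends itself to a quick script. The Python sketch below assumes the capture has been reduced to `(timestamp in milliseconds, message count)` pairs, one per packet. The function names, the 100ms window and the threshold of 20 messages/packet are arbitrary choices for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Packet = Tuple[int, int]  # (capture time in milliseconds, messages in the packet)


def batching_by_window(packets: Iterable[Packet], window_ms: int) -> Dict[int, float]:
    """Average messages per packet for each window, keyed by window start time."""
    totals = defaultdict(lambda: [0, 0])  # window start -> [messages, packets]
    for ts_ms, count in packets:
        start = (ts_ms // window_ms) * window_ms
        totals[start][0] += count
        totals[start][1] += 1
    return {start: msgs / pkts for start, (msgs, pkts) in sorted(totals.items())}


def catch_up_windows(packets: Iterable[Packet], window_ms: int,
                     threshold: float) -> List[int]:
    """Window start times whose batching ratio exceeds the threshold."""
    ratios = batching_by_window(packets, window_ms)
    return [start for start, avg in ratios.items() if avg > threshold]


if __name__ == "__main__":
    packets = [(10, 3), (50, 5), (120, 70), (170, 72), (230, 4)]
    print(batching_by_window(packets, 100))
    print(catch_up_windows(packets, 100, threshold=20))
```

A window where the ratio jumps from single digits to dozens points at the sender, since nothing in the transit path repacks messages.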

Closer view of the problem
When the primary NASDAQ server stopped talking around the 41.17 time mark, receivers were expecting message number ~31883100. It didn't arrive until the blue stream sent a packet containing that message around 20ms later. At this point, receivers stopped trashing blue data, and started processing messages from the blue stream.

Then, for about 100ms, servers received only "blue" data. Next, at 41.27, the backlog of "white" data started screaming in. Most of it was garbage (having already been delivered by the "blue" source) until we get to sequence ~31884700. At this point, the stream arbitration mechanism should switch back from "blue" data to "white" data. Here's a closeup of that moment:
Takeover of primary data stream
This is where things come off the rails. Note how large the white packets (~70 messages each) are when compared with the blue packets (~5 messages each). After the white stream is "caught up" to real time, the batching rate drops down to the usual ~5 messages/packet.

What Went Wrong?
Same picture as above, with extra color

The stream arbitration mechanism should have dropped the blue stream, and picked up the white stream beginning with the packet that I've painted red. It didn't. Instead, the feed handler (a commercial product) was making the process-or-trash decision for each packet based solely on the first message sequence number from the MoldUDP header. The possibility that a packet might begin with old data but also contain some fresh data hadn't been considered, but that's what happened here.

The red packet began with a sequence that had already been processed on the blue stream, so the feed handler trashed it. Next, another "white" packet arrived. This packet began with a sequence much higher than expected. Clank! Sproing! Gap detected, alarm bells rang, log files were written, etc...

The "missing" data was actually present in the top half of that red packet, and then was delivered again a short time later in a series of "blue" packets. Rotten stream arbitration code in the feed handler was the whole problem here.

No matter. The application said "network packet loss" so the problem landed in my lap :)

I worked with the software vendor to get an enhancement -- now they check the sequence number and the message counter in each packet before trashing it. I'm guessing that things were implemented this way because an earlier version of MoldUDP didn't contain a message counter. With this previous version, the only way to determine exactly which messages appeared in a given packet was to walk through the packet from beginning to end. Yuck.
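
The difference between the two behaviours fits in a short Python sketch. This is a reconstruction from the symptoms described above, not the vendor's code; the function names and the sequence numbers are invented for illustration.

```python
from typing import List, Tuple

Packet = Tuple[int, int]   # (first sequence number, message count)
Gap = Tuple[int, int]      # (first missing sequence, last missing sequence)


def first_seq_only(packets: List[Packet], next_seq: int) -> Tuple[int, List[Gap]]:
    """Process-or-trash based solely on the header's first sequence number."""
    gaps = []
    for first_seq, count in packets:
        if first_seq < next_seq:
            continue                      # whole packet trashed, fresh tail included
        if first_seq > next_seq:
            gaps.append((next_seq, first_seq - 1))
        next_seq = first_seq + count
    return next_seq, gaps


def seq_and_count(packets: List[Packet], next_seq: int) -> Tuple[int, List[Gap]]:
    """Also consult the message count, so a packet's fresh tail is kept."""
    gaps = []
    for first_seq, count in packets:
        end = first_seq + count           # one past the last message in the packet
        if end <= next_seq:
            continue                      # every message already seen
        if first_seq > next_seq:
            gaps.append((next_seq, first_seq - 1))
        next_seq = end                    # messages next_seq..end-1 are processed
    return next_seq, gaps


if __name__ == "__main__":
    # "Blue" has delivered everything up to 104, so 105 is expected next.
    # The "red" white packet covers 100..107; the next white packet starts at 108.
    white = [(100, 8), (108, 5)]
    print(first_seq_only(white, next_seq=105))   # (113, [(105, 107)])
    print(seq_and_count(white, next_seq=105))    # (113, [])
```

The first version reports three consecutive messages missing even though they were sitting in the tail of the packet it threw away, which matches what the feed handler logged.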

No Problem in the Other Sites
I'd previously said that only one of three environments had a problem. This was because the other environments weren't doing stream arbitration: The test environment only had one data stream because of cost concerns. The alternate production environment was running with only the one stream because of a circuit problem. These other sites didn't notice any problem because they never switched from one feed to the other.

28 comments:

  1. Incredible write up and diagnosis mate. I hope they're paying you a lot of money!

  2. Thanks for the positive feedback. It was a long post on a dull topic. I'm a little surprised anybody stuck with me all the way to the comment box!

    @dhanakane The company laid me off a few months after this incident :-)

  3. Super Post. I've never been walked through troubleshooting process like that. Loved it.

    So...uh, what are the chances you can show us in detail how you made those graphs?

  4. great find. The company is retarded

  5. This is the actual feed, but what about the management/control traffic? That is also multicast, right? Are all these trading systems serial connections? If yes, how do they get Ethernet interfaces?

  6. Hey, Anon.
    I'm not sure I follow your question. This post is mostly about MoldUDP, an encapsulation protocol comprised *mostly* of downstream packets, which are what I've discussed here.

    MoldUDP itself doesn't really have control traffic, in the same way that TCP doesn't have control traffic.

    MoldUDP does have retransmission capability (covered in my next post), and it does have some clocking/syncing/noop functions, but neither of those represent back-and-forth interaction between the origin server and a client.

    The various exchange feeds (ITCH 3.0 in this example) may have control traffic, but not related to realtime data. The challenges there are getting the systems primed with "initial records" prior to the opening bell at 9:30 in the morning.

    The exchanges don't generally interact with pricing feed customers much. They turn on the firehose, and don't concern themselves with having completed any sort of handshake with you.

    It's a different story within a customer's "ticker plant" environment. That place is very interactive. Workstations go on and offline, specific securities are subscribed to and unsubscribed from... it's a totally different animal. It is common for both upstream (subscription requests) and downstream (pricing data) to be multicast in these environments. In the comments to my "up is down" post, I explained that this is why SSM probably isn't a good fit on the trading floor.

    The feed I've identified in this case is an IP data feed. It can be delivered over any medium that's fast enough, and has a mechanism for encapsulating IP packets. If it's delivered to your office on a serial line (maybe a DS3?), then you'll have a router with two interfaces talking IP: the serial interface is plugged into the telco, and the Ethernet interface plugged into your LAN.

  7. I had RMDS DTS in mind, limited background in this field so not sure what category this falls under but I know it has a control/management traffic using multicast.

  8. Ahh, gotcha. So, in this post I was focusing on the Exchange -> Feed Handler portion. RMDS lives between the Feed Handler and the data consumers. You're right, that in the RMDS world, both upstream and downstream data are multicast (or broadcast). The specific details depend on which components are in use, and which layer of the sandwich you're talking about.

  9. you stated...
    "the ITCH feed is delivered to consuming systems twice. The two copies of the data come from different NASDAQ systems, on different multicast groups, over different transit infrastructure, to different NICs on the receiving systems."

    This statement is true for almost all other feeds; however, for tvitch it is not true. The redundant stream from Nasdaq for tvitch is actually a copy of the data from their main stream. It does *not* come from a different system. Furthermore, the secondary stream is routed separately but is approx. 700ms behind the primary data stream, making it worthless.

    Other than that, this is a fine write up

  10. Hi Anon,

    I don't grok the distinction you're making here. Follow any of these exchange feeds back far enough and you'll find that they all boil down to two copies of the same data.

    In the case of the data that I've presented here, we've got a single message stream duplicated into two *unique* IP packet streams. The packet streams appear to originate from 206.200.244.100 and 206.200.246.100. Now, I suppose that these two IP addresses could be NATs, or they could be two interfaces on the same server, but it doesn't make any practical difference. From an IP processing point of view, they're different streams, not duplicates, as indicated by the batching nonsense.

    On the latency front, your comments don't agree with my packet captures. The captures are > 4 years old, so maybe things have changed?

    The time delay between "A" feed and "B" feed is well under 20ms, and never gets anywhere near 700ms. Have a look at the 5th image in this post. Message 31882000 was delivered by "A" at about 41.0900, and then by "B" at 41.0950. This data was collected ~200 miles from NYC, so there's little chance that my "A" network path was 700ms longer than the "B" path :-)

    The "A" vs "B" horse race that I've presented here seems to contradict both of your points. The fact that "B" sometimes overtakes "A" suggests that these streams operate somewhat independently (rather than "B" being a fork of the "A" customer feed), and also shows that the "B" stream has value, because sometimes it overtakes the "A" stream.

    I concede that things have probably changed since I last looked at it.

    Replies
    1. I'll confirm that the difference between the A & B feeds should be under ~20ms, as the feeds are published from the primary and secondary datacenters respectively. If one has the most efficient paths available, the latency between the two feeds could probably be as good as ~7-8 ms.

      As far as the difference in mold packets with this particular issue, each itch host reads a stream of data, processes it and ships it out (open the pipe up). If a particular host gaps the source stream (in this case the white line's gap and subsequent flood), it will rewind the gap and process messages in-order, just slightly delayed.

      As far as the differences between the two feeds, mold makes the decision of how to encapsulate the messages and ship them out. Presumably in this case, there was a gap on the white itch feed, and it was filled, providing full write buffers for the mold writer to pump out the data.

    2. It would be improper to assume that the A & B feeds would have identical packetization with the header rewritten to come from a different source. The ideal consumption of this data would process the mold layer independently from the itch messages.

    3. Hello Feb 5th Anon :-)

      We're in agreement about how stuff works here, both on the mold packet packing issue, and about making assumptions about the similarity of the feeds.

      The bummer is that two different organizations let me down here: NASDAQ, for having an unreliable service (this stopping and starting business happened regularly -- every few minutes, or perhaps every minute), and the software house (marsupials, they) for wrongly assuming that they could discard an incoming packet based only on the first mold sequence in the packet.

      There's a third party causing problems here too - the WAN provider who was re-ordering mold packets in flight, as I detailed in the next post.

  11. Many thanks again for an interesting read. Yes, interesting because it's a situation I'm not familiar with but also because there aren't many people writing about real life troubleshooting. I guess because they think it's boring... Not so!

  12. "The captures are > 4 years old, so maybe things have changed?" yes, things have changed drastically with tvitch. Mostly due to Nasdaq's move to the new data centers. They decided to do things on the cheap. The "B" feed is a packet copy of the "A" that is made several hops downstream within Nasdaq; then routed to their backup data center; then sent out on the wire. Minimum over 700 millis behind the "A" feed. Nasdaq is the worst

    Replies
    1. Interesting, thanks for following up!

      Do they re-write the IP header at all between these feeds? Source? Destination? Both?

      I guess I shouldn't be surprised about the changes. Things were changing rapidly with this feed in 2008. It felt like we had a new version or a new bandwidth requirement every week!

      I certainly share your opinions about the exchange, and this feed in particular. The servers performed badly as I demonstrated, the encapsulation is inconsistent, the underlying protocol is clunky, and they sent the entire feed in a single multicast stream, rather than segmenting it into parallel streams like most every other data feed.

      Between the clunky exchange offering and the clunky (marsupial-based) COTS feed handler, this turd was the worst feed of the many I used to handle.

  13. Absolutely amazing and super well-written case story!
    Chris - I just love the way you "provoke" me while I read, just to find out that you are right all the way.

  14. I enjoyed reading this through and through. Thanks for writing it up.

  15. This comment has been removed by the author.

  16. Awesome blog! Thanks for posting this. As an EE interested in getting a real time feed, since Yahoo doesn't supply real time data subscription any more, I'm looking to get data from NASDAQ. So I have been fishing on the web for info. Been looking into coding for UDP and decoding the packets. Do you know if mouldUDP64 has staying power, or is there some other type of data stream that is up and coming? Does Nasdaq actually provide non-professional subscriptions to data feeds, or is that a joke? Is it possible to subscribe to be a California or even out of the country? Do you know if it is possible to subscribe to be a multicast client in the west coast?

  17. Hey spiritrig,

    > Do you know if mouldUDP64 has staying power, or is there some other type of data stream that is up and coming?

    I know of at least one exchange product under development that will use MoldUDP64.

    > Does Nasdaq actually provide non-professional subscriptions to data feeds, or is that a joke?

    They're in the business of selling this data. I'd think they'd sell it to anyone.

    > Is it possible to subscribe to be a California or even out of the country?

    Definitely. You need two things, a circuit and a data subscription. The circuit can either be a point-to-point connection between you and the exchange, or it can come from a provider like Radianz or Savvis, who already have the various data feeds replicating around their network, so you'll only have to pay for your half of the connectivity.

    I expect that this will be spectacularly expensive. NASDAQ may even require you to implement an entitlement system which "proves" (to their satisfaction, anyway) that you're not further redistributing the data.

  18. Hi chris, thanks for responding so quickly.

    It looks like all of these data feed IP addresses use MoldUDP64.
    http://www.nasdaqtrader.com/Trader.aspx?id=FeedMIPS

    It looks like Nasdaq Last Sale data is $0.60 a month. That's why I ask if it is a joke, 60 cents a month, but the guidance for non-professionals implies I have to go to a distributor. If I could get it for 60 cents a month from nasdaq, why would a distributor waste time that is more valuable than that without a huge mark up? I have a feeling nasdaq will say get it from a distributor, and the distributor won't charge 60 cents a month. It started out sounding like data would be democratized, but is a middleman required?
    http://www.nasdaqtrader.com/Trader.aspx?id=DPUSData#ls
    Below that it quotes a $1,500 distributor fee.

    As a person pursuing a hobby and not trading, I think I certainly qualify as a non-professional.
    Data News #2007 - 33 NASDAQ Provides Guidance for Non-Professional Usage
    http://www.nasdaqtrader.com/Tradernews.aspx?id=nva2007-033

    I am not sure I understand what you mean by point-to-point connection. Wouldn't that be moldUDP64? My understanding is UDP is point to point. I would think the simplest subscription receivable would be instructions on how to login or authenticate myself to receive the feed—specification and password for the authentication (login) packet to send.

    Replies
    1. I can't help at all on the questions of procuring the data from a contract perspective. I've never done that stuff.

      By "point to point" circuit, I mean a wire with two ends: One end plugged into a router at your location, the other end plugged into a router in New Jersey somewhere. Of course it's not *really* going to be a wire - the telco you buy it from will have done some magic so that the middle portion is carried over their existing infrastructure.

      Maybe there's a hobbyist option with which I'm not familiar. I've only ever done this work for the sort of companies who tend to get dragged into congressional hearings, never at small scale.

      I'm guessing that the subscription fees you're seeing assume you already have a platform for consuming the data. This would be an add-on to your Bloomberg terminal, for example.

      Take all of this with a grain of salt - I have no idea what I'm talking about :)

    2. OK, I'll lick the grains of salt off my margarita while I wait for Telegraphicos de Mexico to hook up a telegraph line to New Jersey after paying the bribe to show what a dumb gringo I am.

      I got a good hint here. "For market-data distribution, two methods are common: UDP (User Datagram Protocol) Multicast for collocated customers, and TCP (Transmission Control Protocol) for noncollocated customers."
      http://queue.acm.org/detail.cfm?id=2536492

      So since I am not co-located, TCP is the way. Just have to sign some papers, get'm to take the money, and if I am successfully subscribed, sip from the fire hose and code away.

  19. This comment has been removed by the author.

  20. Hi Chris, this is a great post. I'm looking for some answers about MoldUDP: do you know what the differences are between MoldUDP and MoldUDP64? Thanks in advance
