One morning around 10:00, a pricing support guy cornered me in the hallway: "Hey, did something happen at 9:34 this morning? We lost some data on the NASDAQ ITCH feed... Did you notice anything?"
When I got back to my desk, I found that pricing support had left some feed handler logs in my inbox. The logs explained that three consecutive pricing updates had been lost, and attributed the problem to "network loss" or somesuch. An incident had been opened, and I needed to get to the bottom of it.
At that time, the NASDAQ ITCH data feed was delivered as a stream of IP multicast packets containing UDP datagrams. Inside those UDP datagrams was a protocol known as MoldUDP.
MoldUDP is a simple encapsulation protocol for small messages that are intended to be delivered in sequential order. It assigns a sequence number to each message, prefaces each message with a two-byte message length field, and then packs the messages into a MoldUDP packet using an algorithm that balances latency (dispatch the message NOW!) against efficiency (gee, there's still room in this packet, any more messages coming?). There were usually between 1 and 5 messages per packet in this environment.
The MoldUDP packet header includes:
- The sequence number of the first message in this packet.
- The count of messages in this packet.
- A "session ID" which allows receiving systems to distinguish multiple MoldUDP flows from one another.
|Downstream MoldUDP packet format|
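For the packet-dissection-inclined, here's a minimal sketch of a decoder for that format. I've used MoldUDP64-style field widths (10-byte session, 8-byte big-endian sequence number, 2-byte message count); the exact widths on the wire at the time may have differed, so treat the unpack template as illustrative.

```perl
# Sketch of a MoldUDP downstream packet decoder. Field widths are the
# MoldUDP64 ones (10-byte session, 8-byte big-endian sequence number,
# 2-byte message count); treat the template as illustrative.
# Note: the 'Q>' template needs a 64-bit perl (5.10 or later).
sub decode_moldudp {
    my ($payload) = @_;

    my ( $session, $first_seq, $count ) = unpack 'A10 Q> n', $payload;
    my $rest = substr $payload, 20;    # skip the fixed 20-byte header

    # Each message block is a 2-byte length followed by that many bytes.
    my @messages;
    for ( 1 .. $count ) {
        my $len = unpack 'n', $rest;
        push @messages, substr( $rest, 2, $len );
        $rest = substr( $rest, 2 + $len );
    }
    return ( $session, $first_seq, \@messages );
}
```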
Two things stand out in the plot below:
- Each message is assigned a unique sequence number.
- Multiple messages can appear in a single packet.
|MoldUDP message stream derived from sniffer capture|
I created these plots from sniffer data using the Perl Net::PcapUtils and NetPacket::UDP libraries, along with some MoldUDP-fu of my own. The data pictured here is not the data from the incident (I don't have it anymore), but it illustrates the same problem.
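The capture-and-decode step looked something like the sketch below: it reads back a saved sniffer file and emits one plot point per packet. It leans on the decode_moldudp sub from the sketch above, and both the filename and the filter are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Net::PcapUtils;
use NetPacket::Ethernet qw(:strip);
use NetPacket::IP qw(:strip);
use NetPacket::UDP;

# For each captured frame: strip Ethernet and IP, decode the UDP
# datagram, then hand its payload to decode_moldudp() from the
# sketch above.
sub got_packet {
    my ( $user, $hdr, $pkt ) = @_;
    my $udp = NetPacket::UDP->decode( ip_strip( eth_strip($pkt) ) );
    my ( $session, $first_seq, $msgs ) = decode_moldudp( $udp->{data} );

    # One plot point per packet: arrival time, session, first
    # sequence number, message count.
    printf "%d.%06d %s %d %d\n", $hdr->{tv_sec}, $hdr->{tv_usec},
        $session, $first_seq, scalar @$msgs;
}

# Read back a saved sniffer capture rather than sniffing live.
Net::PcapUtils::loop( \&got_packet,
    SAVEFILE => 'itch-handoff.pcap',
    FILTER   => 'udp and port 26477',
);
```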
As I explained in the previous post, pricing networks don't just have redundancy, they have diversity.
Accordingly, the ITCH feed is delivered to consuming systems twice. The two copies of the data come from different NASDAQ systems, on different multicast groups, over different transit infrastructure, to different NICs on the receiving systems.
For the following image, I've "zoomed in" on the same data. Now we can clearly see that there are actually two message streams: one plotted in white, and one plotted in blue:
|Redundant MoldUDP message streams|
Taking an even closer look, we can see that each line is actually composed of discrete elements: individual packets, each carrying a handful of messages.
Handling Data Loss
MoldUDP includes a retransmission capability, so receiving systems can request that lost data be resent. Rather than requesting the data from the source server, the receivers are configured to use a set of dedicated retransmission servers. The retransmission capability generally isn't expected to be used, because:
- The latency might be unacceptable.
- Retransmissions are unicast - sending retransmissions for dozens of unhappy receivers puts WAN links at risk.
- Everything is redundant and diverse anyway -- we shouldn't have this problem.
The receiving systems know the highest sequence number they've seen, and they're always looking for the next number in the sequence (highest + 1).
Realities of geography mean that the data stream from NASDAQ's "A" site should always arrive at the feed handler NIC before the copy of the same data from NASDAQ's "B" site, but the receivers don't know about the geography. They have no expectation about which stream will deliver the next interesting packet. They just inhale the multicast stream arriving at each NIC (diversity!), and with each packet's arrival the messages either get processed (they're new) or trashed (they've been seen already).
The receivers trash a lot of data: half of it, in fact. Every message delivered in a blue packet in the diagrams so far would be trashed, because it duplicates a previously-seen-and-processed message that arrived in a white packet.
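In Perl-flavored pseudocode, the arbitration rule amounts to one counter applied to messages from both NICs. This is my own sketch (process and request_retransmit are hypothetical helpers), not the feed handler's code:

```perl
# One counter arbitrates both streams: it doesn't matter which NIC a
# message arrived on, only whether its sequence number is new.
my $next_seq = 1;    # next sequence number we're waiting for

sub handle_message {
    my ( $seq, $payload ) = @_;

    if ( $seq == $next_seq ) {
        process($payload);    # new data (hypothetical helper)
        $next_seq++;
    }
    elsif ( $seq < $next_seq ) {
        # Duplicate, almost certainly from the slower stream: trash it.
        return;
    }
    else {
        # Missing on both streams: ask a retransmission server for the
        # gap ($next_seq through $seq - 1) before going any further.
        request_retransmit( $next_seq, $seq - $next_seq );    # hypothetical
    }
}
```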
The feed handler error logs indicated that the whole population of servers in one data center didn't receive three specific consecutive MoldUDP messages. Both streams were functional, and the many (many!) drop counters in the path did not indicate that there'd been any loss.
Servers in the other data center had no problems. Servers in the test environment also had no problems.
I pulled a couple of minutes of data from each of four sniffer points:
- The problem site's "A" feed handoff
- The problem site's "B" feed handoff
- The good site's "A" feed handoff (the "B" feed here was down because of a circuit failure)
- The test site's "B" feed (no "A" feed in test because of budget constraints)
Interestingly, the missing data had been batched differently by the two head-end servers (this is not uncommon):
- The "A" server put these 3 messages into the end of a large MoldUDP packet, along with earlier messages that had been received correctly.
- The "B" server batched these 3 messages into two different packets: one contained only the first missing message, the other contained the remaining two messages, plus a third message that had not been a problem.
Bad NASDAQ Server
The head-end servers responsible for this data feed had a nasty habit: every now and then, one of them would just stop transmitting data. After 100 or 200ms or so, the server would start back up.
When this freezing happened, no data got lost. All of the data got delivered, but it got delivered fast, as the service "caught up" with real time. In 100ms we'd usually get hundreds of packets containing thousands of messages. When the blue server locked up, there was no problem: his data was going to be trashed anyway. When the white server locked up, funny things happened. Here's what that looks like:
|100ms of silence from the primary site|
I know the problem here was on the NASDAQ end, and not in the transit path, because of the message batching. Usually we'd see between 1 and 5 messages per packet; batching during the catch-up interval was closer to 70 messages per packet. Only the source server (NASDAQ) could have done this -- network equipment in the transit path can't re-pack messages into fewer packets.
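That's easy to verify from the decoded captures: a quick tally of the messages-per-packet column from the decoder output above makes the catch-up bursts obvious. A sketch:

```perl
# Tally the messages-per-packet column from the decoder output above
# (columns: time, session, first sequence, message count).
my %packets_with;
while (<>) {
    my ( $time, $session, $first_seq, $count ) = split;
    $packets_with{$count}++;
}
printf "%3d msgs/packet: %6d packets\n", $_, $packets_with{$_}
    for sort { $a <=> $b } keys %packets_with;
```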
|Closer view of the problem|
Then, for about 100ms, servers received only "blue" data. Next, at 41.27, the backlog of "white" data started screaming in. Most of it was garbage (having already been delivered by the "blue" source) until we got to sequence ~31884700. At that point, the stream arbitration mechanism should have switched back from "blue" data to "white" data. Here's a closeup of that moment:
|Takeover of primary data stream|
What Went Wrong?
|Same picture as above, with extra color|
The stream arbitration mechanism should have dropped the blue stream, and picked up the white stream beginning with the packet that I've painted red. It didn't. Instead, the feed handler (a commercial product) was making the process-or-trash decision for each packet based solely on the first message sequence number from the MoldUDP header. The possibility that a packet might begin with old data but also contain some fresh data hadn't been considered, but that's what happened here.
The red packet began with a sequence that had already been processed on the blue stream, so the feed handler trashed it. Next, another "white" packet arrived. This packet began with a sequence much higher than expected. Clank! Sproing! Gap detected, alarm bells rang, log files were written, etc...
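Reconstructed as code (my guess at the logic, not the vendor's source), the broken test looked something like this:

```perl
# Per-packet arbitration on the header's first sequence number alone.
# $next_seq is the next message the receiver is waiting for.
if ( $first_seq < $next_seq ) {
    # The packet STARTS with old data, so assume all of it is old.
    # That's exactly the wrong call for the red packet: its first
    # messages were duplicates, but its tail was fresh.
    return;    # trash the whole packet
}
```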
The "missing" data was actually present in the top half of that red packet, and then was delivered again a short time later in a series of "blue" packets. Rotten stream arbitration code in the feed handler was the whole problem here.
No matter. The application said "network packet loss" so the problem landed in my lap :)
I worked with the software vendor to get an enhancement -- now they check the sequence number and the message counter in each packet before trashing it. I'm guessing that things were implemented this way because an earlier version of MoldUDP didn't contain a message counter. With this previous version, the only way to determine exactly which messages appeared in a given packet was to walk through the packet from beginning to end. Yuck.
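With the message counter in hand, the test can cover the whole packet instead of just its first message. Something like this sketch (process_messages_from is a hypothetical helper):

```perl
# Trash a packet only when every message in it has been seen. The
# packet covers sequences $first_seq .. $first_seq + $count - 1.
if ( $first_seq + $count <= $next_seq ) {
    return;    # the entire packet is a duplicate: safe to trash
}

# Otherwise, skip any leading duplicates and process the fresh tail.
my $skip = $next_seq > $first_seq ? $next_seq - $first_seq : 0;
process_messages_from($skip);    # hypothetical helper
```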
No Problem in the Other Sites
I'd previously said that only one of three environments had a problem. That's because the other environments weren't doing stream arbitration: the test environment had only one data stream because of cost concerns, and the alternate production environment was running with only one stream because of a circuit problem. These sites never switched from one feed to the other, so they never noticed anything wrong.