An issue on a customer's network had me thinking about IP fragmentation recently, and now I find myself pounding some things that I find interesting about fragmentation into my keyboard.
Where should an oversized datagram be sliced?
RFC 791 suggests a scheme by which an IP datagram is sliced up so that the resulting fragments just fit through the constraining interface. This seems sensible, but there are some gotchas:
- If we fragment a 1500 byte packet to fit into a PPPoE link, we might wind up with 1492 bytes in the first fragment (20 bytes header, 1472 bytes payload) and 28 bytes in the second fragment (20 bytes header, 8 bytes payload). This works great until that first fragment tries to transit a GRE tunnel (MTU 1476) further along its path. If the PPPoE router had chopped the datagram roughly in half, both fragments would fit through the GRE tunnel without any problem.
- Depending on the MTU, we might not be able to make precisely MTU-sized fragments. This is because the fragment offset value in the IP header is expressed in terms of 8-byte chunks. Every IP fragment must have an offset that's a multiple of 8, which means every fragment except the last must carry a payload that's a multiple of 8 bytes. Carving a 1499 byte first fragment (20 bytes header, 1479 bytes payload) out of a 1500 byte datagram just isn't possible.
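To make the arithmetic concrete, here's a rough Python sketch of both slicing strategies: fill every fragment to the MTU, or spread the payload into equal-ish chunks. The function name `fragment_payloads` and its `balanced` flag are my own inventions, not anything from RFC 791 or a real IP stack, and the sketch assumes the same 20 byte header on every fragment (no IP options).

```python
def fragment_payloads(total_len, mtu, header_len=20, balanced=False):
    """Return a list of (offset_bytes, payload_len), one entry per fragment.

    Hypothetical helper for illustration only. Assumes every fragment
    carries the same header_len (i.e. no IP options).
    """
    payload = total_len - header_len
    # Largest payload a fragment can carry, rounded down to a multiple of 8
    # because the fragment offset field counts 8-byte units.
    max_chunk = (mtu - header_len) // 8 * 8
    if max_chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    if payload <= mtu - header_len:
        return [(0, payload)]  # fits without fragmentation

    count = -(-payload // max_chunk)  # ceiling division: fragments needed
    if balanced:
        # Spread the payload evenly, then round the chunk up to a multiple of 8.
        chunk = -(-payload // count)
        chunk = -(-chunk // 8) * 8
    else:
        chunk = max_chunk  # fill each fragment to the MTU

    fragments = []
    offset = 0
    while offset < payload:
        size = min(chunk, payload - offset)
        fragments.append((offset, size))
        offset += size
    return fragments


# The PPPoE example from above: 1500 byte datagram, 1492 byte MTU.
print(fragment_payloads(1500, 1492))                 # [(0, 1472), (1472, 8)]
print(fragment_payloads(1500, 1492, balanced=True))  # [(0, 744), (744, 736)]
```

With the balanced split, the fragments are 764 and 756 bytes on the wire once the 20 byte header is added back, so both squeeze through the 1476 byte GRE tunnel.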
What size is the header?
The "best" size for IP fragments might be to slice them into equal-ish sized chunks (aligned to fit on that 8 byte boundary), but even that is too simple: The IP header on the initial fragment might be a different size from the header applied to subsequent fragments, requiring the initial fragment to carry a smaller data payload than the following fragments. The issue here is IP options: some options must be present in all fragments, while other options can be omitted from all but the first fragment. Generally speaking, IP options that are used by transit devices (source routing, for example) are the ones that must be copied into every fragment. You can tell whether an option will be copied into every fragment by looking at the first (high-order) bit of the IP option type octet, known as the copied flag.
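Pulling that bit out is a one-liner. Here's a small sketch that splits an option type octet into the three fields RFC 791 defines (copied flag, class, number). The function name `decode_option_type` is just something I made up for illustration.

```python
def decode_option_type(option_type):
    """Split an IP option type octet into (copied, option_class, number).

    Layout per RFC 791: 1 bit copied flag, 2 bits class, 5 bits number.
    """
    copied = bool(option_type & 0x80)       # high-order bit: copy into every fragment?
    option_class = (option_type >> 5) & 0x03
    number = option_type & 0x1F
    return copied, option_class, number


# Loose Source Routing is option type 131: copied into every fragment.
print(decode_option_type(131))  # (True, 0, 3)
# Record Route is option type 7: only carried in the first fragment.
print(decode_option_type(7))    # (False, 0, 7)
```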
Fragments are sent in what order?
You might assume that the fragments of an IP datagram would be sent in order, starting with the fragment at offset zero. And you might be surprised.
Some versions of the Linux kernel send IP fragments in reverse order. The idea here is to effect a performance optimization (on other Linux systems, presumably). There are two facets to this strategy:
- Only the last fragment (the one with the MF bit set to zero) can tell you how big the whole packet is. Earlier fragments only tell you that "more" is coming, but you can't guess how much. By receiving the last fragment first, the receiver is able to optimize memory allocation for all fragments that comprise the datagram.
- Somehow it's easier to line the packets up and then copy them into memory when they arrive in this order. It's got something to do with not having to identify the end of the byte string, but rather jamming incoming payload right at the head of data we've already got. #NotAProgrammer
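The first point is easy to show with arithmetic. A minimal sketch, assuming we've already pulled the MF bit, the fragment offset (in 8-byte units) and the payload length out of a fragment's header. The function name is hypothetical, and this isn't how any particular kernel does it.

```python
def datagram_payload_length(offset_units, payload_len, more_fragments):
    """Return the total payload length of the original datagram, or None.

    Only the last fragment (MF bit clear) reveals where the data ends:
    its offset (in 8-byte units) times 8, plus its own payload length.
    """
    if more_fragments:
        return None  # "more" is coming, but we can't tell how much
    return offset_units * 8 + payload_len


# First fragment of the PPPoE example: MF set, so no total yet.
print(datagram_payload_length(0, 1472, more_fragments=True))   # None
# Last fragment: offset 184 units (1472 bytes) plus 8 bytes of payload.
print(datagram_payload_length(184, 8, more_fragments=False))   # 1480
```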
Fragments may be a nightmare at your security perimeter.
Non-first fragments can't be classified by L4 or higher-layer criteria because the L4 header is only present in the first fragment. No surprise there.
Modern firewalls mostly have this figured out, but router ACLs certainly don't, and even devices which can do fragment reassembly for inspection purposes might require the fragments to arrive in order (fragguard feature on PIX).
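Overlapping fragments are another headache, because the device inspecting the traffic and the host reassembling it might not resolve the overlap the same way:
- A fragment containing bytes 0-5 might say: attach
- A subsequent fragment containing bytes 5-6 might say: k!
Did the security device see attach! or attack!? What about the target system? Linux, incidentally, will interpret this overlap as attach!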
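Here's a toy sketch of why the answer depends on who's doing the reassembly. It rebuilds a payload from overlapping pieces using one of two made-up policies: "first" keeps whatever byte arrived first, and "last" lets later fragments overwrite earlier ones. The byte offsets mirror the simplified example above rather than real 8-byte-aligned fragment offsets, and the function isn't modeled on any actual stack's code.

```python
def reassemble(fragments, policy="first"):
    """Rebuild a payload from (offset, data) pieces given in arrival order.

    policy="first": bytes already received win over later overlapping bytes.
    policy="last":  later fragments overwrite earlier bytes.
    Hypothetical illustration only; assumes no gaps in the final payload.
    """
    if policy not in ("first", "last"):
        raise ValueError("policy must be 'first' or 'last'")
    received = {}
    for offset, data in fragments:
        for i, byte in enumerate(data):
            position = offset + i
            if policy == "first" and position in received:
                continue  # keep the byte we already have
            received[position] = byte
    return bytes(received[p] for p in sorted(received))


fragments = [(0, b"attach"), (5, b"k!")]
print(reassemble(fragments, policy="first"))  # b'attach!'
print(reassemble(fragments, policy="last"))   # b'attack!'
```

The "first" policy produces the same result described for Linux above. A device that inspects with one policy while the target reassembles with the other is looking at different data than the host will act on.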
It probably doesn't matter. Overlapping-but-dissimilar data is probably enough of a reason to kill these packets before they reach their destination, and that's what most security devices will do if they notice this sort of nonsense. Some might even make the unfortunate decision to kill the associated TCP flow or UDP pseudoflow altogether.
It's unfortunate because mismatched-and-overlapped IP fragments are not necessarily the result of malicious intent. There's only a 16-bit IP ID space, so a big server talking to lots of clients can wrap these numbers pretty quickly, and packets transiting different network paths might be fragmented into different sized chunks. It's not likely, but an attempt to reassemble fragments of different IP packets bearing the same IP ID number is a real possibility.
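For a sense of how quickly "pretty quickly" is, here's some back-of-the-envelope arithmetic. It assumes the sender draws IP IDs from a single counter shared by all destinations, which isn't true of every stack, and the packet rates are hypothetical.

```python
IP_ID_SPACE = 2 ** 16  # the IP ID field is 16 bits wide


def seconds_until_id_wrap(packets_per_second):
    """Time for a single shared IP ID counter to cycle through every value."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return IP_ID_SPACE / packets_per_second


# Hypothetical send rates, assuming one global counter.
for pps in (1_000, 10_000, 100_000):
    print(f"{pps} pps: wraps in {seconds_until_id_wrap(pps):.2f} seconds")
```

If fragments can linger in a reassembly queue for longer than the wrap time, a stale fragment and a fresh one can end up sharing an IP ID.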
I found myself wondering about the fragments I saw at my customer's edge recently, so I banged together a little script to visualize the packets arriving at the edge. Basically, I'm plotting fragment size vs. time, and color coding things so that I can tell fragments from whole packets and pick out fragments which have arrived out of order.
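The classification step behind that kind of plot can be sketched in a few lines of Python. To be clear, this is not the script I used, just a minimal illustration of the idea: parse the raw IPv4 header, decide whether the packet is a fragment, and flag a fragment as out of order if its offset is lower than one already seen for the same datagram. The function names and labels are made up, and the plotting is left out.

```python
import struct


def fragment_info(packet):
    """Pull the fragmentation-related fields out of a raw IPv4 header."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    version_ihl, _tos, total_len, ident, flags_frag = struct.unpack(
        "!BBHHH", packet[:8]
    )
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    more = bool(flags_frag & 0x2000)       # MF bit
    offset = (flags_frag & 0x1FFF) * 8     # offset field counts 8-byte units
    # Fragments of one datagram share source, destination, protocol and IP ID.
    key = (packet[12:16], packet[16:20], packet[9], ident)
    return {
        "key": key,
        "size": total_len,
        "offset": offset,
        "more": more,
        "is_fragment": more or offset > 0,
    }


def classify(packets):
    """Label packets (in arrival order) as whole, fragment, or out of order."""
    highest_offset = {}
    labels = []
    for packet in packets:
        info = fragment_info(packet)
        if not info["is_fragment"]:
            labels.append("whole")
            continue
        previous = highest_offset.get(info["key"])
        if previous is not None and info["offset"] < previous:
            labels.append("out-of-order fragment")
        else:
            labels.append("fragment")
        highest_offset[info["key"]] = max(info["offset"], previous or 0)
    return labels
```

Feed `classify` the packets in arrival order (from a capture file, say) and pair each label with the packet's `size` and timestamp to get the points for a size vs. time plot.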
The script produces some visual output. Rather than describing it, I've made a video.