Wednesday, October 10, 2018

TCP war story 3: excessive packet reordering

So there is this user who says that suddenly our server started responding very slowly. Earlier a page would load in a second; now it takes 4-5 minutes, he says. We run some checks, and the server is snappy as always. So we ask him for a Fiddler capture. Indeed, the loading times are high. We replay the same requests, and they are super fast here. We tell the user to go to network support. Network support says ping is good, packet loss is zero, no other users are complaining, so it must be a problem with the application.

How can there be a problem with the network if ping is good and there's no packet loss?

Well, apparently there can.

We asked the user for a packet capture. This wasn't his first encounter with network support, so he knew how to use Wireshark. Good for us. So we open the pcap file and immediately notice a lot of black, which is how Wireshark highlights TCP trouble by default. Every second or so a packet arrives 5 milliseconds late, and other packets arrive before it. The receiving TCP stack reacts correctly by sending a duplicate ACK for every out-of-order packet, and once the late packet arrives, there's one cumulative ACK for everything. But by then it's too late: 3+ duplicate ACKs have already been sent. Once the server receives these, it assumes the packet was lost, retransmits it, and drops out of slow start into congestion avoidance (CTCP in our case), so the send rate is halved.
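
To see why one late packet is enough, here is a minimal sketch of the receiver side. It is a toy model, not a real TCP stack: segments are numbered 0, 1, 2, ... instead of using byte sequence numbers, every arrival is ACKed immediately, and the function names and the arrival order are made up for illustration. The threshold of 3 is the usual fast retransmit trigger.

```python
DUP_ACK_THRESHOLD = 3  # the sender treats 3 duplicate ACKs as a loss signal


def acks_for_arrivals(arrival_order):
    """Return the cumulative ACK (next expected segment) sent after each arrival."""
    received = set()
    next_expected = 0
    acks = []
    for seg in arrival_order:
        received.add(seg)
        # Advance past every segment that is now contiguous.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def max_duplicate_acks(acks):
    """Longest run of repeated ACK numbers, not counting the first copy."""
    longest = run = 0
    for prev, cur in zip(acks, acks[1:]):
        run = run + 1 if cur == prev else 0
        longest = max(longest, run)
    return longest


# Hypothetical arrival order: segment 2 is late and gets overtaken by 3, 4, 5, 6.
arrivals = [0, 1, 3, 4, 5, 6, 2, 7]
acks = acks_for_arrivals(arrivals)
dupes = max_duplicate_acks(acks)

print("ACKs sent:", acks)
print("duplicate ACKs:", dupes)
print("sender assumes loss:", dupes >= DUP_ACK_THRESHOLD)
```

Nothing was lost here, yet the sender sees four duplicate ACKs and backs off as if a packet had been dropped. If fewer than three packets had overtaken the late one, the sender would not have reacted at all.
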
To make things worse, the MSS is only 587 bytes, and the RTT is on the order of 220 milliseconds. The resulting transfer rate is about 20 KB/s, far below the 1 Gbit/s that is normally available.
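
A back-of-the-envelope check of that number, assuming a classic Reno-style sawtooth: the window grows by one segment per RTT and is halved at every false loss signal. CTCP is not exactly Reno, and "one reordering event per second" is eyeballed from the capture, so treat this as a rough sketch rather than a model of the actual stack. The function name and the steady-state simplification are mine.

```python
def sawtooth_throughput(mss_bytes, rtt_s, event_interval_s):
    """Rough average send rate in bytes/s for a Reno-style sawtooth.

    Assumes the window grows by 1 segment per RTT and is halved at each
    event. In steady state the window climbs from W/2 back to W between
    events, so W/2 equals the number of RTTs between events.
    """
    rtts_between_events = event_interval_s / rtt_s
    peak_window = 2 * rtts_between_events   # W, in segments
    avg_window = 0.75 * peak_window         # mean of a ramp from W/2 to W
    return avg_window * mss_bytes / rtt_s


# Numbers from the capture: MSS 587 bytes, RTT ~220 ms, one event per ~1 s.
rate = sawtooth_throughput(mss_bytes=587, rtt_s=0.220, event_interval_s=1.0)
print(f"{rate / 1000:.1f} KB/s")
```

Under these assumptions the estimate comes out at roughly 18 KB/s, the same ballpark as what we saw. The window never gets a chance to grow beyond a handful of segments, so the link speed is irrelevant.
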

How do you convince the network support team that it's a problem that they need to fix? Well, I'm still trying to figure it out.