Recently I faced a problem where the tool we use to push files over SFTP started accumulating a backlog, even though plenty of bandwidth was available. The link had fairly high latency, so the first thing I checked was TCP window scaling.
TCP can limit the bandwidth available on long fat networks (LFNs). The receive window limits the amount of data that can be in flight - sent over the link but not yet acknowledged. In the original specification the receive window was limited to 64 KB. The bandwidth of any TCP link that does not employ TCP window scaling is therefore limited to 64 KB divided by the round-trip latency - so if the latency is 0.1 seconds, the maximum throughput with a 64 KB window is 640 KB/s. With window scaling, the receive window can grow to 1 GB.
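The throughput ceiling above is just the window size divided by the round-trip time, which can be sketched as:

```python
# The maximum throughput of a TCP connection is bounded by the receive
# window divided by the round-trip time (the bandwidth-delay product,
# rearranged to solve for bandwidth).
def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput, in bytes per second."""
    return window_bytes / rtt_seconds

# The classic unscaled 64 KB window over a link with 0.1 s round-trip time:
print(max_throughput(64 * 1024, 0.1))  # 655360.0 bytes/s, i.e. 640 KB/s
```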
I found that TCP window scaling has been implemented since Windows 2000, and Linux kernels from 2004 onwards implement dynamic receive buffers, which build on TCP window scaling. Since we use more recent OSes, I was fairly certain window scaling was not the problem here. Using Wireshark I confirmed that the SYN packets did indeed advertise a non-zero scale factor.
A quick look at the code revealed that our protocol implementation required an immediate acknowledgement of every write message, even though the SFTP specification allows sending multiple messages and collecting the acknowledgements later. Instead of waiting for an acknowledgement of every message, I started sending the entire file and then receiving all the acknowledgements. This resulted in a massive speed-up.
Driven by the desire to measure the improvement, I created a really large file, uploaded it using the old method, started uploading it using the new method, and... the transfer deadlocked. A TCP socket can only buffer a limited amount of data, and since I never retrieved the acknowledgements, at some point the server blocked on sending one. When the server blocked, it stopped processing my messages, which in turn blocked the sender.
Unfortunately, checking whether an acknowledgement was available for reading with the library we used was not easy, so I ended up limiting the number of unacknowledged messages to a known safe value, which still yielded a decent performance improvement.
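The bounded-pipelining idea can be sketched roughly as follows. This is not our actual code - the `send_write` and `recv_ack` callables, the chunk iteration, and the limit of 32 are all illustrative assumptions - but it shows the shape of the fix: keep sending until a fixed number of messages are outstanding, then drain one acknowledgement per new message, and collect the rest at the end.

```python
from collections import deque

# Hypothetical cap on unacknowledged write messages; in practice this must
# stay below what the server side can buffer before it blocks.
MAX_IN_FLIGHT = 32

def upload(chunks, send_write, recv_ack):
    """Pipeline writes, never leaving more than MAX_IN_FLIGHT unacknowledged.

    send_write(chunk) sends one write message; recv_ack() blocks until one
    acknowledgement has been read. Both are placeholders for the real
    protocol calls.
    """
    in_flight = 0
    for chunk in chunks:
        if in_flight == MAX_IN_FLIGHT:
            recv_ack()           # drain one ack before sending more
            in_flight -= 1
        send_write(chunk)
        in_flight += 1
    for _ in range(in_flight):   # collect the remaining acks
        recv_ack()

# Simulate the wire with a queue: sends append, acks pop in order.
wire = deque()
acked = []
upload(range(100), wire.append, lambda: acked.append(wire.popleft()))
print(len(acked), len(wire))  # 100 0 - every message acknowledged, in order
```

Setting `MAX_IN_FLIGHT` to 1 degenerates to the original write-then-wait behaviour, and removing the cap entirely reproduces the deadlock described above once the socket buffers fill.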