My goal is to create a HTTP file server serving data from HBase, similar to HBase REST, but with embedded business logic. I'd like to start sending the response as soon as possible, and send data at the highest rate permitted by hardware.
Surface
The first apparent problem with retrieving large cells is related to the HBase client API. The getter for cell value returns a byte array. This means that the cell value has to be fully buffered by the API before the application can access it. If the client intends to process the data as a stream, this is highly inefficient - the streaming can only happen after a delay.
Next I measured the time passed between sending the request and receiving the first byte of the response. For optimal performance this should be independent from cell size, but I found that this is not the case. That's because of the second, less apparent problem. The server buffers the entire response in memory before sending it to the client. The data has to be read from the underlying storage first, only then is it sent to the client. As a result, the time to start sending response is proportional to the cell size.
Internals
Under the hood HBase uses a RPC implementation for client-server communication. This means that server sends data to client only in response to a client call, and only one response per call is allowed. Scans are implemented as a series of "give me more" requests sent to the server. In current version these are sent when the client finishes processing the current batch, though HBASE-13071 will implement a background prefetch, retrieving next batch while the current one is still being processed by the application.Concurrent RPC calls are multiplexed over the same socket connection. With larger messages it can create a head-of-line problem. This is mitigated to some extent by buffering the entire response in memory - usually the server is able to use the entire available bandwidth, so the throughput is the same (or better) as it would be with dedicated connections. However, in certain cases the multiplexing can reduce overall throughput, for example when client is running multiple threads sending memory-mapped files to the server. Multiplexing also forces the client to read the entire response as fast as possible in order to allow other responses to arrive.
Wire protocol is implemented using Google's Protobuf library, which is not designed to handle large messages. Protobuf does not allow access to part of the message before all of it is deserialized. This means that stream-reading the server responses on client side would require either a protocol modification, or using a custom-coded library.