Friday, April 28, 2017

HBase and big key/value pairs

In my previous post I mentioned that HBase does not perform well when retrieving large key/value pairs (known as cells). I decided to look under the surface, and here I will present some of my findings.

My goal is to create an HTTP file server serving data from HBase, similar to HBase REST, but with embedded business logic. I'd like to start sending the response as soon as possible, and to send data at the highest rate the hardware permits.

Surface


The first apparent problem with retrieving large cells lies in the HBase client API. The getter for a cell's value returns a byte array, which means the value has to be fully buffered by the client before the application can access any of it. If the client intends to process the data as a stream, this is highly inefficient: streaming can only begin after the entire value has arrived.
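To make the cost concrete, here is a toy model (plain Python with a fake chunked transport, not the actual HBase client API) contrasting a buffer-everything getter with a hypothetical streaming one:

```python
def fetch_chunks(value, chunk_size=4):
    """Simulate a transport delivering a cell value in chunks."""
    for i in range(0, len(value), chunk_size):
        yield value[i:i + chunk_size]

def get_buffered(value):
    """Like a byte-array getter: the whole value is assembled before return."""
    buf = bytearray()
    for chunk in fetch_chunks(value):
        buf.extend(chunk)           # the caller sees nothing until the loop ends
    return bytes(buf)

def get_streaming(value):
    """Hypothetical streaming getter: chunks reach the caller immediately."""
    yield from fetch_chunks(value)

cell = b"x" * 20
first = next(get_streaming(cell))   # available after a single chunk
whole = get_buffered(cell)          # available only after all chunks arrive
```

The streaming variant hands the caller data after one chunk; the buffered variant returns nothing until the last chunk has been consumed, which is exactly the delay described above.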

Next I measured the time between sending the request and receiving the first byte of the response. For optimal performance this should be independent of cell size, but I found that it is not. That's because of the second, less apparent problem: the server buffers the entire response in memory before sending it to the client. The data has to be read from the underlying storage first; only then is it sent to the client. As a result, the time to first byte is proportional to the cell size.

Internals

Under the hood HBase uses an RPC implementation for client-server communication. This means that the server sends data to the client only in response to a client call, and only one response per call is allowed. Scans are implemented as a series of "give me more" requests sent to the server. In the current version these are sent when the client finishes processing the current batch, though HBASE-13071 will add a background prefetch, retrieving the next batch while the current one is still being processed by the application.
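The "give me more" pattern is easy to sketch; the names below are made up for illustration, and the model simply counts one round trip per batch:

```python
def scan_batches(rows, batch_size):
    """Server side: one batch per 'give me more' call."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]    # one RPC response per batch

def client_scan(rows, batch_size=3):
    """Client side: issue a call, process the batch, then call again."""
    calls = 0
    out = []
    for batch in scan_batches(rows, batch_size):
        calls += 1                      # one round trip per batch
        out.extend(batch)               # processing happens here, then the next call
    return out, calls

rows = list(range(10))
result, rpc_calls = client_scan(rows)   # 10 rows, batches of 3 -> 4 round trips
```

With prefetch (as in HBASE-13071), the next call would overlap with processing instead of following it, hiding one round trip of latency per batch.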

Concurrent RPC calls are multiplexed over the same socket connection. With larger messages this can create a head-of-line blocking problem. It is mitigated to some extent by buffering the entire response in memory: usually the server is able to use all of the available bandwidth, so the throughput is the same as (or better than) it would be with dedicated connections. However, in certain cases the multiplexing can reduce overall throughput, for example when the client runs multiple threads sending memory-mapped files to the server. Multiplexing also forces the client to read each response as fast as possible in order to let the other responses arrive.
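A back-of-the-envelope model shows the head-of-line effect on a shared connection (the sizes and bandwidth below are illustrative, not measured):

```python
def completion_times(responses_mb, bandwidth_mb_s):
    """One shared connection: responses go out back to back, so each
    response finishes only after everything queued ahead of it."""
    t, times = 0.0, []
    for size in responses_mb:
        t += size / bandwidth_mb_s
        times.append(t)
    return times

# A 10 MB response queued behind a 100 MB one, on a 10 MB/s link:
shared = completion_times([100, 10], 10)   # the small response waits 10 s
alone  = completion_times([10], 10)        # on its own it would take 1 s
```

The small response completes in 11 s instead of 1 s: it pays for the large response's transfer time, which is the cost multiplexing can impose.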

The wire protocol is implemented using Google's Protobuf library, which is not designed to handle large messages. Protobuf does not allow access to part of a message before all of it is deserialized. This means that stream-reading the server responses on the client side would require either a protocol modification or a custom-coded library.
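The constraint itself is easy to model without Protobuf: with a length-prefixed envelope (a sketch, not the real wire format), the decoder can expose nothing until the whole payload is in memory:

```python
import struct

def encode(payload: bytes) -> bytes:
    """Length-prefixed frame, loosely like an RPC response envelope."""
    return struct.pack(">I", len(payload)) + payload

def decode(frame: bytes) -> bytes:
    """The decoder must hold the full payload before returning any of it."""
    (length,) = struct.unpack(">I", frame[:4])
    payload = frame[4:4 + length]
    if len(payload) != length:
        raise ValueError("partial frame, cannot decode yet")
    return payload

value = decode(encode(b"cell-value"))
```

A streaming protocol would instead frame the value as a sequence of small chunks, each independently decodable; that is the kind of protocol modification the paragraph above refers to.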

Friday, April 21, 2017

Hadoop as a general purpose file repository

Starting point

I'm looking for a replacement for a file storage system; the current implementation stores the data on a RAID array. Every time we hit a size limit on the array, we replace the drives with larger ones and copy the data over. With the data size now growing by a few terabytes every year, these replacements are becoming costly. We need a system that reduces the operating cost of file storage.

Hadoop

The Hadoop distributed file system (HDFS) can scale out: instead of replacing the current hardware we just add more machines, and the new storage becomes immediately available for use. It also runs on commodity hardware, meaning it does not require vendor-specific equipment, just generic server machines. Hadoop can additionally run MapReduce jobs on the stored content, and can be deployed on both the Amazon and Microsoft clouds. Hadoop is free software, but at least two companies provide commercial support. What's not to like? Well, there's just one thing.

Small file problem

HDFS is capable of scaling up to thousands of terabytes, with just one catch. Simplifying a little, the list of files has to fit in the memory of a single machine, the name node. The name node may require up to 1 GB of memory for every million files, so storing hundreds of millions of files would require powerful hardware and careful tuning. In addition, the name node metadata has to be loaded from disk on every restart, which takes time proportional to the number of files.
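Using the ~1 GB per million files rule of thumb, the heap requirement is easy to estimate (the constant is a rough sizing figure, not a guarantee):

```python
def namenode_heap_gb(file_count, gb_per_million=1.0):
    """Rough name node heap estimate for a given number of files."""
    return file_count / 1_000_000 * gb_per_million

small = namenode_heap_gb(10_000_000)    # 10 GB: manageable on ordinary hardware
large = namenode_heap_gb(300_000_000)   # 300 GB: powerful hardware and careful tuning
```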
There's no real solution to the small file problem. Workarounds are available:
  • Use multiple independent name nodes sharing the same data nodes (i.e. machines physically hosting the data on their disks). This requires some partitioning logic on the client side, so that the client knows which name node to check for a particular file.
  • Coalesce multiple files into a single Hadoop file, using HAR, sequence files or HBase. Each one of these has its limitations and won't suit every use case. I'll present them shortly.
What happens if you have too many files? The name node process starts using excessive amounts of memory. High memory usage may degrade the service's responsiveness, and in extreme cases will cause it to quit entirely.
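The first workaround above requires some client-side routing; a minimal sketch, assuming files are looked up by a string ID (the hashing scheme and node addresses are made up for illustration):

```python
import hashlib

def pick_namenode(file_id: str, namenodes: list) -> str:
    """Deterministically map a file ID to one of several name nodes."""
    digest = hashlib.md5(file_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(namenodes)
    return namenodes[index]

nodes = ["nn1:8020", "nn2:8020", "nn3:8020"]
target = pick_namenode("invoice-2017-0428", nodes)
```

Every client that hashes the same ID asks the same name node, so no central lookup table is needed; the downside is that adding a name node remaps most IDs unless a consistent hashing scheme is used instead.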

Hadoop archive (HAR) files

HAR files are archives containing a small filesystem. The files and directories in a HAR archive are readable using the same interfaces as regular Hadoop files. HAR files are created by running a mapreduce job over a set of files and directories in HDFS. Once created, the archives are immutable.
For our use case HAR files were not a good fit, for a few reasons. One, we add files one at a time, so we would need to run the archive process on a schedule. We also support random file lookup by ID, so we would need to update the location of every file once the archive process completes. Two, HAR files are immutable, and we update files quite a lot; we would need another process to periodically purge outdated file revisions by unpacking and repacking the archives. Both are doable, but require additional coding, so we decided to explore the other alternatives first.

Sequence files

Sequence files store data as a list of key/value pairs. It is possible to append new values to the end of an existing sequence file, so new data can be stored immediately in its final location. The main drawback of sequence files is that they have to be read sequentially; in other words, locating a document by ID could mean reading the entire file. Since random access times matter for our use case, sequence files were not an option.
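The lookup cost is easy to see in a toy model (a sketch of sequential key/value records, not the real SequenceFile reader): finding a key means reading records until it appears, and a miss means reading everything.

```python
def lookup(records, wanted_key):
    """Sequential scan: the number of records read grows with the
    position of the key, up to the whole file on a miss."""
    reads = 0
    for key, value in records:
        reads += 1
        if key == wanted_key:
            return value, reads
    return None, reads

records = [(f"doc-{i}", f"payload-{i}") for i in range(1000)]
hit_last = lookup(records, "doc-999")   # must read all 1000 records
miss     = lookup(records, "doc-x")     # also reads all 1000
```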

HBase

HBase is a key/value store that uses Hadoop for data storage. Like sequence files, HBase stores data as a list of key/value pairs. Like HAR files, it indexes its content, allowing fast random access. It also handles all insert/update/delete operations transparently to the user, repacking its store files behind the scenes. HBase makes good use of HDFS by creating large files and appending to them whenever it makes sense, so the small files problem usually does not apply. Configuring HBase for optimal performance is not a trivial task, but when done well, HBase is also very fast, with one exception: data read operations are non-streaming, so retrieving a value takes time proportional to its size. With a 1 Gbps network adapter I got an average response time of 100 ms for a 1 MB file, but an entire second for a 30 MB file.
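Plugging those measurements into a quick calculation shows how far the effective rate is from the line rate (1 Gbps ≈ 125 MB/s):

```python
def effective_mb_per_s(size_mb, seconds):
    """Effective transfer rate implied by a measured response time."""
    return size_mb / seconds

line_rate = 1000 / 8                   # 1 Gbps expressed in MB/s
small = effective_mb_per_s(1, 0.1)     # the 1 MB file: 10 MB/s
big   = effective_mb_per_s(30, 1.0)    # the 30 MB file: 30 MB/s
```

Even the better of the two is under a quarter of the line rate, and response time scales roughly linearly with value size, consistent with the non-streaming read path described in the previous post.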

HBase/HDFS hybrid

With all of the above in mind, hybrid storage, with small files kept in HBase and large files kept in HDFS, may be a good option: storing small files in HBase avoids the small files problem, and storing large ones in HDFS avoids the latency penalty. The appropriate cutoff size depends on the data being stored, with a lower cutoff demanding more expensive name node hardware and a higher one resulting in worse latency for mid-sized files.
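The hybrid scheme boils down to a one-line routing decision; the 10 MB cutoff below is an arbitrary placeholder for illustration, not a recommendation:

```python
CUTOFF_BYTES = 10 * 1024 * 1024   # hypothetical cutoff, tune per workload

def choose_store(size_bytes: int) -> str:
    """Small files go to HBase (avoids the small files problem),
    large ones to HDFS (avoids the read latency penalty)."""
    return "hbase" if size_bytes < CUTOFF_BYTES else "hdfs"

a = choose_store(200 * 1024)         # a 200 KB file
b = choose_store(50 * 1024 * 1024)   # a 50 MB file
```

Raising the cutoff shifts mid-sized files into HBase (worse latency for them); lowering it shifts them into HDFS (more name node metadata), which is the trade-off described above.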