Starting point
I'm looking for a replacement for a file storage system; current implementation stores the data on a RAID array. Every time we hit a size limit on the array, we replace drives with larger ones and copy the data over. Now that the data size increases by a few terabytes every year, the replacements are becoming costly. A system that would reduce the operating cost of file storage is needed.
Hadoop
Hadoop distributed file system (HDFS) can scale out - instead of replacing the current hardware we just add more machines, and the new storage becomes immediately available for use. It can also run on commodity hardware, as in - it does not require vendor-specific hardware, but rather can run on generic server machines. Hadoop also has the ability to run MapReduce jobs on the stored content, and can be deployed on both Amazon and Microsoft cloud. Hadoop is free software, but at least 2 companies provide commercial support. What's not to like? Well, there's just one thing.
Small file problem
HDFS is capable of scaling up to thousands of terabytes, with just one catch. Simplifying a little, the list of files has to fit in memory of a single machine - name node. The name node may require
up to 1 GB of memory for every million of files, so storing hundreds of millions of files would require powerful hardware and careful tuning. In addition, the name node information has to be loaded from disk on every restart, which takes time proportional to the file size.
There's no real solution to the small file problem.
Workarounds are available:
- Use multiple independent name nodes sharing the same data nodes (i.e. machines physically hosting the data on their disks). This requires some partitioning logic on the client side, so that the client knows which name node to check for a particular file.
- Coalesce multiple files into a single Hadoop file, using HAR, sequence files or HBase. Each one of these has its limitations and won't suit every use case. I'll present them shortly.
What happens if you have too many files? Name node process starts using excessive amounts of memory. High memory usage may decrease service responsiveness, and in extreme cases will cause it to quit completely.
Hadoop archive (HAR) files
HAR files are archives containing a small filesystem. The files and directories in a HAR archive are readable using the same interfaces as regular Hadoop files. HAR files are created by running a mapreduce job over a set of files and directories in HDFS. Once created, the archives are immutable.
For our use case HAR files were not a good fit for a few reasons. One, we add files one at a time, so we would need to run the archive process on schedule. We also support random file lookup by ID, so we would need to update file location for all files once the archive process is complete. Another reason is, HAR files are immutable, and we update files quite a lot. We would need another process to periodically purge outdated file revisions by unpacking and repacking the archives. Both are doable, but require additional coding. We decided to explore the other alternatives first.
Sequence files
Sequence files store data as a list of key/value pairs. It is possible to append new values to the end of an existing sequence file, so new data could be immediately stored to its final location. The main drawback of sequence files is, the files have to be read sequentially. In other words, to locate a document by ID in a sequence file we would potentially need to read all of the file. Since random access times are important to our use case, sequence files were not an option.
HBase
HBase is a key/value store system using Hadoop for data storage. Like sequence files, HBase also stores data as a list of key/value pairs. Like HAR files, HBase also indexes its content allowing for fast random access. It also handles all insert/update/delete operations transparently to the user, repacking its store files behind the scenes. HBase makes good use of HDFS by creating large files and appending to them whenever it makes sense, so the small files problem usually does not apply. Configuring HBase for optimal performance is not a trivial task, but when done well, HBase is also very fast, with one exception. Data read operations are non-streaming, so retrieving a value takes time proportional to its size. With 1GBps network adapter I got average response time of 100ms for 1 MB file, but an entire second for a 30MB file.
HBase/HDFS hybrid
With all of the above in mind, hybrid storage with small files stored in HBase and large files stored in HDFS may or may not be a good option; storing small files in HBase avoids the small files problem, and storing large ones in HDFS avoids increasing latency. The appropriate cutoff size depends on the data being stored, with lower cutoff demanding for more expensive name node hardware and higher resulting in worse latency for mid-sized files.