Starting point
I'm looking for a replacement for our file storage system; the current implementation stores the data on a RAID array. Every time we hit the size limit of the array, we replace the drives with larger ones and copy the data over. Now that the data grows by a few terabytes every year, these replacements are becoming costly. We need a system that reduces the operating cost of file storage.

Hadoop
The Hadoop distributed file system (HDFS) can scale out: instead of replacing the current hardware, we just add more machines, and the new storage becomes available for use immediately. It also runs on commodity hardware; that is, it does not require vendor-specific machines and works on generic servers. Hadoop can additionally run MapReduce jobs over the stored content, and can be deployed on both the Amazon and Microsoft clouds. Hadoop is free software, but at least two companies provide commercial support. What's not to like? Well, there's just one thing.

Small file problem
HDFS is capable of scaling up to thousands of terabytes, with just one catch. Simplifying a little, the list of files has to fit in the memory of a single machine, the name node. The name node may require up to 1 GB of memory for every million files, so storing hundreds of millions of files would require powerful hardware and careful tuning. In addition, the name node metadata has to be loaded from disk on every restart, which takes time proportional to the number of files.

There is no real solution to the small file problem, but workarounds are available:
- Use multiple independent name nodes sharing the same data nodes (i.e. machines physically hosting the data on their disks). This requires some partitioning logic on the client side, so that the client knows which name node to check for a particular file.
- Coalesce multiple files into a single Hadoop file, using HAR, sequence files or HBase. Each one of these has its limitations and won't suit every use case. I'll present them shortly.
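The first workaround above hinges on the client-side partitioning logic: every client must map a file ID to the same name node, with no coordination. A minimal sketch of one way to do that, using a stable hash; the name node addresses and the hashing scheme here are my own assumptions for illustration, not part of any Hadoop API:

```python
import hashlib

# Hypothetical name node addresses; in practice these would come from
# the client's configuration.
NAME_NODES = [
    "hdfs://namenode-0:8020",
    "hdfs://namenode-1:8020",
    "hdfs://namenode-2:8020",
]

def name_node_for(file_id: str) -> str:
    """Pick the name node responsible for a given file ID.

    Uses a stable hash (md5 rather than Python's built-in hash(),
    which is randomized per process) so every client maps the same
    ID to the same name node.
    """
    digest = hashlib.md5(file_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(NAME_NODES)
    return NAME_NODES[index]
```

Note that a plain modulo scheme like this reshuffles most files when a name node is added; consistent hashing would reduce that churn, at the cost of a little more code.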
Hadoop archive (HAR) files
HAR files are archives containing a small filesystem. The files and directories inside a HAR archive are readable through the same interfaces as regular Hadoop files. HAR files are created by running a MapReduce job over a set of files and directories in HDFS, and once created, the archives are immutable.

For our use case, HAR files were not a good fit, for a few reasons. First, we add files one at a time, so we would need to run the archiving process on a schedule; and since we also support random file lookup by ID, we would need to update the location of every file once the archiving process completes. Second, HAR files are immutable, and we update files quite a lot, so we would need another process to periodically purge outdated file revisions by unpacking and repacking the archives. Both are doable, but require additional coding. We decided to explore the other alternatives first.
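For reference, this is roughly what creating and reading a HAR archive looks like from the command line; the paths are hypothetical, and the `hadoop archive` command launches the MapReduce job mentioned above:

```shell
# Pack everything under /user/data/2013 into a single archive;
# -p gives the parent path, the last argument is the output directory.
# This runs a MapReduce job and produces /user/archives/2013.har.
hadoop archive -archiveName 2013.har -p /user/data 2013 /user/archives

# The archive is then readable through the usual filesystem interface,
# addressed via the har:// scheme:
hdfs dfs -ls -R har:///user/archives/2013.har
```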