Wednesday, October 7, 2015

Merge-sorting multiple sorted inputs

Reading ElasticSearch documentation, I found that every ES index shard is a Lucene index instance; user queries are first executed on each individual shard, and then results are merged before sending to the requester.

Given s lists of n elements each, sorted in descending order, we want to produce a sorted list that will contain top n items from the input lists, combined. The trivial algorithm is:
  1. Take the first item from every list
  2. Pick the largest item from these. Remove it from its input list, and output it
  3. Repeat the above steps until we output n elements
Using a basic (linear) algorithm to find the largest element, repeated n times, we would end up with O(ns) computational complexity.

However, given that the list of items in point 2 is almost identical every time (only 1 item changes between iterations), we can achieve much better results using a heap-based priority queue. Updated algorithm:
  1. Take the first item from every list, build a priority queue
  2. Pick and remove the largest item from the queue. Output the item.
  3. Add next item from the same list to the queue.
  4. Repeat steps 2 and 3 until we output n elements
Removing and adding are both logarithmic, so we end up having O(n lg s) computational complexity

Now, knowing how the priority queues are implemented, we can save one operation by replacing the top element in queue instead of removing + adding. This way we rebuild heap only once.
Note: in rare cases this could result in doing more operations instead of less.

Sample implementation in C++