I was curious how ElasticSearch handles merging multiple result sets (one from each shard) into a single sorted response. In my last post I went over a few methods of merging multiple sorted streams. Then I went to search how ES handles this.
I assumed that ES uses a priority queue. Searching for PriorityQueue in Elasticsearch resulted in a bug report containing interesting stack trace.
It appears that SearchPhaseController.sortDocs is the method responsible for merging results.
In current version the sortDocs method uses Lucene's TopDocs.merge to do the sorting. The merge method uses priority queue containing one record for every input stream, removing and adding streams for every output record.
Quick look at the PriorityQueue implementation reveals updateTop method; this method's JavaDocs claim it to be "at least twice as fast" compared to removing & adding.
Modifying the method to use updateTop instead of pop/add was a rather trivial exercise. The new version is not twice as fast as the old one; the speed gains were on the order of 20-80%, depending on the input provided.
The patch was accepted, so next ElasticSearch's release is likely to contain the improved version.
No comments:
Post a Comment