Optimize TaskStateCounts aggregate pipeline #617
Merged
MongoDB currently does not use indexes for $group stages within an aggregate pipeline. When a lot of tasks exist in the db, this simple aggregate -> sum operation therefore takes a long time, because it has to read every document and cannot just use an index. In real-world usage, this was the longest frequently running query we noticed in the logs, and it caused significant load on our db.
According to the docs, $match and $sort can both use an index. Inserting one of them at the beginning of the aggregate pipeline yields a roughly five-fold ($match) or ten-fold ($sort) decrease in query runtime (according to 1k test queries run by @kmavrommatis on our db). This happens because $group then operates on the $match or $sort results rather than scanning all the documents each time.
For additional reference, see the discussion at https://jira.mongodb.org/browse/SERVER-29444 and related issues, which propose eventually allowing simple group calls like this to use covered indexes.
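To illustrate the idea described above, here is a minimal sketch of a pipeline builder that puts an index-capable stage in front of $group. It is not the code changed in this PR: the function name `build_state_counts_pipeline`, the field name `state`, and the `known_states` parameter are hypothetical, and the sketch assumes an index exists on the grouped field.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_state_counts_pipeline(
    state_field: str = "state",  # hypothetical field name, assumed to be indexed
    known_states: Optional[Sequence[str]] = None,  # hypothetical parameter
) -> List[Dict[str, Any]]:
    """Build a count-per-state pipeline with an index-capable first stage."""
    if known_states is not None:
        # $match on the indexed field. Note that this changes the result:
        # documents whose state is not listed are left out of the counts.
        first_stage: Dict[str, Any] = {
            "$match": {state_field: {"$in": list(known_states)}}
        }
    else:
        # $sort on the indexed field keeps every document in the result.
        first_stage = {"$sort": {state_field: 1}}

    # The original aggregate -> sum step: one output document per state.
    group_stage = {"$group": {"_id": f"${state_field}", "count": {"$sum": 1}}}
    return [first_stage, group_stage]


sorted_pipeline = build_state_counts_pipeline()
matched_pipeline = build_state_counts_pipeline(known_states=["READY", "RUNNING"])
```

With PyMongo, the resulting list can be passed to `collection.aggregate(pipeline)`. Whether the leading stage actually helps depends on the index and the server version, so it is worth confirming with an explain plan on real data.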