Optimize TaskStateCounts aggregate pipeline #617

Merged
merged 2 commits into from
Sep 10, 2019

tgjohnst
Contributor

@tgjohnst tgjohnst commented Sep 10, 2019

MongoDB currently does not use indexes for $group operations within an aggregate pipeline. When a lot of tasks exist in the db, this simple aggregate -> sum operation is slow because it has to read every document and cannot just use an index. In real-world usage, this was the longest-running frequent query we noticed in the logs, and it caused significant load on our db.

According to the docs, $match and $sort can both use the index, and inserting either at the beginning of the aggregate pipeline yields a roughly five-fold ($match) or ten-fold ($sort) decrease in query runtime (according to 1k test queries run by @kmavrommatis on our db). This happens because $group can then operate on the $match or $sort results rather than polling all the documents each time.

For additional reference, see the discussion at https://jira.mongodb.org/browse/SERVER-29444 and related issues proposing that simple group calls like this eventually be allowed to use covered indices.
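
The reordering described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `stage` type and the `stateCountsPipeline` helper are hypothetical names, the pipeline is built as plain maps so the example runs without a MongoDB driver, and it assumes the task documents have an indexed `state` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stage is one aggregation pipeline stage, e.g. {"$sort": {"state": 1}}.
// Hypothetical helper type for this sketch; a real driver would use its own
// BSON document types instead.
type stage map[string]interface{}

// stateCountsPipeline builds a pipeline that counts documents per state.
// The leading $sort on the (assumed) indexed "state" field is the stage that
// can use the index; $group then consumes the sorted results.
func stateCountsPipeline() []stage {
	return []stage{
		{"$sort": stage{"state": 1}},
		{"$group": stage{
			"_id":   "$state",
			"count": stage{"$sum": 1},
		}},
	}
}

func main() {
	// Print the pipeline as JSON so it can be compared with a shell query.
	out, err := json.MarshalIndent(stateCountsPipeline(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Swapping the first stage for a `$match` on `state` would give the other variant mentioned above; which one performs better is worth measuring against your own data.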

In its current state, MongoDB does not use indices in $group operations within an aggregate pipeline. If a lot of tasks exist, this means that the simple aggregate sum operation will take a long time, as it has to access all documents in the db and cannot just use the index of the state field. In real-world usage, this was the longest frequently running query and caused significant load on our db. However, $match and $sort are both capable of using said index, and inserting them at the beginning of the pipeline yields a roughly ten-fold decrease in query runtime (in our hands), as $group can then operate on the $match or $sort results rather than polling all documents.
ran gofmt and this is the proper syntax
Contributor

@adamstruck adamstruck left a comment

Nice catch and thanks for the contribution!

@adamstruck adamstruck merged commit 5a151d5 into ohsu-comp-bio:master Sep 10, 2019