Skip to content

Cut a few seconds off of every test using Cockroach #6934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 27, 2024

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Oct 25, 2024

This one has some history!

While working on some database-related APIs, I noticed that my tests using CockroachDB were a lot slower than other tests. To some degree, I expect this, but also, this was on the order of ~5-10 seconds per test doing very little other than CockroachDB setup and teardown.

After doing a little profiling, I noticed that tests took several seconds to perform teardown, which increases significantly if any schema changes occurred.

Why does teardown take so long? Well, it turns out we are sending SIGTERM to the CockroachDB process to gracefully terminate it, instead of SIGKILL, which would terminate it much more abruptly.

This is where the history comes in: Gracefully terminating CockroachDB was a choice we made a few years ago to avoid a test flake: #540. Basically, when creating the "seed database" -- a database where we apply schema changes that are shared between many tests -- we want to gracefully terminate to avoid leaving the database in a "dirty state", where it might need to flush work and cleanup intermediate state. In the case of #540, that "dirty intermediate state" was an absolute path, which meant copies of that seed database trampled on each other if graceful shutdown did not complete.

Our approach was to apply graceful termination to all CockroachDB teardown invocations, but this was overkill.
Only the seed database expects to have storage be in-use after the call to cleanup -- all other test-only invocations expect to immediately remove their storage. They don't need to terminate gracefully, and arguably, should just exit as quickly as they can.

This PR changes the disposition:

  • cleanup_gracefully uses SIGTERM, and waits for graceful cleanup. This is still used when constructing the seed db.
  • cleanup uses SIGKILL, and kills the database immediately. This is now used for all other use-cases.

As an example in the performance difference, here's a comparison for some datastore tests:

Before

  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [   6.996s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
        PASS [   7.344s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
        PASS [   8.609s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
     Summary [  11.386s] 3 tests run: 3 passed, 228 skipped

After

  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [   2.087s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
        PASS [   3.054s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
        PASS [   4.442s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
     Summary [   7.550s] 3 tests run: 3 passed, 228 skipped

@smklein smklein marked this pull request as ready for review October 25, 2024 19:44
@smklein smklein requested review from davepacheco and iliana October 25, 2024 19:44
@@ -137,7 +137,6 @@ impl DbRunArgs {
// receive SIGINT.
tokio::select! {
_ = db_instance.wait_for_shutdown() => {
db_instance.cleanup().await.context("clean up after shutdown")?;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this line because wait_for_shutdown already is calling cleanup

@@ -593,24 +609,43 @@ impl CockroachInstance {
.await;
}
self.child_process = None;

// It shouldn't really matter which cleanup API we use, since
// the child process is gone anyway.
self.cleanup().await
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(See above comment) this is where wait_for_shutdown already calls cleanup

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I only wonder if db-dev ought to shut down gracefully but it seems like not a big deal either way.

@smklein smklein merged commit dceed97 into main Oct 27, 2024
16 checks passed
@smklein smklein deleted the optimize-crdb-for-tests branch October 27, 2024 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants