Cut a few seconds off of every test using Cockroach #6934

smklein · 2024-10-25T19:36:24Z

This one has some history!

While working on some database-related APIs, I noticed that my tests using CockroachDB were a lot slower than other tests. To some degree, I expect this, but also, this was on the order of ~5-10 seconds per test doing very little other than CockroachDB setup and teardown.

After doing a little profiling, I noticed that tests took several seconds to perform teardown, which increases significantly if any schema changes occurred.

Why does teardown take so long? Well, it turns out we are sending SIGTERM to the CockroachDB process to gracefully terminate it, instead of SIGKILL, which would terminate it much more abruptly.

This is where the history comes in: Gracefully terminating CockroachDB was a choice we made a few years ago to avoid a test flake: #540. Basically, when creating the "seed database" -- a database where we apply schema changes that are shared between many tests -- we want to gracefully terminate to avoid leaving the database in a "dirty state", where it might need to flush work and cleanup intermediate state. In the case of #540, that "dirty intermediate state" was an absolute path, which meant copies of that seed database trampled on each other if graceful shutdown did not complete.

Our approach was to apply graceful termination to all CockroachDB teardown invocations, but this was overkill.
Only the seed database expects to have storage be in-use after the call to cleanup -- all other test-only invocations expect to immediately remove their storage. They don't need to terminate gracefully, and arguably, should just exit as quickly as they can.

This PR changes the disposition:

cleanup_gracefully uses SIGTERM, and waits for graceful cleanup. This is still used when constructing the seed db.
cleanup uses SIGKILL, and kills the database immediately. This is now used for all other use-cases.

As an example in the performance difference, here's a comparison for some datastore tests:

Before

  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [   6.996s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
        PASS [   7.344s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
        PASS [   8.609s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
     Summary [  11.386s] 3 tests run: 3 passed, 228 skipped

After

  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [   2.087s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
        PASS [   3.054s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
        PASS [   4.442s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
     Summary [   7.550s] 3 tests run: 3 passed, 228 skipped

smklein · 2024-10-25T19:46:37Z

dev-tools/db-dev/src/main.rs

@@ -137,7 +137,6 @@ impl DbRunArgs {
        // receive SIGINT.
        tokio::select! {
            _ = db_instance.wait_for_shutdown() => {
-                db_instance.cleanup().await.context("clean up after shutdown")?;


I removed this line because wait_for_shutdown already is calling cleanup

smklein · 2024-10-25T19:47:02Z

test-utils/src/dev/db.rs

@@ -593,24 +609,43 @@ impl CockroachInstance {
                .await;
        }
        self.child_process = None;
+
+        // It shouldn't really matter which cleanup API we use, since
+        // the child process is gone anyway.
        self.cleanup().await


(See above comment) this is where wait_for_shutdown already calls cleanup

davepacheco

Nice! I only wonder if db-dev ought to shut down gracefully but it seems like not a big deal either way.

Cut a few seconds off of every test using CRDB

0ca5eff

smklein marked this pull request as ready for review October 25, 2024 19:44

smklein requested review from davepacheco and iliana October 25, 2024 19:44

smklein commented Oct 25, 2024

View reviewed changes

Merge branch 'main' into optimize-crdb-for-tests

cd44de1

iliana approved these changes Oct 26, 2024

View reviewed changes

davepacheco approved these changes Oct 26, 2024

View reviewed changes

smklein merged commit dceed97 into main Oct 27, 2024
16 checks passed

smklein deleted the optimize-crdb-for-tests branch October 27, 2024 00:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cut a few seconds off of every test using Cockroach #6934

Cut a few seconds off of every test using Cockroach #6934

smklein commented Oct 25, 2024 •

edited

Loading

smklein Oct 25, 2024

smklein Oct 25, 2024

davepacheco left a comment

Cut a few seconds off of every test using Cockroach #6934

Cut a few seconds off of every test using Cockroach #6934

Conversation

smklein commented Oct 25, 2024 • edited Loading

Before

After

smklein Oct 25, 2024

Choose a reason for hiding this comment

smklein Oct 25, 2024

Choose a reason for hiding this comment

davepacheco left a comment

Choose a reason for hiding this comment

smklein commented Oct 25, 2024 •

edited

Loading