Skip to content

Implement SKIP LOCKED during action claiming #1259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 20, 2025

Conversation

WPprodigy
Copy link
Member

@WPprodigy WPprodigy commented Mar 28, 2025

Avoiding SQL deadlocks during action claiming

When running multiple action scheduler queues concurrently, many will start to fatal error due to SQL deadlocks. I have found that this always stems from the action claims query. This query ends up scanning many rows, and being an UPDATE, it ends up creating locks on the table/rows. This results in concurrent claims queries conflicting with each other and deadlocking. It also causes deadlocks during individual action status updates from other queues that are already processing actions. This leaves actions in unknown states and results in lots needing to be failed later on due to timeouts during the cleanup periods.

This PR prevents the deadlocks altogether. The magic here comes from the SKIP LOCKED SQL feature: https://dev.mysql.com/blog-archive/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/. It's a perfect fit for the claims query, because we don't really care if it needs to skip over a few rows. All we want is for it to claim actions without conflict.

The SKIP LOCKED has to come from a SELECT query. So the new query here uses a SELECT sub-query (w/ FOR UPDATE to claim a lock) and then JOINs the results to accomplish first selecting the items and then updating them all within a single transaction.

I've done a lot of testing with this and ran into no deadlocks or failing actions. I've run hundreds of thousands of actions at 250 queue concurrency without any signs of trouble.

BEFORE (causes deadlocks):

UPDATE wp_actionscheduler_actions SET claim_id=4322, last_attempt_gmt='2025-01-15 03:24:52', last_attempt_local='2025-01-14 21:24:53' WHERE claim_id = 0 AND scheduled_date_gmt <= '2025-01-15 03:23:18' AND status='pending' ORDER BY priority ASC, attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 20;

AFTER (no deadlocks):

UPDATE wp_actionscheduler_actions t1 JOIN ( SELECT action_id from wp_actionscheduler_actions WHERE claim_id = 0 AND scheduled_date_gmt <= '2025-01-15 03:24:52' AND status='pending' ORDER BY priority ASC, attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 20 FOR UPDATE SKIP LOCKED ) t2 ON t1.action_id = t2.action_id SET claim_id=4322, last_attempt_gmt='2025-01-15 03:24:52', last_attempt_local='2025-01-14 21:24:53';

Considerations

I'm not necessarily suggesting we merge this PR as is (leaving as a draft for now), but I wanted to show the solution here and what I've found. We'll need to test and discover database compatibility before shipping this as the new default behavior.

SKIP LOCKED shipped with MySQL 8.0.1 (2017-04-10) and MariaDB in 10.6.0 (2021-4-26), and works only for InnoDB tables (ignored otherwise). I've only done the testing in MySQL only, and I'm not sure if it's safely ignored on the older versions or if it would cause issues.

I do think this would definitely be a worthwhile change to get into the plugin, as I've seen it be a source of a lot of issues. Some possible rollout options:

A) Put it into a new method (claim_actions_v2()), and use the new method when a feature flag is present via some constant or filter.
B) Check the database version, and automatically use the v2 method when it's modern enough.

@prettyboymp prettyboymp self-assigned this Apr 7, 2025
}
if ( property_exists( $wp_object_cache, 'cache' ) ) {
$wp_object_cache->cache = array();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't directly related to this PR, but the CLI commands being spawned when testing with https://github.com/woocommerce/as-pressure-cooker were causing a bunch of errors/warnings.

@@ -449,7 +449,7 @@ public function test_as_recover_from_incorrect_schema() {
$this->assertContains( 'Caught exception while enqueuing action "hook_18": Error saving action', $logged_errors );
$this->assertContains( 'Caught exception while enqueuing action "hook_19": Error saving action', $logged_errors );
$this->assertContains( 'Caught exception while enqueuing action "hook_20": Error saving action', $logged_errors );
$this->assertContains( "Unknown column 'priority' in 'field list'", $logged_errors );
$this->assertContains( "Unknown column 'priority' in ", $logged_errors );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format of this error is slightly different across supported DBs.

@prettyboymp
Copy link
Contributor

prettyboymp commented Apr 7, 2025

I've done some testing against this and am happy with the results. So I've made some additional changes to deal with compatibility:

  • Claim process always use the UPDATE .... JOIN (SELECT ... FOR UPDATE ... WHERE.... ORDER BY...) format (without SKIP LOCKED if not supported) rather than the previous UPDATE... WHERE... ORDER BY....

    This alone should get rid of most deadlocks since the latter is non-deterministic because the db doesn't lock the rows until attempting to update them when another process may have already locked a subset of the rows. By moving to a SELECT .. FOR UPDATE the locks are set during the select process and claimed in a known order. This can still result in long running lock waits for claim queries, but they should at least not result in deadlocks.

  • Use the SKIP LOCKED modifier on the the sub-select only on DB versions that support it.

    This should further improve performance for DB versions that support it by preventing the claim process from stacking caused by multiple concurrent processes attempting to wait on locks to clear by just continuing with the next records.

Testing Instructions

  1. Setup a site running Action Scheduler - it can be installed as a separate plugin. If you're running other plugins with ActionScheduler, you'll need to either disable those or apply the following patch to make sure this version loads:
diff --git a/action-scheduler.php b/action-scheduler.php
index ef12ca2..8d04084 100644
--- a/action-scheduler.php
+++ b/action-scheduler.php
@@ -5,8 +5,8 @@
  * Description: A robust scheduling library for use in WordPress plugins.
  * Author: Automattic
  * Author URI: https://automattic.com/
- * Version: 3.9.2
- * License: GPLv3
+ * Version: 3.9.3
+ * License: GPLv
  * Requires at least: 6.5
  * Tested up to: 6.7
  * Requires PHP: 7.1
@@ -29,29 +29,29 @@
  * @package ActionScheduler
  */

-if ( ! function_exists( 'action_scheduler_register_3_dot_9_dot_2' ) && function_exists( 'add_action' ) ) { // WRCS: DEFINED_VERSION.
+if ( ! function_exists( 'action_scheduler_register_3_dot_9_dot_3' ) && function_exists( 'add_action' ) ) { // WRCS: DEFINED_VERSION.

        if ( ! class_exists( 'ActionScheduler_Versions', false ) ) {
                require_once __DIR__ . '/classes/ActionScheduler_Versions.php';
                add_action( 'plugins_loaded', array( 'ActionScheduler_Versions', 'initialize_latest_version' ), 1, 0 );
        }

-       add_action( 'plugins_loaded', 'action_scheduler_register_3_dot_9_dot_2', 0, 0 ); // WRCS: DEFINED_VERSION.
+       add_action( 'plugins_loaded', 'action_scheduler_register_3_dot_9_dot_3', 0, 0 ); // WRCS: DEFINED_VERSION.

        // phpcs:disable Generic.Functions.OpeningFunctionBraceKernighanRitchie.ContentAfterBrace
        /**
         * Registers this version of Action Scheduler.
         */
-       function action_scheduler_register_3_dot_9_dot_2() { // WRCS: DEFINED_VERSION.
+       function action_scheduler_register_3_dot_9_dot_3() { // WRCS: DEFINED_VERSION.
                $versions = ActionScheduler_Versions::instance();
-               $versions->register( '3.9.2', 'action_scheduler_initialize_3_dot_9_dot_2' ); // WRCS: DEFINED_VERSION.
+               $versions->register( '3.9.3', 'action_scheduler_initialize_3_dot_9_dot_3' ); // WRCS: DEFINED_VERSION.
        }

        // phpcs:disable Generic.Functions.OpeningFunctionBraceKernighanRitchie.ContentAfterBrace
        /**
         * Initializes this version of Action Scheduler.
         */
-       function action_scheduler_initialize_3_dot_9_dot_2() { // WRCS: DEFINED_VERSION.
+       function action_scheduler_initialize_3_dot_9_dot_3() { // WRCS: DEFINED_VERSION.
                // A final safety check is required even here, because historic versions of Action Scheduler
                // followed a different pattern (in some unusual cases, we could reach this point and the
                // ActionScheduler class is already defined—so we need to guard against that).
@@ -63,7 +63,7 @@ if ( ! function_exists( 'action_scheduler_register_3_dot_9_dot_2' ) && function_

        // Support usage in themes - load this version if no plugin has loaded a version yet.
        if ( did_action( 'plugins_loaded' ) && ! doing_action( 'plugins_loaded' ) && ! class_exists( 'ActionScheduler', false ) ) {
-               action_scheduler_initialize_3_dot_9_dot_2(); // WRCS: DEFINED_VERSION.
+               action_scheduler_initialize_3_dot_9_dot_3(); // WRCS: DEFINED_VERSION.
                do_action( 'action_scheduler_pre_theme_init' );
                ActionScheduler_Versions::initialize_latest_version();
        }
  1. Install AS Pressure Cooker as a plugin. This PR has some tweaks/fixes for PHP8 and improved details on the report: https://github.com/woocommerce/as-pressure-cooker/pull/6
  2. Watch your site's error_log and slow query log (if available).
  3. Run the AS Pressure Cooker to test the changes:
wp action-scheduler pressure-cooker --starting_queue_runners=1 --finishing_queue_runners=30 --initial_past_due_actions=60000

Notes when testing:

  • The pressure cooker will start with --starting_queue_runners and end with --finishing_queue_runners, allowing you to see where errors start occurring, if any.
  • You may be limited in how many runners you can test with based on the memory and connection limits of your environment. So it may be a good idea to start with a lower number to see where your environment breaks down and then run a full test just below that limit.
  • --initial_past_due_actions determines how many actions are created prior to starting the test. More will be generated during the test, however, that generation may not be able to keep up with the processing. So it is a good idea to set this high enough to make sure the processing rate isn't brought down by there not being enough actions to be process.

Testing Results

I ran the following tests locally with the below diff applied to better simulate how action scheduler runs normally through requests and the limitation of an assumed 30s request timeout and forcing the clean up queries to be run.

diff --git a/classes/WP_CLI/ActionScheduler_WPCLI_QueueRunner.php b/classes/WP_CLI/ActionScheduler_WPCLI_QueueRunner.php
index 08cb0cd..150b470 100644
--- a/classes/WP_CLI/ActionScheduler_WPCLI_QueueRunner.php
+++ b/classes/WP_CLI/ActionScheduler_WPCLI_QueueRunner.php
@@ -60,6 +60,9 @@ class ActionScheduler_WPCLI_QueueRunner extends ActionScheduler_Abstract_QueueRu
 	 * @throws \WP_CLI\ExitException When there are too many concurrent batches.
 	 */
 	public function setup( $batch_size, $hooks = array(), $group = '', $force = false ) {
+		if( $this->get_execution_time() >= 28 ) {
+			return 0;
+		}
 		$this->run_cleanup();
 		$this->add_hooks();
 
@@ -120,6 +123,10 @@ class ActionScheduler_WPCLI_QueueRunner extends ActionScheduler_Abstract_QueueRu
 
 			$this->process_action( $action_id, $context );
 			$this->progress_bar->tick();
+
+			if( $this->get_execution_time() >= 28 ) {
+				break;
+			}
 		}
 
 		$completed = $this->progress_bar->current();

This PR - MySQL 8.0.4 - InnoDB Engine

wp action-scheduler pressure-cooker --starting_queue_runners=30 --finishing_queue_runners=30 --initial_past_due_actions=60000

Deleting existing actions and claims.
Generating test actions  100% [===================================================================================================] 1:34 / 1:35
Created 60000 past due actions.

Elapsed:              120s (100%)
Generated actions:    64400
Processed actions:    62180
Processing Rate:      518.2/s
Active queue runners: 28/30
Replacement rate      6/28
Database errors:      none detected

Elapsed:              120s (100%)
Generated actions:    64343
Processed actions:    52066
Processing Rate:      433.9/s
Active queue runners: 30/30
Replacement rate      0/30
Database errors:      none detected

Test completed.

TRUNK - MySQL 8.0.4 - InnoDB Engine

wp action-scheduler pressure-cooker --starting_queue_runners=30 --finishing_queue_runners=30 --initial_past_due_actions=60000

Deleting existing actions and claims.
Generating test actions  100% [===================================================================================================] 1:29 / 1:29
Created 60000 past due actions.

Elapsed:              120s (100%)
Generated actions:    64400
Processed actions:    31006 (654 fails)
Processing Rate:      258.4/s
Active queue runners: 29/30
Replacement rate      0/29
Database errors:      766 (766 deadlocks)

Test completed.

This PR - MySQL 8.0.4 - MyISAM Engine (SKIP LOCKED ignored)

This was really just run to verify no issues were caused on systems using MyISAM engines.

wp action-scheduler pressure-cooker --starting_queue_runners=30 --finishing_queue_runners=30 --initial_past_due_actions=60000

Deleting existing actions and claims.
Generating test actions  100% [===================================================================================================] 1:29 / 1:29
Created 60000 past due actions.

Elapsed:              120s (100%)
Generated actions:    64324
Processed actions:    16707
Processing Rate:      139.2/s
Active queue runners: 30/30
Replacement rate      0/30
Database errors:      none detected

Test completed.

@prettyboymp prettyboymp marked this pull request as ready for review April 7, 2025 19:49
Comment on lines 1031 to 1038
/**
* Determines whether the database supports using SKIP LOCKED.
*
* SKIP_LOCKED support was added to MariaDB in 10.6.0 and to MySQL in 8.0.1
*
* @return bool
*/
private function db_supports_skip_locked() {
Copy link
Member Author

@WPprodigy WPprodigy Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should maybe cache this check so it doesn't have to query and confirm the version support before every claim. Can do local runtime cache, but I'd prolly just do object cache with some relatively short TTL like 5 minutes even.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WPprodigy, I don't think that is necessary. My mysqli::$server_info is a read-only property of the underlying mysqli object. I can't find documentation explicitly saying it, but I believe it's set when the connection is made or at the very least would be saved within the connected mysqli object property between calls. https://www.php.net/manual/en/class.mysqli.php

@prettyboymp prettyboymp requested review from a team and albarin and removed request for a team April 11, 2025 19:11
@prettyboymp prettyboymp requested review from a team and kalessil and removed request for a team and albarin May 2, 2025 13:03
Copy link

@kalessil kalessil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to dive in this PR: leaving some questions in comments

@prettyboymp prettyboymp requested a review from kalessil May 13, 2025 13:58
Copy link

@kalessil kalessil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the answers @prettyboymp! I'll do the second review round.
Can you add some testing steps (I guess with the pressure cooker) so I can try them in my environment?

@prettyboymp
Copy link
Contributor

Thanks for all the answers @prettyboymp! I'll do the second review round. Can you add some testing steps (I guess with the pressure cooker) so I can try them in my environment?

Sorry, the testing instructions on in this comment: #1259 (comment)

@prettyboymp prettyboymp requested a review from kalessil May 19, 2025 12:47
Copy link

@kalessil kalessil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I tried my best running the tests, but the pressure cooker is not working properly under our default wp-env cli-container:

Deleting existing actions and claims.
Generating test actions  100% [===================================================================================================================================================================================================================================================================] 1:26 / 1:26
Created 60000 past due actions.

Elapsed:              120s (100%)
Generated actions:    60000
Processed actions:    0
Processing Rate:      0/s
Active queue runners: 29/30
Replacement rate      15/29
Database errors:      none detected

Hence I'll rely to passing CI and not hold this PR further.

@prettyboymp prettyboymp merged commit 240c108 into trunk May 20, 2025
206 of 208 checks passed
@prettyboymp prettyboymp deleted the update/prevent-action-claiming-sql-deadlocks branch May 20, 2025 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow claim actions update query causing lock wait timeouts
3 participants