Skip to content

Address deadlocking and performance issues around existing indexes and claiming process. #1251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 17 commits into
base: trunk
Choose a base branch
from

Conversation

prettyboymp
Copy link
Contributor

@prettyboymp prettyboymp commented Mar 25, 2025

This PR is a combination of fixes that will be broken into smaller, more specific PRs. I combined them into one for initial review/testing as all of the issues together contributed to deadlocking and performance problems that limit the ability to increase the rate which actions can be processed. Testing while fixing only one of these issues at a time still caused the system to bog down enough to prevent a clear understanding of the improvements.

Main issues being addressed:

Reduce the the deadlock frequency caused by claiming pending actions.

Ref #1104

The atomic query used to claim actions frequently causes deadlocks on the actions table due to the amount of time it takes to apply the updates to the table. The best way to reduce this is to get rid of any sorting needed as it limits the records that end up being locked during the update.

I'm proposing two different options for addressing this:

"priority list"

This proposes that we change the philosophy around claiming actions so that attempts, scheduled_date_gmt, or action_id are no longer relevant when claiming actions. Only priority is considered. So actions with the same priority may execute in any order.

Then, to fully remove the order by clause, we will query the unique priorities that are pending and loop through those in order to claim actions of each priority until our Queue is filled.

To further improve this update query, the KEY claim_id_status_priority_scheduled_date_gmt was added to the wp_actionscheduler_actions table.

Downsides:

  • Removes the action_scheduler_claim_actions_order_by
  • Changes the philosophy of prioritizing claims - may have some unexpected consequences if extensions assume schedule date dictates order.
  • The query to retrieve active priorities can be slow.
  • Potential for a loop of 255 queries.

"pre query actions"

This is the more backward compatible change. It simply makes a select query to get a set of action_ids to work with prior to running the update to claim the actions. The update query still applies the same where clauses, so it remains atomic and avoids more than one claim running actions, but limits the records that become locked by the action_ids returned from the prequery.

I decided to try this late, so I haven't tested it as much as the priority list option. But after some initial tests, it seems more promising and has less downside.

Downsides:

  • There isn't a great way to index to deal with the where and order by clauses in the pre-select query - but that is at least a select query and not the update.
  • Race conditions of multiple instances attempting to claim the same actions can cause one end up without any. This can probably be addressed with re-attempts if a set of action_ids are pre-queried, but no updates occur.

[ ] - Need to keep the claim_id_status_scheduled_date_gmt index if this is the path chosen.

Stop resetting the claim_id of completed actions.

Ref #1105

Many queries used by ActionScheduler apply claim_id filters. However, the cardinality of this column is extremely low because all actions, outside of those with actively running claims, have a claim_id of 0. The current behavior resets even completed actions back to a claim_id of 0. As a result, the MySQL optimizer will often choose to use indexes that aren't as efficient because it's estimations don't show that the indexes with the claim_id column will be as efficient.

To fix this, only actions that are still pending will get set back to a claim_id of 0 after a claim queue has been processed. This allows.

Improve the query performance of QueueCleaner::reset_timeouts()

This adds a usable index (status_last_attempt_gmt) for the query used in QueueCleaner::reset_timeouts(). The orderby argument was also updated to modified to help force the use of the index. During testing, the optimizer would often use status_claim_id based indexes which always performed much worse. This also helps make sure the oldest claimed actions are reset first.

Improve the query performance of calculating the current number of claims running.

Adds a key on status_claim_id for the query used to determine the number of claims already running when deciding whether to spawn more.

Testing

I added the following to a file in mu-plugins to allow testing a heavier load on ActionScheduler:

// Switch between these to toggle between implementations for load testing.  Comment out both to compare to the original implementation.
define('CLAIM_STAKING_METHOD', 'priority_list');
//define('CLAIM_STAKING_METHOD', 'prequery');

/**
 * Bump up the batch size to 50 actions at a time.
 */
add_filter( 'action_scheduler_queue_runner_batch_size', function ( $size ) {
	return 50;
} );
/**
 * Allow up to 5 concurrent queues/claims to run at a time.
 */
add_filter( 'action_scheduler_queue_runner_concurrent_batches', function ( $count ) {
	return 5;
} );


/**
 * Actions used for testing.
 */
function schedule_random_action() {

	$foo         = rand( 0, 10000 );
	$priority    = rand( 0, 255 );
	$time_offset = rand( 1, 24 ) * HOUR_IN_SECONDS;

	return as_schedule_single_action( time() + $time_offset, 'as_performance_testing_action', array( 'foo' => $foo ), '', false, $priority );
}

// The action callback hook will reschedule another action instance in the future to keep the load up.
add_action( 'as_performance_testing_action', function() {
	// comment this line out to have the action complete successfully but not fail.
	schedule_random_action();
});


if ( class_exists( 'WP_CLI' ) ) {
	/**
	 * Generate 50K actions for testing.  These are no-op callbacks so that we're testing the load of scheduling, not the actions themselves.
	 *
	 * Example:
	 *  wp generate_actions
	 */
	WP_CLI::add_command(
		'generate_actions',
		function () {
			if ( ! did_action( 'action_scheduler_init' ) ) {
				WP_CLI::error( 'ActionScheduler is not initialized!!!' );
			}

			global $wpdb;
			WP_CLI::line( ' current actions: ' . $wpdb->get_var( "SELECT COUNT(*) FROM wp_actionscheduler_actions" ) );

			$num_to_create = 50000;

			for ( $i = 0; $i < $num_to_create; $i ++ ) {
				if ( ! schedule_random_action() ) {
					WP_CLI::warning( 'action not created' );
				}
			}
			WP_CLI::success( "Created $num_to_create actions" );
		}
	);
}
  • You can add actions to the queue using wp generate_actions. This will add 50K pending actions that are randomly spread out across the next 24 hours. When each is run, it will reschedule itself again randomly within the next 24 hour period.
  • Watch error logs for deadlocks and engine status with the database.
  1. Use the wp generate_actions command to generate a lot of actions. I was testing against continually processing ~1M in a 24 hour period.
  2. Open multiple instances of the WP-Admin in the browser to increase the number of polling ajax requests that may trigger the scheduler queue.
  3. Adjust the action_scheduler_queue_runner_concurrent_batches filter to set different levels of load and observe the point where deadlocks start to appear, if at all.
  4. Modify the CLAIM_STAKING_METHOD constant to change the algorithm used for claiming.

$error = empty( $wpdb->last_error )
? _x( 'unknown', 'database error', 'action-scheduler' )
: $wpdb->last_error;
$use_priorities = true;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this variable and flow clause below specifically for simplifying comparing the performance of the two operations.


throw new \RuntimeException(
sprintf(
foreach ( $pending_priorities as $pending_priority ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This potentially loops over up to 255 priorities. However, in the real world, it's somewhat unlikely that we would have a spread of a full 255 different levels of priorities being scheduled. Even with the full potential range scheduled, testing has shown that it is still more efficient than the deadlocks caused by order by clauses.

If we find we need to reduce this range further, my first thoughts are to apply a filter when adding actions to limit the unique priorities that can be used. I don't see a need for 255 options.

@prettyboymp
Copy link
Contributor Author

I need to make some updates for PHP <7.4 compat.

…backwards compatibility by continuing to use the orderby filter.
@WPprodigy
Copy link
Member

I've found this to be a magic bullet: #1259 :)

@prettyboymp
Copy link
Contributor Author

I've found this to be a magic bullet: #1259 :)

@WPprodigy, thank you pointing this out. I'll add it to the options we compare when we take a deeper dive into this.

}
if ( property_exists( $wp_object_cache, 'cache' ) ) {
$wp_object_cache->cache = array();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't directly related to this PR, but the CLI commands being spawned when testing with https://github.com/woocommerce/as-pressure-cooker were causing a bunch of errors/warnings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants