Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change partition_by to return a random partition instead of a nil value for a parition_by function #298

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

cblage
Copy link

@cblage cblage commented Jan 31, 2025

This fixes an issue we were facing where the partition_key being nil meant that since

in_partition?(%Subscriber{partition_key: nil}, _partition_key), do: false

Then any events with a nil partition_key would exhaust the available subscribers - draining them from other queues - and especially in our case where 99% of events were going to the nil queue - meaning this queue always get priority vs other queues in terms of consuming events. This happened due to in EventStore.Subscriptions.SubscriptionFsm.next_available_subscriber/2 this never resolving into an existing queue:

 partition_subscriber =
      Enum.find(subscribers, fn {_pid, subscriber} ->
        Subscriber.in_partition?(subscriber, partition_key)
      end)

which meant that events for the nil queue would always grab a new Subscriber instead of grouping up with an existing one already processing a nil queue event. This exhausted the subscribers and prevented other queues with less events from being processed.

This fix prevents the issue by replacing nil partition_keys with a random partition based on the max_size that will allow the event queues to be processed with equal priority when partition_by is nil or returns nil.

This nil behavior was kept for when partition_by is nil since it optimizes the parallelism of the "default" behavior when no partition_by is provided.

@cblage
Copy link
Author

cblage commented Feb 1, 2025

Sorry, need to fix the tests

@cblage cblage marked this pull request as draft February 1, 2025 01:44
@cblage cblage changed the title Change partition_by to return a random integer instead of a nil value Change partition_by to return a random partition instead of a nil value for a parition_by function Feb 3, 2025
@cblage cblage marked this pull request as ready for review February 3, 2025 17:52
@cblage
Copy link
Author

cblage commented Feb 3, 2025

I updated the changes a bit and fixed the tests

@drteeth
Copy link
Contributor

drteeth commented Feb 11, 2025

Hello @cblage, thanks for putting the work in here.

I see the problem, where we need to be able to identify that we'd like the default partition strategy to be used, and then a custom strategy that may return nil and thus make very uneven distributions.

I'm worried that returning a random partition might be surprising behaviour in the case of a nil.

Is there a reason you wouldn't handle this in your application code? You would opt into the fallback behaviour that makes the most sense for you.

I agree that this is a bit of a foot-gun here though. Perhaps we should raise an error if partition_by/1 returns nil? This would give the developer the chance to catch this issue early and address it before going to production.

WDYT?

@drteeth drteeth self-requested a review February 11, 2025 18:08
@cblage
Copy link
Author

cblage commented Feb 11, 2025

Hi @drteeth!

Throwing an error when partition_by returns nil might be a better approach. We dealt with the problem on the application side, but it was difficult to debug why certain events were not getting processed in time - so we decided to contribute with this patch to prevent others from having the same problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants