
Blog: Dynamic Scaling with dataloss #468

Merged: 2 commits merged into main on Dec 22, 2023

Conversation

lenaschoenburg (Member)

Closes #466

@lenaschoenburg (Member Author)

Debatable if we want to merge this as is; we need to repeat the experiment eventually.

@ChrisKujawa (Member)

Need to check tomorrow, on my phone it is too small :D But yes, we should publish even negative results 👍

Great that you did this 🚀😀

Base automatically changed from os/support-dynamic-scaling to main December 19, 2023 08:39

@ChrisKujawa (Member) left a comment

Thanks @oleschoenburg, similar remarks as on the previous post. I'm also not 100% sure whether it is completely done; I'm missing a result statement or an explanation of the result. :)

@lenaschoenburg force-pushed the os/dynamic-scaling-dataloss branch from f7c6f79 to 2214226 on December 20, 2023 17:14
@lenaschoenburg (Member Author)

@Zelldon I've updated the post significantly and described the edge case we ran into. Not sure if you want to take another look or not :)

@ChrisKujawa (Member)

I will look at it tomorrow 👍

@ChrisKujawa (Member) left a comment

Thanks @oleschoenburg, cool stuff! Happy that we even found something to improve. I had some follow-up questions :)

Comment on lines +18 to +19
One goal is to verify that we haven't accidentally introduced a single point of failure in the cluster.
Another is to ensure that data loss does not corrupt the cluster topology.
Member

👍🏼


## Dataloss on the Coordinator

Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
Member

🔧

Suggested change:
- Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
+ Zeebe uses Broker 0 by default as the coordinator for changes to the cluster topology.

Member Author

It's not configurable yet

Comment on lines 48 to 49
After starting the operation with `zbchaos cluster scale --brokers 6` we see that the operation has started.
We then trigger dataloss on the coordinator with `zbchaos broker dataloss delete --nodeId 0`.
Member

🔧 How do you see this? Metrics/logs, something you want to share here?

Member Author

I didn't take screenshots for that part unfortunately. I think it's okay though - the interesting bit is what happens after.

With 4 members, the quorum is 3, meaning that the partition can only elect a leader and process if at least 3 members are available.
In our experiment, we made the coordinator unavailable, so we were already down to 3 members.
Additionally, the newly joining member had not started yet because it was waiting for a successful join response from the leader.
The newly joining member never received such a response because we took down the coordinator which was the leader of the partition.
Member

❓ Based on the screenshot, Zeebe-1 was the leader? Or is it always the case that the coordinator will be the leader during scaling?

Member Author

Ah, good catch 👍 The coordinator should have been the leader if everything was balanced. I'll remove this sentence; it's not really relevant.
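
A minimal sketch of the quorum arithmetic described in the excerpt above (hypothetical code, not taken from Zeebe):

```java
// Hypothetical sketch, not Zeebe code: the majority-quorum arithmetic from the
// excerpt above. With 4 members the quorum is 3; losing the coordinator and
// having the joining member wait for its join response leaves only 2 members,
// which is below the quorum, so no leader can be elected.
public class QuorumSketch {
    static int quorum(int members) {
        return members / 2 + 1; // strict majority
    }

    public static void main(String[] args) {
        int members = 4; // 3 original replicas plus the newly joining member
        int available = members - 1 /* coordinator down */ - 1 /* joiner waiting */;
        boolean canElectLeader = available >= quorum(members);
        System.out.printf("quorum=%d available=%d canElectLeader=%b%n",
                quorum(members), available, canElectLeader);
        // prints: quorum=3 available=2 canElectLeader=false
    }
}
```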


We want to improve this behavior in the future by responding to join requests earlier.
Currently, a response is only sent after leaving the joint consensus phase.
To reduce the likelihood that the newly joining member is unavailable because it waits for a response, we can send the join response earlier, right after entering the joint consensus phase.
Member

❓ But this can still end up in the same situation, right? Is there a way to completely avoid this?

Is this only problematic because of the dataloss, or would it also happen on normal restarts during scaling? 🤔

Member Author

Yeah, it's not completely avoided, just the chance is reduced. Avoiding it completely is too tricky for now, but it's also not critical, since we can assume that the missing broker will eventually come back.
The problem is not the dataloss itself, it's just the unavailability.
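
A rough sketch of the improvement discussed above (hypothetical, not actual Zeebe code): the join request is acknowledged right after entering the joint consensus phase instead of only after leaving it, which reduces the window in which the joining member waits on a leader that may have become unavailable.

```java
// Hypothetical sketch, not actual Zeebe code: acknowledge the join request as
// soon as the joint consensus phase is entered, rather than only after it has
// been left, so the joining member is less likely to stay blocked waiting for
// a response from a leader that has gone down.
class JoinHandler {
    interface Responder {
        void sendJoinResponse();
    }

    void handleJoin(Responder responder) {
        enterJointConsensus();
        // Proposed: respond here, right after entering the joint consensus phase.
        responder.sendJoinResponse();

        leaveJointConsensus();
        // Current behaviour described in the post: the response is only sent at
        // this point, after the joint consensus phase has been left again.
    }

    private void enterJointConsensus() { /* commit the joint configuration */ }
    private void leaveJointConsensus() { /* commit the final configuration */ }
}
```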

@lenaschoenburg (Member Author)

I've adjusted a few things and tried to incorporate your suggestions. I think this is good enough to merge now.

@lenaschoenburg merged commit d785101 into main on Dec 22, 2023
@lenaschoenburg deleted the os/dynamic-scaling-dataloss branch on December 22, 2023 14:03
Successfully merging this pull request may close these issues.

Hypothesis: Broker scaling survives dataloss on coordinator node