Blog: Dynamic Scaling with dataloss #468
Conversation
Debatable if we want to merge this as is; we need to repeat the experiment eventually.
I need to check tomorrow; on my phone it is too small :D But yes, we should publish even negative results 👍 Great that you did this 🚀😀
Thanks @oleschoenburg, similar remarks as on the previous post: I'm also not 100% sure whether it is completely done; I'm missing a result statement or an explanation of the result. :)
(Six review comments on chaos-days/blog/2023-12-18-Dynamic-Scaling-with-Dataloss/index.md were marked as outdated and resolved.)
Force-pushed from f7c6f79 to 2214226.
@Zelldon I've updated the post significantly and described the edge case we ran into. Not sure if you want to take another look or not :)
I will look at it tomorrow 👍
Thanks @oleschoenburg, cool stuff! Happy that we even found something to improve. I have some follow-up questions :)
One goal is to verify that we haven't accidentally introduced a single point of failure in the cluster.
Another is to ensure that data loss does not corrupt the cluster topology.
👍🏼

## Dataloss on the Coordinator

Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
🔧 Suggestion:
- Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
+ Zeebe uses Broker 0 by default as the coordinator for changes to the cluster topology.
It's not configurable yet
After starting the operation with `zbchaos cluster scale --brokers 6` we see that the operation has started.
We then trigger dataloss on the coordinator with `zbchaos broker dataloss delete --nodeId 0`.
🔧 How do you see this? Metrics/logs, something you want to share here?
I didn't take screenshots for that part unfortunately. I think it's okay though - the interesting bit is what happens after.
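For reference, the two zbchaos commands from the quoted text, run back to back (output omitted):

```sh
# Start scaling the cluster to 6 brokers, then trigger dataloss on the coordinator (Broker 0).
zbchaos cluster scale --brokers 6
zbchaos broker dataloss delete --nodeId 0
```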
With 4 members, the quorum is 3, meaning that the partition can only elect a leader and process if at least 3 members are available.
In our experiment, we made the coordinator unavailable, so we were already down to 3 members.
Additionally, the newly joining member did not start yet because it was waiting for a successful join response from the leader.
The newly joining member never received such a response because we took down the coordinator which was the leader of the partition.
❓ Based on the screenshot, Zeebe-1 was the leader? Or is it always the case that the coordinator will be the leader for the scaling?
Ah, good catch 👍 The coordinator should have been the leader if everything was balanced. I'll remove this sentence; it's not really relevant.
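As a quick sanity check of the quorum arithmetic quoted above, a minimal sketch (illustrative only, not Zeebe code; the helper name is made up):

```java
// Illustrative only -- not Zeebe code.
final class QuorumSketch {

  // Raft-style majority quorum: floor(n / 2) + 1.
  static int quorum(int members) {
    return members / 2 + 1;
  }

  public static void main(String[] args) {
    System.out.println(quorum(3)); // 2
    System.out.println(quorum(4)); // 3: with the coordinator down and the joining member
                                   // not yet started, only 2 of the 4 members are available,
                                   // which is below the quorum of 3.
  }
}
```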
We want to improve this behavior in the future by allowing earlier responses to join requests.
Currently, a response is only sent after leaving the joint consensus phase.
To reduce the likelihood that the newly joining member is unavailable because it waits for a response, we can send the join response earlier, right after entering the joint consensus phase.
❓ But this can still end up in the same situation, right? Is there a way to completely avoid this?
Is this only so problematic because of the dataloss, or would it also happen on normal restarts during scaling? 🤔
Yeah, it's not completely avoided; the chance is just reduced. Avoiding it completely is too tricky for now, but it's also not critical since we can assume that the missing broker will eventually come back.
The problem is not dataloss, it's just unavailability.
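To make the discussed ordering concrete, here is a minimal sketch of a joint-consensus join flow; all names are hypothetical and this is not Zeebe's actual membership-change code:

```java
// Purely illustrative sketch of the response timing discussed above; names are hypothetical.
final class JoinFlowSketch {

  record JoinRequest(String newMemberId) {}

  enum JoinStatus { ACCEPTED }

  void handleJoinRequest(JoinRequest request) {
    enterJointConsensus(request.newMemberId());
    // Proposed: respond right after entering the joint consensus phase, so the joining
    // member is less likely to stay blocked if the leader becomes unavailable later.
    sendJoinResponse(request, JoinStatus.ACCEPTED);
    leaveJointConsensus();
    // Current behavior (as described in the quoted text): the response is only sent here,
    // after the joint consensus phase has been left.
  }

  void enterJointConsensus(String newMemberId) {
    // Replicate the joint configuration that includes the new member.
  }

  void leaveJointConsensus() {
    // Commit the final configuration, ending the joint consensus phase.
  }

  void sendJoinResponse(JoinRequest request, JoinStatus status) {
    // Reply to the joining member so it can start up.
  }
}
```

As noted in the reply above, moving the response earlier only shrinks the window; it does not eliminate it.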
I've adjusted a few things and tried to incorporate your suggestions. I think this is good enough to merge now.
Closes #466