
Blog: Dynamic Scaling with dataloss #468

Merged: 2 commits merged into main on Dec 22, 2023

Conversation

lenaschoenburg (Member)

Closes #466

@lenaschoenburg (Member Author)

Debatable if we want to merge this as is; we need to repeat the experiment eventually.

@ChrisKujawa (Member)

Need to check tomorrow, on my phone it is too small :D But yes, we should publish even negative results 👍

Great that you did this 🚀😀

Base automatically changed from os/support-dynamic-scaling to main December 19, 2023 08:39

@ChrisKujawa (Member) left a comment

Thanks @oleschoenburg, similar remarks as on the previous post. I'm also not 100% sure whether it is completely done; I'm missing a result statement or an explanation of the result. :)

@lenaschoenburg force-pushed the os/dynamic-scaling-dataloss branch from f7c6f79 to 2214226 on December 20, 2023 17:14
@lenaschoenburg (Member Author)

@Zelldon I've updated the post significantly and described the edge case we ran into. Not sure if you want to take another look or not :)

@ChrisKujawa (Member)

I will look at it tomorrow 👍

@ChrisKujawa (Member) left a comment

Thanks @oleschoenburg, cool stuff! Happy that we even found something to improve. I had some follow-up questions :)

Comment on lines +18 to +19
One goal is to verify that we haven't accidentally introduced a single point of failure in the cluster.
Another is to ensure that data loss does not corrupt the cluster topology.
Member

👍🏼


## Dataloss on the Coordinator

Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
Member

🔧

Suggested change:
- Zeebe uses Broker 0 as the coordinator for changes to the cluster topology.
+ Zeebe uses Broker 0 by default as the coordinator for changes to the cluster topology.

Member Author

It's not configurable yet

Comment on lines 48 to 49
After starting the operation with `zbchaos cluster scale --brokers 6` we see that the operation has started.
We then trigger dataloss on the coordinator with `zbchaos broker dataloss delete --nodeId 0`.
Member

🔧 How do you see this? Metrics/logs, something you want to share here?

Member Author

I didn't take screenshots for that part unfortunately. I think it's okay though - the interesting bit is what happens after.

With 4 members, the quorum is 3, meaning that the partition can only elect a leader and process if at least 3 members are available.
In our experiment, we made the coordinator unavailable, so we were already down to 3 members.
Additionally, the newly joining member had not started yet because it was waiting for a successful join response from the leader.
The newly joining member never received such a response because we took down the coordinator which was the leader of the partition.
Member

❓ Based on the screenshot, Zeebe-1 was the leader? Or is it always the case that the coordinator will be the leader during scaling?

Member Author

Ah, good catch 👍 The coordinator should have been the leader if everything was balanced. I'll remove this sentence; it's not really relevant.
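
A minimal sketch of the quorum arithmetic described in the excerpt above (hypothetical code, not taken from Zeebe):

```java
// Hypothetical sketch, not Zeebe code: the majority-quorum arithmetic from the
// excerpt above. With 4 members the quorum is 3; losing the coordinator and
// having the joining member wait for its join response leaves only 2 members,
// which is below the quorum, so no leader can be elected.
public class QuorumSketch {
    static int quorum(int members) {
        return members / 2 + 1; // strict majority
    }

    public static void main(String[] args) {
        int members = 4; // 3 original replicas plus the newly joining member
        int available = members - 1 /* coordinator down */ - 1 /* joiner waiting */;
        boolean canElectLeader = available >= quorum(members);
        System.out.printf("quorum=%d available=%d canElectLeader=%b%n",
                quorum(members), available, canElectLeader);
        // prints: quorum=3 available=2 canElectLeader=false
    }
}
```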


We want to improve this behavior in the future by responding to join requests earlier.
Currently, a response is only sent after leaving the joint consensus phase.
To reduce the likelihood that the newly joining member is unavailable because it waits for a response, we can send the join response earlier, right after entering the joint consensus phase.
Member

❓ But this can still end up in the same situation, right? Is there a way to completely avoid this?

Is this only problematic because of the dataloss, or would it also happen on normal restarts during scaling? 🤔

Member Author

Yeah, it's not completely avoided, just the chance is reduced. Avoiding it completely is too tricky for now, but it's also not critical, since we can assume that the missing broker will eventually come back.
The problem is not the dataloss itself, it's just the unavailability.
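
A rough sketch of the improvement discussed above (hypothetical, not actual Zeebe code): the join request is acknowledged right after entering the joint consensus phase instead of only after leaving it, which reduces the window in which the joining member waits on a leader that may have become unavailable.

```java
// Hypothetical sketch, not actual Zeebe code: acknowledge the join request as
// soon as the joint consensus phase is entered, rather than only after it has
// been left, so the joining member is less likely to stay blocked waiting for
// a response from a leader that has gone down.
class JoinHandler {
    interface Responder {
        void sendJoinResponse();
    }

    void handleJoin(Responder responder) {
        enterJointConsensus();
        // Proposed: respond here, right after entering the joint consensus phase.
        responder.sendJoinResponse();

        leaveJointConsensus();
        // Current behaviour described in the post: the response is only sent at
        // this point, after the joint consensus phase has been left again.
    }

    private void enterJointConsensus() { /* commit the joint configuration */ }
    private void leaveJointConsensus() { /* commit the final configuration */ }
}
```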

@lenaschoenburg (Member Author)

I've adjusted a few things and tried to incorporate your suggestions. I think this is good enough to merge now.

@lenaschoenburg merged commit d785101 into main on Dec 22, 2023
@lenaschoenburg deleted the os/dynamic-scaling-dataloss branch on December 22, 2023 14:03
Successfully merging this pull request may close these issues.

Hypothesis: Broker scaling survives dataloss on coordinator node