k8s restart can fail while job being canceled #153

Open
johnbradley opened this issue Feb 7, 2019 · 1 comment

Comments

@johnbradley
Collaborator

If a user cancels a job and then immediately restarts it before the k8s cluster has finished cleaning up the various PVCs/ConfigMaps/Jobs, the restart will error.
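As an illustration only (not lando's actual code), the failure mode looks like this: re-creating a PVC with the same name while the old one is still Terminating is rejected by the Kubernetes API with a 409 Conflict. The PVC name and namespace below are hypothetical.

```python
# Minimal sketch, assuming the kubernetes Python client and hypothetical
# resource names; not taken from lando itself.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="job-41-volume"),  # hypothetical name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)

try:
    # If the previous PVC with this name is still Terminating, the API
    # rejects the create with a 409 Conflict ("already exists").
    core.create_namespaced_persistent_volume_claim(namespace="bespin-jobs", body=pvc)
except ApiException as e:
    if e.status == 409:
        print("restart failed: old PVC not yet cleaned up")
    else:
        raise
```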

@johnbradley johnbradley added this to the bespin-k8s milestone Mar 7, 2019
@dleehr
Member

dleehr commented Mar 7, 2019

As described, canceling a job through bespin-api sends a message through the queue to lando, instructing it to cancel the job and tear down kubernetes resources.

lando marks the job as CANCELED as soon as it has made those API calls to kubernetes, but the resources may not be deleted immediately. Because the job enters the CANCELED state too early, the user may attempt to restart it right away. The restart will then fail because lando tries to create new resources with the same names, and kubernetes rejects those name conflicts.
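One possible fix, sketched here under the assumption that lando uses the kubernetes Python client and with hypothetical names and timeouts: after issuing the delete calls, poll until the resources are actually gone (the read call returns 404) before flipping the job to CANCELED.

```python
# Sketch only: wait for a PVC to actually disappear after delete before
# marking the job CANCELED. Names, namespace, and timeouts are assumptions.
import time
from kubernetes import client
from kubernetes.client.rest import ApiException

def wait_for_pvc_deleted(core: client.CoreV1Api, name: str, namespace: str,
                         timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Return True once the PVC is gone (read returns 404), False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            core.read_namespaced_persistent_volume_claim(name, namespace)
        except ApiException as e:
            if e.status == 404:
                return True  # fully deleted; now safe to mark the job CANCELED
            raise
        time.sleep(interval)
    return False
```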

This is somewhat of an edge case and is recoverable: if the user simply waits a few minutes for the cluster state to settle, the job can be restarted. So it's not a high-priority issue, but there is some room to model the behavior better.
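Another way to model it, again only a sketch with assumed names: have the restart path treat a 409 Conflict as "cleanup still in progress" and retry the create for a bounded time instead of failing immediately.

```python
# Sketch: retry resource creation on 409 Conflict while resources from the
# canceled run are still being torn down. All names below are assumptions.
import time
from kubernetes.client.rest import ApiException

def create_with_retry(create_fn, retries: int = 30, interval: float = 10.0):
    """Call create_fn(); on 409 Conflict, wait and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return create_fn()
        except ApiException as e:
            if e.status != 409:
                raise
            time.sleep(interval)  # old resource still Terminating; try again
    raise RuntimeError("resource name still in use after retries")

# Hypothetical usage: wrap the PVC create from the restart path.
# create_with_retry(lambda: core.create_namespaced_persistent_volume_claim(
#     namespace="bespin-jobs", body=pvc))
```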

@johnbradley johnbradley removed this from the bespin-k8s milestone Mar 7, 2019