k8s restart can fail while job being canceled #153

Open
johnbradley opened this issue Feb 7, 2019 · 1 comment

Comments

@johnbradley
Collaborator

If a user cancels a job and then immediately restarts it before the k8s cluster has finished cleaning up the various PVCs/ConfigMaps/Jobs, the restart will error.
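As an illustration only (not lando's actual code), the failure mode looks like this: re-creating a PVC with the same name while the old one is still Terminating is rejected by the Kubernetes API with a 409 Conflict. The PVC name and namespace below are hypothetical.

```python
# Minimal sketch, assuming the kubernetes Python client and hypothetical
# resource names; not taken from lando itself.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="job-41-volume"),  # hypothetical name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)

try:
    # If the previous PVC with this name is still Terminating, the API
    # rejects the create with a 409 Conflict ("already exists").
    core.create_namespaced_persistent_volume_claim(namespace="bespin-jobs", body=pvc)
except ApiException as e:
    if e.status == 409:
        print("restart failed: old PVC not yet cleaned up")
    else:
        raise
```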

@johnbradley johnbradley added this to the bespin-k8s milestone Mar 7, 2019
@dleehr
Member

dleehr commented Mar 7, 2019

As described, canceling a job through bespin-api sends a message through the queue to lando, instructing it to cancel the job and tear down kubernetes resources.

lando marks the job as CANCELED as soon as it has made those API calls to kubernetes, but the resources may not be deleted immediately. Because the job enters the CANCELED state too early, the user may attempt to restart it right away. The restart will then fail because lando tries to create new resources with the same names, and kubernetes rejects those name conflicts.
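One possible fix, sketched here under the assumption that lando uses the kubernetes Python client and with hypothetical names and timeouts: after issuing the delete calls, poll until the resources are actually gone (the read call returns 404) before flipping the job to CANCELED.

```python
# Sketch only: wait for a PVC to actually disappear after delete before
# marking the job CANCELED. Names, namespace, and timeouts are assumptions.
import time
from kubernetes import client
from kubernetes.client.rest import ApiException

def wait_for_pvc_deleted(core: client.CoreV1Api, name: str, namespace: str,
                         timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Return True once the PVC is gone (read returns 404), False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            core.read_namespaced_persistent_volume_claim(name, namespace)
        except ApiException as e:
            if e.status == 404:
                return True  # fully deleted; now safe to mark the job CANCELED
            raise
        time.sleep(interval)
    return False
```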

This is somewhat of an edge case and is recoverable: if the user simply waits a few minutes for the cluster state to settle, the job can be restarted. So it's not a high-priority issue, but there is some room to model the behavior better.
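Another way to model it, again only a sketch with assumed names: have the restart path treat a 409 Conflict as "cleanup still in progress" and retry the create for a bounded time instead of failing immediately.

```python
# Sketch: retry resource creation on 409 Conflict while resources from the
# canceled run are still being torn down. All names below are assumptions.
import time
from kubernetes.client.rest import ApiException

def create_with_retry(create_fn, retries: int = 30, interval: float = 10.0):
    """Call create_fn(); on 409 Conflict, wait and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return create_fn()
        except ApiException as e:
            if e.status != 409:
                raise
            time.sleep(interval)  # old resource still Terminating; try again
    raise RuntimeError("resource name still in use after retries")

# Hypothetical usage: wrap the PVC create from the restart path.
# create_with_retry(lambda: core.create_namespaced_persistent_volume_claim(
#     namespace="bespin-jobs", body=pvc))
```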

@johnbradley johnbradley removed this from the bespin-k8s milestone Mar 7, 2019