[Ready for Review] Fix issue where resuming on successful run will fail. #1956
Conversation
Let's also add a test so we make sure this is caught in the future (using the core tests maybe).
Also (I commented on this elsewhere as well), at a high level I think we can simplify the runtime code to remove all resume operations from places outside the clone functions. We currently update some state when things are resuming that I think we could simplify and get rid of.
metaflow/runtime.py
Outdated
"rerun" steps. This is to ensure that the flow is executed correctly. | ||
""" | ||
for step_name in self._graph.sorted_nodes: | ||
if step_name in self._rerun_steps: |
We probably need to run this until stability, right?
Not sure what you mean by "stability" here - the sorted nodes are already in topological order.
Ah, yes, fair point. Add a small comment to remind readers that they are topologically sorted. We should almost be able to use a boolean flag instead of the double loop, but there are annoying corner cases.
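For illustration, a minimal sketch of that single-pass idea (hypothetical names, not the actual runtime code): because the nodes arrive in topological order, every parent of a step has already been processed by the time we reach it, so one pass suffices and no fixed-point iteration ("run until stability") is needed.

```python
# Hypothetical sketch: expand the set of steps to rerun in a single pass
# over topologically sorted step names. parents_of maps a step name to
# the names of its upstream steps.
def expand_steps_to_rerun(sorted_nodes, parents_of, requested):
    marked = set(requested)
    for step_name in sorted_nodes:
        # All parents of step_name were already visited, so a single
        # membership check against the marked set is enough.
        if any(parent in marked for parent in parents_of.get(step_name, ())):
            marked.add(step_name)
    return marked
```

A single global boolean flag would only work for a strictly linear graph; with branches and joins you need the per-step parent check, which is presumably one of the corner cases mentioned above.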
     def _new_task(self, step, input_paths=None, **kwargs):
         if input_paths is None:
             may_clone = True
         else:
             may_clone = all(self._is_cloned[path] for path in input_paths)

-        if step in self._clone_steps:
+        if step in self._rerun_steps:
             may_clone = False
In general, I think with this new approach we can clean out all the may_clone, etc. flags that are being set. They may be confusing because I don't think they are used anymore, so removing them would simplify the code a bit and make it clearer that all the resume logic is now in one place (the clone functions) as opposed to scattered around the runtime code.
I agree, but the code refactor should be in a different PR (I want this fix PR to be concise).
OK. Let's do a separate PR to clean that up then.
left a few comments - mostly for me to understand the nature of the issue here
metaflow/cli.py
Outdated
@@ -650,7 +650,7 @@ def resume(
     )

     if step_to_rerun is None:
-        clone_steps = set()
+        rerun_steps = set()
Minor nit - could we do `steps_to_rerun`? It reads a bit better.
done
metaflow/runtime.py
Outdated
        self._cloned_tasks = []
        self._cloned_task_index = set()
        self._reentrant = reentrant
        self._run_url = None

        # If rerun_steps is specified, we will not clone them in resume mode.
        self._rerun_steps = {} if rerun_steps is None else rerun_steps
`self._steps_to_rerun = steps_to_rerun or {}`
done
metaflow/runtime.py
Outdated
    def _update_rerun_steps(self):
        """
        Any steps following steps to be rerun should also be included as
        "rerun" steps. This is to ensure that the flow is executed correctly.
Maybe I am not following the comment here - but the rerun steps are the ones that need to be executed during resume? If that is the case, why is this update needed?
For a linear graph a -> b -> c -> d (where b and c succeeded), when the user resumes b we want b and its successor (which is c) to rerun as well. This update function includes all the descendants (b, c, d). In the old resume, I think after handling this step-to-rerun it goes back to normal execution mode. In the new resume, we need to avoid copying the successful successor tasks (c here) as well, and hence we update this "steps-to-rerun" to rerun everything after b. Note that we won't copy d anyway, because d failed in the first run.
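To make that concrete, a tiny standalone snippet (plain Python, not the runtime code) expanding the rerun set for the linear graph a -> b -> c -> d when the user resumes b:

```python
# Toy linear graph, expressed as step -> downstream steps.
children_of = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
topo_order = ["a", "b", "c", "d"]

steps_to_rerun = {"b"}               # user ran: resume b
for step in topo_order:
    if step in steps_to_rerun:
        steps_to_rerun.update(children_of[step])

print(sorted(steps_to_rerun))        # ['b', 'c', 'd']
```

So b and c are excluded from cloning even though they succeeded in the original run, and d is rerun regardless because it never succeeded.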
I would recommend not creating a function and instead lifting this logic alongside L:119 so that the code reads a bit easier. Not a whole lot of complexity is embedded in this method.
done
RESUME = True
# resuming on a successful step.
RESUME_STEP = "start"
it might be better to resume a "non-start" step in the test so that we can confirm we are able to skip steps successfully.
In that scenario, we can perhaps test in the resumed end step for the existence of an artifact generated in the start step.
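As an illustration of that idea, a minimal standalone flow (a plain FlowSpec sketch, not the actual core-test harness; the step and artifact names are made up) in which the end step checks an artifact produced by the start step:

```python
from metaflow import FlowSpec, step


class ResumeArtifactFlow(FlowSpec):
    @step
    def start(self):
        # Artifact that must still be visible after 'start' is cloned
        # (rather than rerun) during resume.
        self.marker = "set-in-start"
        self.next(self.middle)

    @step
    def middle(self):
        self.next(self.end)

    @step
    def end(self):
        # If the resumed run cloned 'start' correctly, the artifact
        # produced there is still available here.
        assert self.marker == "set-in-start"


if __name__ == "__main__":
    ResumeArtifactFlow()
```

In the core tests this would presumably be driven by the `RESUME`/`RESUME_STEP` settings shown above rather than run by hand.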
done
Added a tiny window-dressing PR related to this in #1963.
When running resume with `python flow.py resume {step_name}`, the user may want to rerun that step and continue with the execution. In the current code, we blindly copy everything as long as the step was successful. The proposed change looks at steps in topological order and skips cloning the ones on and after the specified steps (and then reruns everything where possible).
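A rough sketch of that cloning decision, assuming the steps-to-rerun set has already been expanded to include everything downstream of the specified steps (illustrative names, not the exact runtime code):

```python
# Hypothetical: decide whether a task can be cloned from the original run.
# A step is cloned only if it is not marked for rerun and every task it
# depends on was itself cloned (otherwise its inputs may have changed).
def may_clone_task(step, input_steps, steps_to_rerun, cloned_steps):
    if step in steps_to_rerun:
        return False
    return all(parent in cloned_steps for parent in input_steps)
```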