Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ready for Review] Fix issue where resuming on successful run will fail. #1956

Merged
merged 4 commits into from
Aug 19, 2024

Conversation

darinyu
Copy link
Collaborator

@darinyu darinyu commented Aug 12, 2024

when running resume with python flow.py resume {step_name}, user may want to rerun the "step" and continue with the execution. In current code, we will blindly copy everything as long as the step was successful.

The proposed change will look at step in topological order and skip cloning the ones on and after the specified steps (and then rerun everything if possible).

Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a test so we make sure this is caught in the future (using the core tests maybe).

Also, I commented as well somewhere but high level, I think we can simplify the runtime code to remove all resume operations from places outside the clone functions. We currently update some state when things are resuming that I think we could simplify and get rid of.

"rerun" steps. This is to ensure that the flow is executed correctly.
"""
for step_name in self._graph.sorted_nodes:
if step_name in self._rerun_steps:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to run this until stability right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean "stability" here, the sorted nodes are in topological order already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, fair point. Add a small comment to remind they are topologically sorted. We should almost be able to do a boolean flag instead of the double loop but there are annoying corner cases.

def _new_task(self, step, input_paths=None, **kwargs):
if input_paths is None:
may_clone = True
else:
may_clone = all(self._is_cloned[path] for path in input_paths)

if step in self._clone_steps:
if step in self._rerun_steps:
may_clone = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think with this new way, we can clean out all the may_clone, etc flags taht are being set. I think it may be confusing because I don't think they are used anymore so that would simplify the code a bit and make it clearer that all the resume logic is now in one place (the clone functions) as opposed to scattered around the runtime code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but the code refactor should be in a different PR (I want this fix PR to be concise).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's do a separate PR to clean that up then.

Copy link
Collaborator

@savingoyal savingoyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments - mostly for me to understand the nature of the issue here

metaflow/cli.py Outdated
@@ -650,7 +650,7 @@ def resume(
)

if step_to_rerun is None:
clone_steps = set()
rerun_steps = set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit - could we do steps_to_rerun - reads a bit better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

self._cloned_tasks = []
self._cloned_task_index = set()
self._reentrant = reentrant
self._run_url = None

# If rerun_steps is specified, we will not clone them in resume mode.
self._rerun_steps = {} if rerun_steps is None else rerun_steps
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._steps_to_rerun = steps_to_rerun or {}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def _update_rerun_steps(self):
"""
Any steps following steps to be rerun should also be included as
"rerun" steps. This is to ensure that the flow is executed correctly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i am not following the comment here - but rerun steps are the ones that need to be executed during resume? if that is the case, why is this update needed?

Copy link
Collaborator Author

@darinyu darinyu Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a linear graph a -> b -> c -> d (and b, c succeeded) , when user resume b, we want b and the successor of b (which is c) to rerun as well. Here this update function will include all the descendants (b,c,d).

In old resume, I think after this step-to-rerun, it goes back to normal execution mode.

In the new resume, we need to avoid copying the successful successor tasks (c here) as well, and hence update this "steps-to-rerun" to rerun everything after b. note that we won't copy d because d failed in first run.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend not creating a function would just lifting this logic alongside L:119 so that the code reads a bit easy. Not a whole lot of complexity is embedded in this method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def _update_rerun_steps(self):
"""
Any steps following steps to be rerun should also be included as
"rerun" steps. This is to ensure that the flow is executed correctly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend not creating a function would just lifting this logic alongside L:119 so that the code reads a bit easy. Not a whole lot of complexity is embedded in this method.


RESUME = True
# resuming on a successful step.
RESUME_STEP = "start"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be better to resume a "non-start" step in the test so that we can confirm we are able to skip steps successfully.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that scenario, we can perhaps test for the existence of an artifact generated in the start step in the resumed end step.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@savingoyal
Copy link
Collaborator

added a tiny window dressing PR related to this in #1963

@darinyu darinyu requested a review from savingoyal August 17, 2024 16:53
@savingoyal savingoyal merged commit 9738a87 into master Aug 19, 2024
26 checks passed
@savingoyal savingoyal deleted the fix_resume_on_successful_run branch August 19, 2024 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants