[Ready for Review] Fix issue where resuming on successful run will fail. #1956
Conversation
Let's also add a test so we make sure this is caught in the future (using the core tests maybe).
Also (I commented on this elsewhere as well), at a high level I think we can simplify the runtime code to remove all resume operations from places outside the clone functions. We currently update some state when things are resuming that I think we could simplify and get rid of.
metaflow/runtime.py
Outdated
"rerun" steps. This is to ensure that the flow is executed correctly. | ||
""" | ||
for step_name in self._graph.sorted_nodes: | ||
if step_name in self._rerun_steps: |
We probably need to run this until stability, right?
Not sure what you mean by "stability" here - the sorted nodes are already in topological order.
Ah, yes, fair point. Add a small comment to remind readers that they are topologically sorted. We should almost be able to use a boolean flag instead of the double loop, but there are annoying corner cases.
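For illustration, a minimal sketch of that single-pass idea (hypothetical names, not the actual runtime code): because the nodes arrive in topological order, every parent of a step has already been processed by the time we reach it, so one pass suffices and no fixed-point iteration ("run until stability") is needed.

```python
# Hypothetical sketch: expand the set of steps to rerun in a single pass
# over topologically sorted step names. parents_of maps a step name to
# the names of its upstream steps.
def expand_steps_to_rerun(sorted_nodes, parents_of, requested):
    marked = set(requested)
    for step_name in sorted_nodes:
        # All parents of step_name were already visited, so a single
        # membership check against the marked set is enough.
        if any(parent in marked for parent in parents_of.get(step_name, ())):
            marked.add(step_name)
    return marked
```

A single global boolean flag would only work for a strictly linear graph; with branches and joins you need the per-step parent check, which is presumably one of the corner cases mentioned above.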
     def _new_task(self, step, input_paths=None, **kwargs):
         if input_paths is None:
             may_clone = True
         else:
             may_clone = all(self._is_cloned[path] for path in input_paths)

-        if step in self._clone_steps:
+        if step in self._rerun_steps:
             may_clone = False
In general, I think with this new approach we can clean out all the may_clone, etc. flags that are being set. They may be confusing because I don't think they are used anymore, so removing them would simplify the code a bit and make it clearer that all the resume logic is now in one place (the clone functions) as opposed to scattered around the runtime code.
I agree, but the code refactor should be in a different PR (I want this fix PR to be concise).
OK. Let's do a separate PR to clean that up then.
left a few comments - mostly for me to understand the nature of the issue here
metaflow/cli.py
Outdated
@@ -650,7 +650,7 @@ def resume(
     )

     if step_to_rerun is None:
-        clone_steps = set()
+        rerun_steps = set()
Minor nit - could we do `steps_to_rerun`? It reads a bit better.
done
metaflow/runtime.py
Outdated
        self._cloned_tasks = []
        self._cloned_task_index = set()
        self._reentrant = reentrant
        self._run_url = None

        # If rerun_steps is specified, we will not clone them in resume mode.
        self._rerun_steps = {} if rerun_steps is None else rerun_steps
`self._steps_to_rerun = steps_to_rerun or {}`
done
metaflow/runtime.py
Outdated
    def _update_rerun_steps(self):
        """
        Any steps following steps to be rerun should also be included as
        "rerun" steps. This is to ensure that the flow is executed correctly.
Maybe I am not following the comment here - but the rerun steps are the ones that need to be executed during resume? If that is the case, why is this update needed?
For a linear graph a -> b -> c -> d (where b and c succeeded), when the user resumes b we want b and its successor (which is c) to rerun as well. This update function includes all the descendants (b, c, d). In the old resume, I think after handling this step-to-rerun it goes back to normal execution mode. In the new resume, we need to avoid copying the successful successor tasks (c here) as well, and hence we update this "steps-to-rerun" to rerun everything after b. Note that we won't copy d anyway, because d failed in the first run.
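To make that concrete, a tiny standalone snippet (plain Python, not the runtime code) expanding the rerun set for the linear graph a -> b -> c -> d when the user resumes b:

```python
# Toy linear graph, expressed as step -> downstream steps.
children_of = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
topo_order = ["a", "b", "c", "d"]

steps_to_rerun = {"b"}               # user ran: resume b
for step in topo_order:
    if step in steps_to_rerun:
        steps_to_rerun.update(children_of[step])

print(sorted(steps_to_rerun))        # ['b', 'c', 'd']
```

So b and c are excluded from cloning even though they succeeded in the original run, and d is rerun regardless because it never succeeded.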
I would recommend not creating a function and instead lifting this logic alongside L:119 so that the code reads a bit easier. Not a whole lot of complexity is embedded in this method.
done
RESUME = True
# resuming on a successful step.
RESUME_STEP = "start"
it might be better to resume a "non-start" step in the test so that we can confirm we are able to skip steps successfully.
In that scenario, we can perhaps test in the resumed end step for the existence of an artifact generated in the start step.
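As an illustration of that idea, a minimal standalone flow (a plain FlowSpec sketch, not the actual core-test harness; the step and artifact names are made up) in which the end step checks an artifact produced by the start step:

```python
from metaflow import FlowSpec, step


class ResumeArtifactFlow(FlowSpec):
    @step
    def start(self):
        # Artifact that must still be visible after 'start' is cloned
        # (rather than rerun) during resume.
        self.marker = "set-in-start"
        self.next(self.middle)

    @step
    def middle(self):
        self.next(self.end)

    @step
    def end(self):
        # If the resumed run cloned 'start' correctly, the artifact
        # produced there is still available here.
        assert self.marker == "set-in-start"


if __name__ == "__main__":
    ResumeArtifactFlow()
```

In the core tests this would presumably be driven by the `RESUME`/`RESUME_STEP` settings shown above rather than run by hand.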
done
Added a tiny window-dressing PR related to this in #1963.
When running resume with `python flow.py resume {step_name}`, the user may want to rerun that step and continue with the execution. In the current code, we blindly copy everything as long as the step was successful. The proposed change looks at steps in topological order and skips cloning the ones on and after the specified steps (and then reruns everything where possible).
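A rough sketch of that cloning decision, assuming the steps-to-rerun set has already been expanded to include everything downstream of the specified steps (illustrative names, not the exact runtime code):

```python
# Hypothetical: decide whether a task can be cloned from the original run.
# A step is cloned only if it is not marked for rerun and every task it
# depends on was itself cloned (otherwise its inputs may have changed).
def may_clone_task(step, input_steps, steps_to_rerun, cloned_steps):
    if step in steps_to_rerun:
        return False
    return all(parent in cloned_steps for parent in input_steps)
```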