Skip to content

Suspend/Resume feature on SparkApplication #2387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

everpeace
Copy link

@everpeace everpeace commented Jan 17, 2025

Purpose of this PR

This PR implements Suspend/Resume operation on SparkApplication.

Proposed changes:

  • Add Suspend: true|false field in SparkApplicationSpec
  • State Transition:
    • Suspend: ** --Suspend: true--> SUSPENDING --> SUSPENDED (**=any non-Terminal Application State)
    • Resume: SUSPENDING | SUSPENDED --Suspend: false--> RESUMING --> SUBMITTED

Fixes #2290

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Supporting Suspend/Resume feature could make Kueue integration easier.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

@Kevinz857
Copy link
Contributor

@everpeace Maybe u can resolve this conflict so that this PR can continue to move forward?

we also think about use kueue with sparkoperator , depend this suspend state feature . It would be better if you could push further down ~

…cationStates(Suspending/Suspended/Resuming)

Signed-off-by: Shingo Omura <everpeace@gmail.com>
…r and its unit tests

Signed-off-by: Shingo Omura <everpeace@gmail.com>
Signed-off-by: Shingo Omura <everpeace@gmail.com>
Signed-off-by: Shingo Omura <everpeace@gmail.com>
Signed-off-by: Shingo Omura <everpeace@gmail.com>
…on loop.

Signed-off-by: Shingo Omura <everpeace@gmail.com>
@everpeace
Copy link
Author

@Kevinz857 Hi, thanks for your comment!! I rebaed my PR.

@everpeace
Copy link
Author

we also think about use kueue with sparkoperator , depend this suspend state feature . It would be better if you could push further down ~

My original SparkApplication integration PR in Kueue(kubernetes-sigs/kueue#4032) was actually closed because

However, I already opened a PR for this which will be needed to support SparkApplication integration in Kueue: kubernetes-sigs/kueue#4102.

@ChenYi015
Copy link
Contributor

/assign @jacobsalway @vara-bonthu @ImpSy

@ChenYi015
Copy link
Contributor

@Kevinz857 Would you mind help reviewing the PR?

@Kevinz857
Copy link
Contributor

@Kevinz857 Would you mind help reviewing the PR?

@ChenYi015 Very happy, I will review this PR together in detail later.

@Kevinz857
Copy link
Contributor

@everpeace Suggestions for Improvement

  1. Controller logic abstraction
    Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.
    Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

  2. Pod termination check reusability
    The logic that checks for Spark driver/executor pod termination before resuming could be encapsulated in a helper function (e.g. areSparkPodsTerminated(app)) to make the code more modular and easier to test.
    Optionally, adding retry limits or exponential backoff can help avoid indefinite retries in corner cases.

  3. Field documentation enhancement
    In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

// Suspend indicates whether the SparkApplication should be suspended.
// When true, the controller skips submitting the Spark job.

  1. Test enhancement ideas

Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.
Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.

@ChenYi015 If have time, we can take a look together to see if these suggestions are suitable

@everpeace Or if there are any better suggestions, we can discuss them together

Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jacobsalway. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@everpeace
Copy link
Author

everpeace commented Apr 30, 2025

@Kevinz857 Thanks for your review.

Consider extracting the suspension check (if app.Spec.Suspend != nil && *app.Spec.Suspend) into a helper function like isSuspended(app) for clarity and potential reuse.

hmmm. SparkApplicationSpec.Suspend is not a pointer. So, I can't see any motivation for it.

Suspend bool `json:"suspend"`

Perhaps, you saw ScheduledSparkApplication's suspension check:

if scheduledApp.Spec.Suspend != nil && *scheduledApp.Spec.Suspend {
return ctrl.Result{}, nil
}

Could you elaborate on your suggestion if I missed something?

Also, using a named constant for the event string ("SparkApplicationSuspended") would improve consistency and readability.

hmmm. I think I already did.

EventSparkApplicationSuspending = "SparkApplicationSuspending"
EventSparkApplicationSuspended = "SparkApplicationSuspended"
EventSparkApplicationResuming = "SparkApplicationResuming"
)

Could you elaborate on your suggestion if I missed something?

Field documentation enhancement
In sparkapplication_types.go, the Suspend field comment could be more descriptive, such as:

Thanks. Fixed in b3f8e8c

Test enhancement ideas
Add unit tests for the suspension logic to ensure the controller skips job submission when .spec.suspend is true.

I think it's already covered:

When("reconciling a new SparkApplication with Suspend=True", func() {
BeforeEach(func() {
app.Spec.Suspend = true
Expect(k8sClient.Create(ctx, app)).To(Succeed())
})
AfterEach(func() {

Could you elaborate on your suggestion if I missed something?

Consider validating whether the correct Kubernetes events (Suspended/Resumed) are emitted using a fake event recorder.

IIUC, the current SparkApplication unit tests(internal/controller/sparkapplication/controller_test.go) do not assert emitted events at all. Although I fully agree your suggestion(we should assert emitted events in unit tests), I think the enhancement should be done in another PR so that all the unit tests assert emitted events. Or, should this PR cover event assertions only for suspend/resume in this PR?? If so, I would be happy to handle it.

… make it more descriptive

Signed-off-by: Shingo Omura <everpeace@gmail.com>
@Kevinz857
Copy link
Contributor

@everpeace Thanks for your contribution !

@ChenYi015 I think the current submission can be merged. This commit will not have a destructive impact on the current core functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for suspend
6 participants