
Allow rolling upgrade during progressive rollout #12971


Open
yuzisun opened this issue May 27, 2022 · 18 comments · May be fixed by #15780
Labels
kind/feature Well-understood/specified features, ready for coding. triage/accepted Issues which should be fixed (post-triage)

Comments

@yuzisun

yuzisun commented May 27, 2022

Describe the feature

Currently, every time Knative rolls out a new revision and minReplica is set, it briefly requires 2x the resources, and only releases the resources used by the old revision after that revision is scaled down. This is a problem for platforms that enforce resource quotas, since users have to budget 2x the resources just to run the service, and it becomes an even bigger problem when the service runs on scarce GPU hardware.

With progressive rollout, do we still need to wait for all of the minReplicas to be ready before traffic starts migrating from the old revision to the new one? Can we do something like a rolling update, where once 20% of traffic has moved to the new revision we start scaling down the old revision accordingly, so that the rollout does not require 2x the resources?
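For context, a minimal sketch of the setup being described, assuming the standard `autoscaling.knative.dev/min-scale` annotation; the service name and image are placeholders. During a rollout, both the old and the new revision hold min-scale replicas at the same time, which is where the 2x requirement comes from.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gpu-model-server              # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Keep at least 10 replicas warm for steady-state traffic.
        autoscaling.knative.dev/min-scale: "10"
    spec:
      containers:
        - image: example.com/model:v2 # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"     # one GPU per replica
```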

@yuzisun yuzisun added the kind/feature Well-understood/specified features, ready for coding. label May 27, 2022
@psschwei
Contributor

Not sure if this exactly solves your issue, but it is possible to configure a gradual rollout.

I've seen a couple of issues come up around this topic (for example, #12551 #12859), so probably something we should look into.
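For reference, a sketch of how gradual rollout is configured today via the `rollout-duration` setting (either cluster-wide in the `config-network` ConfigMap or per service via an annotation); the name, image, and duration below are illustrative:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                    # placeholder name
  annotations:
    # Shift traffic from the previous revision to the new one
    # gradually over 380 seconds instead of all at once.
    serving.knative.dev/rollout-duration: "380s"
spec:
  template:
    spec:
      containers:
        - image: example.com/app:v2   # placeholder image
```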

@yuzisun
Author

yuzisun commented May 29, 2022

Unfortunately the current implementation of progressive rollout does not help when minReplica is set. For example, if you set minReplica to 10 with 1 GPU per replica, then you need 20 GPUs for the rollout, because the progressive rollout does not kick in until the minReplica requirement is fulfilled.

Checking the code here, the revision is marked active when pc.ready >= minReady, so traffic only starts to shift to the new revision after that. What we want is the behavior of a Kubernetes Deployment's rolling update: traffic can start shifting as soon as x% of minReplica is ready, and at the same time the old revision can be scaled down, so that we do not need 2x resources for every rollout.
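For comparison, this is the Deployment semantics being referenced: a rolling update bounded by maxSurge/maxUnavailable, so the total pod count stays close to the desired replica count instead of doubling (names and values below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                  # placeholder name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2                     # at most 2 pods above the desired 10
      maxUnavailable: 0               # never drop below 10 ready pods
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: example.com/model:v2 # placeholder image
```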

@dprotaso
Member

@yuzisun do you have examples of how you're configuring the traffic blocks - or is it latestRevision: true?
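For reference, "traffic blocks" here refers to the spec.traffic section of the Service, e.g. pinning percentages to named revisions rather than relying on `latestRevision: true`; the revision names below are placeholders:

```yaml
spec:
  traffic:
    - revisionName: my-service-00001  # placeholder revision names
      percent: 80
    - revisionName: my-service-00002
      percent: 20
    # versus the default behavior:
    # - latestRevision: true
    #   percent: 100
```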

With progressive rollout, do we still need to wait for all of the minReplicas to be ready before traffic starts migrating from the old revision to the new one?

Revisions won't be Ready until they reach their min-scale. I do wonder if this can be adjusted by adding the initial-scale annotation set to 1, which might mark the revision Ready earlier and cause traffic to shift.
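A sketch of that suggestion, using the existing `autoscaling.knative.dev/initial-scale` annotation alongside min-scale (as noted later in this thread, it did not change readiness in practice):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "10"
        # Idea: start the new revision at 1 replica so it could become
        # Ready (and start receiving traffic) before reaching min-scale.
        autoscaling.knative.dev/initial-scale: "1"
```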

The other thing to note is we 'drain' old pods so they can handle any outstanding requests - we also wait in case there are delays in the network programming.

Can we do something like a rolling update, where once 20% of traffic has moved to the new revision we start scaling down the old revision accordingly, so that the rollout does not require 2x the resources?

Our autoscaling system currently evaluates scale for revisions independently of each other. Coordination across revisions is currently not supported. Gradual rollout only works because we're manipulating traffic % at a higher level of abstraction.

What we want is the behavior of a Kubernetes Deployment's rolling update: traffic can start shifting as soon as x% of minReplica is ready, and at the same time the old revision can be scaled down, so that we do not need 2x resources for every rollout.

A legit question to ask: if you're not benefiting from scale to zero, and your clusters don't have the capacity to scale up new pods, maybe you should be using Deployments, since it seems like you want a fixed set of replicas?

@dprotaso
Member

/triage needs-user-input

@knative-prow knative-prow bot added the triage/needs-user-input Issues which are waiting on a response from the reporter label May 30, 2022
@yuzisun
Author

yuzisun commented May 31, 2022

@dprotaso scale-to-zero is not the main reason we use Knative; it is the revision-based rollout, which is not supported via Deployment. With a raw Deployment you can't easily stage traffic and do canary deployments; currently KServe's canary rollout implementation is based on Knative revisions. Also, it is not a fixed set of replicas: you can still leverage KPA with maxReplica > minReplica, where minReplica just ensures stable performance under normal traffic load.
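For illustration, the min/max bounds mentioned here map to the standard KPA annotations; the values are placeholders:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/min-scale: "10"   # stable baseline for normal load
        autoscaling.knative.dev/max-scale: "30"   # still scales up under bursts
```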

@yuzisun
Author

yuzisun commented Jun 10, 2022

@dprotaso I have tested out the initial-scale annotation; it does not mark the revision Ready earlier, since the larger of the initial scale and the lower bound is chosen as the initial target scale.

@github-actions

github-actions bot commented Sep 9, 2022

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2022
@yuzisun
Author

yuzisun commented Jan 7, 2023

/reopen

@knative-prow knative-prow bot reopened this Jan 7, 2023
@knative-prow

knative-prow bot commented Jan 7, 2023

@yuzisun: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2023
@ReToCode ReToCode removed the triage/needs-user-input Issues which are waiting on a response from the reporter label Mar 8, 2023
@rachitchauhan43

@yuzisun: Are you folks planning to contribute this feature?

@yuzisun
Author

yuzisun commented May 9, 2023

@yuzisun: Are you folks planning to contribute this feature?

We are still working on a proposal for how to implement this.

@houshengbo
Contributor

/assign

@furykerry

@dprotaso scale-to-zero is not the main reason we use Knative; it is the revision-based rollout, which is not supported via Deployment. With a raw Deployment you can't easily stage traffic and do canary deployments; currently KServe's canary rollout implementation is based on Knative revisions. Also, it is not a fixed set of replicas: you can still leverage KPA with maxReplica > minReplica, where minReplica just ensures stable performance under normal traffic load.

Staging traffic and doing canary deployments can be achieved through a progressive delivery tool such as Kruise Rollout: https://openkruise.io/rollouts/user-manuals/strategy-canary-update

@dprotaso
Member

dprotaso commented Oct 3, 2023

/triage-accepted

@ReToCode ReToCode added the triage/accepted Issues which should be fixed (post-triage) label Oct 4, 2023
@dprotaso
Member

We discussed PR #14487 at the Oct 18th Serving WG meeting.

My suggestion there was to get the functionality working in the progressive rollout repo prior to us deciding on how to make sweeping changes in Serving. This comes with the understanding that this might require copying and pasting some existing code.

The path forward I'm envisioning is the following

  1. Progressive Rollout becomes functional in extensions repo
  2. Get feedback and usage of this feature
  3. Hopefully our serving performance testing issues will be sorted and we can run those tests against the extension repo
  4. Determine necessary changes required to address feedback in 2 & 3 - this could require more rework
  5. Determine if and how we want to merge this functionality back into Serving

I'll close out the extensions PR

@wayzeng

wayzeng commented May 20, 2024

Hi, do we have updates on this? Thanks!

@dprotaso
Member

dprotaso commented Jun 6, 2024

Hey @wayzeng, they're looking for feedback here:

https://github.com/knative-extensions/serving-progressive-rollout

@wayzeng

wayzeng commented Jun 7, 2024

Thank you so much @dprotaso!! Will give it a try.
