
Allow rolling upgrade during progressive rollout #12971


Open
yuzisun opened this issue May 27, 2022 · 18 comments · May be fixed by #15780
Labels
kind/feature Well-understood/specified features, ready for coding. triage/accepted Issues which should be fixed (post-triage)

Comments

@yuzisun

yuzisun commented May 27, 2022

Describe the feature

Currently, every time Knative rolls out a new revision and minReplica is set, it briefly requires 2x the resources, and only releases the resources used by the old revision after that revision is scaled down. This is a problem for platforms that enforce resource quotas, since users have to budget 2x the resources just to run the service, and it becomes an even bigger problem when the service runs on scarce GPU hardware.

With progressive rollout, do we still need to wait for all of the minReplicas to be ready before traffic starts migrating from the old revision to the new one? Can we do something like a rolling update, where once 20% of traffic has moved to the new revision we start scaling down the old revision accordingly, so that the rollout does not require 2x the resources?
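For context, a minimal sketch of the setup being described, assuming the standard `autoscaling.knative.dev/min-scale` annotation; the service name and image are placeholders. During a rollout, both the old and the new revision hold min-scale replicas at the same time, which is where the 2x requirement comes from.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gpu-model-server              # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Keep at least 10 replicas warm for steady-state traffic.
        autoscaling.knative.dev/min-scale: "10"
    spec:
      containers:
        - image: example.com/model:v2 # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"     # one GPU per replica
```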

@yuzisun yuzisun added the kind/feature Well-understood/specified features, ready for coding. label May 27, 2022
@psschwei
Contributor

Not sure if this exactly solves your issue, but it is possible to configure a gradual rollout.

I've seen a couple of issues come up around this topic (for example, #12551 #12859), so probably something we should look into.
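For reference, a sketch of how gradual rollout is configured today via the `rollout-duration` setting (either cluster-wide in the `config-network` ConfigMap or per service via an annotation); the name, image, and duration below are illustrative:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                    # placeholder name
  annotations:
    # Shift traffic from the previous revision to the new one
    # gradually over 380 seconds instead of all at once.
    serving.knative.dev/rollout-duration: "380s"
spec:
  template:
    spec:
      containers:
        - image: example.com/app:v2   # placeholder image
```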

@yuzisun
Author

yuzisun commented May 29, 2022

Unfortunately the current implementation of progressive rollout does not help when minReplica is set. For example, if you set minReplica to 10 with 1 GPU per replica, then you need 20 GPUs for the rollout, because the progressive rollout does not kick in until the minReplica requirement is fulfilled.

Checking the code here, the revision is marked active when pc.ready >= minReady, so traffic only starts to shift to the new revision after that. What we want is the behavior of a Kubernetes Deployment's rolling update: traffic can start shifting as soon as x% of minReplica is ready, and at the same time the old revision can be scaled down, so that we do not need 2x resources for every rollout.
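For comparison, this is the Deployment semantics being referenced: a rolling update bounded by maxSurge/maxUnavailable, so the total pod count stays close to the desired replica count instead of doubling (names and values below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                  # placeholder name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2                     # at most 2 pods above the desired 10
      maxUnavailable: 0               # never drop below 10 ready pods
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: example.com/model:v2 # placeholder image
```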

@dprotaso
Member

@yuzisun do you have examples of how you're configuring the traffic blocks - or is it latestRevision: true?
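For reference, "traffic blocks" here refers to the spec.traffic section of the Service, e.g. pinning percentages to named revisions rather than relying on `latestRevision: true`; the revision names below are placeholders:

```yaml
spec:
  traffic:
    - revisionName: my-service-00001  # placeholder revision names
      percent: 80
    - revisionName: my-service-00002
      percent: 20
    # versus the default behavior:
    # - latestRevision: true
    #   percent: 100
```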

With progressive rollout, do we still need to wait for all of the minReplicas to be ready before traffic starts migrating from the old revision to the new one?

Revisions won't be Ready until they reach their min-scale. I do wonder if this can be adjusted by adding the initial-scale annotation set to 1, which might mark the revision Ready earlier and cause traffic to shift.
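A sketch of that suggestion, using the existing `autoscaling.knative.dev/initial-scale` annotation alongside min-scale (as noted later in this thread, it did not change readiness in practice):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "10"
        # Idea: start the new revision at 1 replica so it could become
        # Ready (and start receiving traffic) before reaching min-scale.
        autoscaling.knative.dev/initial-scale: "1"
```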

The other thing to note is we 'drain' old pods so they can handle any outstanding requests - we also wait in case there are delays in the network programming.

Can we do something like a rolling update, where once 20% of traffic has moved to the new revision we start scaling down the old revision accordingly, so that the rollout does not require 2x the resources?

Our autoscaling system currently evaluates scale for revisions independently of each other. Coordination across revisions is currently not supported. Gradual rollout only works because we're manipulating traffic % at a higher level of abstraction.

What we want is the behavior of a Kubernetes Deployment's rolling update: traffic can start shifting as soon as x% of minReplica is ready, and at the same time the old revision can be scaled down, so that we do not need 2x resources for every rollout.

A legit question to ask: if you're not benefiting from scale to zero, and your clusters don't have the capacity to scale up new pods, maybe you should be using Deployments, since it seems like you want a fixed set of replicas?

@dprotaso
Member

/triage needs-user-input

@knative-prow knative-prow bot added the triage/needs-user-input Issues which are waiting on a response from the reporter label May 30, 2022
@yuzisun
Author

yuzisun commented May 31, 2022

@dprotaso scale-to-zero is not the main reason we use Knative; it is the revision-based rollout, which is not supported via Deployment. With a raw Deployment you can't easily stage traffic and do canary deployments; currently KServe's canary rollout implementation is based on Knative revisions. Also, it is not a fixed set of replicas: you can still leverage KPA with maxReplica > minReplica, where minReplica just ensures stable performance under normal traffic load.
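For illustration, the min/max bounds mentioned here map to the standard KPA annotations; the values are placeholders:

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/min-scale: "10"   # stable baseline for normal load
        autoscaling.knative.dev/max-scale: "30"   # still scales up under bursts
```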

@yuzisun
Author

yuzisun commented Jun 10, 2022

@dprotaso I have tested out the initial-scale annotation; it does not mark the revision Ready earlier, since the larger of the initial scale and the lower bound is chosen as the initial target scale.

@github-actions

github-actions bot commented Sep 9, 2022

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2022
@yuzisun
Author

yuzisun commented Jan 7, 2023

/reopen

@knative-prow knative-prow bot reopened this Jan 7, 2023
@knative-prow

knative-prow bot commented Jan 7, 2023

@yuzisun: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2023
@ReToCode ReToCode removed the triage/needs-user-input Issues which are waiting on a response from the reporter label Mar 8, 2023
@rachitchauhan43

@yuzisun: Are you folks planning to contribute this feature?

@yuzisun
Author

yuzisun commented May 9, 2023

@yuzisun: Are you folks planning to contribute this feature?

We are still working on a proposal for how to implement this.

@houshengbo
Contributor

/assign

@furykerry

@dprotaso scale-to-zero is not the main reason we use Knative; it is the revision-based rollout, which is not supported via Deployment. With a raw Deployment you can't easily stage traffic and do canary deployments; currently KServe's canary rollout implementation is based on Knative revisions. Also, it is not a fixed set of replicas: you can still leverage KPA with maxReplica > minReplica, where minReplica just ensures stable performance under normal traffic load.

Staging traffic and doing canary deployments can be achieved through a progressive delivery tool such as Kruise Rollout: https://openkruise.io/rollouts/user-manuals/strategy-canary-update

@dprotaso
Member

dprotaso commented Oct 3, 2023

/triage-accepted

@ReToCode ReToCode added the triage/accepted Issues which should be fixed (post-triage) label Oct 4, 2023
@dprotaso
Member

We discussed PR #14487 at the Oct 18th Serving WG meeting.

My suggestion there was to get the functionality working in the progressive rollout repo prior to us deciding on how to make sweeping changes in Serving. This comes with the understanding that this might require copying and pasting some existing code.

The path forward I'm envisioning is the following

  1. Progressive Rollout becomes functional in extensions repo
  2. Get feedback and usage of this feature
  3. Hopefully our serving performance testing issues will be sorted and we can run those tests against the extension repo
  4. Determine necessary changes required to address feedback in 2 & 3 - this could require more rework
  5. Determine if and how we want to merge this functionality back into Serving

I'll close out the extensions PR

@wayzeng

wayzeng commented May 20, 2024

Hi, do we have updates on this? Thanks!

@dprotaso
Member

dprotaso commented Jun 6, 2024

Hey @wayzeng, they're looking for feedback here:

https://github.com/knative-extensions/serving-progressive-rollout

@wayzeng

wayzeng commented Jun 7, 2024

Thank you so much @dprotaso!! Will give it a try.
