Use LinearCache to optimize StreamEndpoint discovery. #6906

Open
wants to merge 1 commit into
base: main

tsaarni
Member

@tsaarni tsaarni commented Feb 19, 2025

This change attempts to improve performance in clusters with a large number of endpoints, as discussed in #6743 (comment). The PR replaces the Envoy go-control-plane SnapshotCache with LinearCache for EDS.

LinearCache was previously considered but not adopted because of complications outlined by @skriss in a prior PR:

If Envoy already has config for a given resource at a particular version, then on a control plane restart, the version number of the resource in the cache will be reset to 1 (or close to 1), therefore will not be sent to Envoy since Envoy already has a "later" version of the resource.

This PR attempts to mitigate this by generating a unique version prefix at each startup, so that versions issued after a restart never match the versions Envoy already holds.
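
To illustrate the restart problem and the mitigation, here is a minimal, self-contained sketch. It is not the code in this PR and it does not use go-control-plane. The names `versionedCache`, `newVersionPrefix`, and `needsPush` are hypothetical, and the plain string comparison in `needsPush` is an assumed simplification of how a cache decides whether a client is already up to date.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strconv"
)

// newVersionPrefix returns a random prefix, intended to be generated once per
// process start. (Hypothetical helper, not taken from the PR.)
func newVersionPrefix() string {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b) + "-"
}

// versionedCache is a toy stand-in for a cache that versions resources with a
// counter that starts again from zero whenever the process restarts.
type versionedCache struct {
	prefix   string
	counter  uint64
	versions map[string]string // resource name -> current version
}

func newVersionedCache(prefix string) *versionedCache {
	return &versionedCache{prefix: prefix, versions: map[string]string{}}
}

// update bumps the counter and stamps the resource with prefix+counter.
func (c *versionedCache) update(name string) string {
	c.counter++
	v := c.prefix + strconv.FormatUint(c.counter, 10)
	c.versions[name] = v
	return v
}

// needsPush reports whether a client that holds clientVersion should be sent
// the resource. Assumption: versions are compared as opaque strings.
func (c *versionedCache) needsPush(name, clientVersion string) bool {
	v, ok := c.versions[name]
	return ok && v != clientVersion
}

func main() {
	// First control plane instance sends a resource to Envoy.
	before := newVersionedCache(newVersionPrefix())
	envoyHas := before.update("default/backend")

	// The control plane restarts: the counter resets, the prefix changes.
	after := newVersionedCache(newVersionPrefix())
	after.update("default/backend")

	fmt.Println("envoy has:", envoyHas)
	fmt.Println("push after restart:", after.needsPush("default/backend", envoyHas))
}
```

With an empty prefix, both instances would stamp the resource as version `1` and the restarted instance would conclude that Envoy is already up to date. With distinct prefixes the version strings differ, so the resource is sent again.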

Fixes #6743

@tsaarni tsaarni requested a review from a team as a code owner February 19, 2025 14:03
@tsaarni tsaarni requested review from skriss and sunjayBhatia and removed request for a team February 19, 2025 14:03
@sunjayBhatia sunjayBhatia requested review from a team, davinci26 and izturn and removed request for a team February 19, 2025 14:04
Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@tsaarni tsaarni force-pushed the eds-performance-fix branch from 8f2cf3c to 7bc7a34 Compare February 19, 2025 14:07
@tsaarni tsaarni added the release-note/small A small change that needs one line of explanation in the release notes. label Feb 19, 2025

codecov bot commented Feb 19, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 80.79%. Comparing base (38346c5) to head (7bc7a34).
Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/xdscache/v3/endpointslicetranslator.go | 0.00% | 3 Missing ⚠️ |
| internal/xdscache/v3/snapshot.go | 93.10% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6906      +/-   ##
==========================================
+ Coverage   80.70%   80.79%   +0.08%     
==========================================
  Files         131      131              
  Lines       19816    19802      -14     
==========================================
+ Hits        15993    15999       +6     
+ Misses       3533     3514      -19     
+ Partials      290      289       -1     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/xdscache/v3/endpointstranslator.go | 86.66% <100.00%> (-0.79%) ⬇️ |
| internal/xdscache/v3/snapshot.go | 86.53% <93.10%> (+8.12%) ⬆️ |
| internal/xdscache/v3/endpointslicetranslator.go | 79.09% <0.00%> (+4.58%) ⬆️ |

Contributor

@davinci26 davinci26 left a comment

Before I review: we are extremely interested in this, but is there a way to put it behind a feature flag?

Coming from a bit of ignorance: could Contour be updated in place and things would just work, or would it require restarting all the Envoy pods?

@tsaarni
Member Author

tsaarni commented Feb 19, 2025

Before I review: we are extremely interested in this, but is there a way to put it behind a feature flag?

I have not worked with go-control-plane and xDS subscription versioning details, so this should be carefully reviewed. I'd appreciate extra eyes on this.

If necessary, we can add a feature flag, but I'm not sure if it is needed - see below.

Coming from a bit of ignorance: could Contour be updated in place and things would just work, or would it require restarting all the Envoy pods?

I don't believe the Envoy pods need to be restarted. As far as I understand, Envoy is unaware of the algorithm the server uses to generate versions; it simply returns the last received version info to the server.
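
As a rough illustration of that point, the sketch below models the client side with the version treated as an opaque string. The message types are simplified stand-ins for the xDS `DiscoveryRequest` and `DiscoveryResponse` (only the type URL and version fields are modeled), and `client`, `handle`, and `reconnect` are hypothetical names. That the first request on a new stream carries the last accepted version is my understanding of the protocol, not something verified in this PR.

```go
package main

import "fmt"

// discoveryResponse and discoveryRequest are simplified stand-ins for the xDS
// messages; the real messages carry more fields (nonce, resources, and so on).
type discoveryResponse struct {
	TypeURL     string
	VersionInfo string
}

type discoveryRequest struct {
	TypeURL     string
	VersionInfo string
}

// client models Envoy's view: it remembers the last accepted version per
// resource type and never interprets or compares the version itself.
type client struct {
	lastVersion map[string]string
}

func newClient() *client {
	return &client{lastVersion: map[string]string{}}
}

// handle accepts a response and returns the ACK, echoing the version verbatim.
func (c *client) handle(resp discoveryResponse) discoveryRequest {
	c.lastVersion[resp.TypeURL] = resp.VersionInfo
	return discoveryRequest{TypeURL: resp.TypeURL, VersionInfo: resp.VersionInfo}
}

// reconnect builds the first request sent on a new stream, for example after
// the control plane restarts. Assumption: it carries the last accepted version.
func (c *client) reconnect(typeURL string) discoveryRequest {
	return discoveryRequest{TypeURL: typeURL, VersionInfo: c.lastVersion[typeURL]}
}

func main() {
	const eds = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
	c := newClient()

	// Versions in two unrelated formats are handled the same way, because the
	// client only stores and echoes the string.
	fmt.Println(c.handle(discoveryResponse{TypeURL: eds, VersionInfo: "42"}).VersionInfo)
	fmt.Println(c.handle(discoveryResponse{TypeURL: eds, VersionInfo: "a1b2c3d4-1"}).VersionInfo)
	fmt.Println(c.reconnect(eds).VersionInfo)
}
```

Because the client only echoes the string, a server can change its versioning scheme between releases without any change on the Envoy side.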

I'm still working on fully understanding the difference between the cache implementations. I created https://github.com/tsaarni/grpc-json-sniffer to gain more insight into this issue.

@tsaarni
Member Author

tsaarni commented Feb 20, 2025


github-actions bot commented Mar 7, 2025

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack workspace.

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2025
Labels
release-note/small A small change that needs one line of explanation in the release notes.
Development

Successfully merging this pull request may close these issues.

Contour leader doesn't update endpoints in xDS cache after upstream pods recreation
2 participants