Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write sharded repodata #161

Open
wants to merge 52 commits into
base: main
Choose a base branch
from
Open

Write sharded repodata #161

wants to merge 52 commits into from

Conversation

dholth
Copy link
Contributor

@dholth dholth commented May 10, 2024

Description

What would it look like to generate sharded repodata per conda/ceps#75

Interested in seeing whether we can efficiently generate shards; how repodata patching should work; and whether we could generate shards as the primary artifact and then derive repodata.json from shards [at the same time in "processes a sequence of package names" code]

How to test

Check out this repository and https://github.com/dholth/conda-test-data

Decompress conda_test_data/conda-forge/*/.cache/cache.db.zst

Download conda-forge-repodata-patches-<version>.conda

python3 -m conda_index --write-shards --no-write-monolithic --upstream-stage=clone --no-update-cache --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge

Examine output in /tmp/shards

I've begun trying to apply the patches to individual shards. This is slow; should compare against applying the many patches against a whole repodata.json.

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label May 10, 2024
@dholth dholth force-pushed the sharded-repodata branch from 62eb37f to 3cac14a Compare June 7, 2024 18:23
@@ -91,6 +92,23 @@
repodata_version=2 which is supported in conda 24.5.0 or later.
""",
)
@click.option(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could replace this with an "add-only" or "no-remove" option, that would keep packages in the index even if they are not found in the filesystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two options are more about testing from a backup of the conda-forge database, they may not survive into the main branch or we could make them easier to use.

@dholth
Copy link
Contributor Author

dholth commented Aug 19, 2024

Now that the CEP is approved, this branch should be completed and become the way to run conda-index.

@dholth dholth mentioned this pull request Aug 23, 2024
2 tasks
@dholth dholth changed the title Sharded repodata experiment Write sharded repodata Aug 30, 2024
@dholth dholth marked this pull request as ready for review March 6, 2025 20:52
show_default=True,
)
@click.option(
"--upstream-stage",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stat table has multiple rows per package and stage. If a package only exists in stage=fs or clone then its metadata is cached and a stage=index row is added. This option doesn't explain the mechanism.

def _make_rss(channel_name, channeldata):
return rss.get_rss(channel_name, channeldata)


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anticipating a "no conda dependency" or "optional current_repodata" feature, we move these to another module.

@@ -481,6 +339,9 @@ class ChannelIndex:
:param channel_url: fsspec URL where package files live. If provided, channel_root will only be used for cache and index output.
:param fs: ``MinimalFS`` instance to be used with channel_url. Wrap fsspec AbstractFileSystem with ``conda_index.index.fs.FsspecFS(fs)``.
:param base_url: Add ``base_url/<subdir>`` to repodata.json to be able to host packages separate from repodata.json
:param save_fs_state: Pass False to use cached filesystem state instead of ``os.listdir(subdir)``
:param write_monolithic: Pass True to write large 'repodata.json' with all packages.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is repodata.json appropriately called the "monolithic" option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CEP 16 doesn't say anything about this, I think it might have been beneficial to version the repodata format as well, to be clear about it, but alas the CEP has passed.

"monolithic" seems like the closest to the content of the written file, alternatively you could use the fact that it's a JSON file. So maybe "write_monolithic_json"?

@@ -557,7 +439,7 @@ def index(
# begin non-stop "extract packages into cache";
# extract_subdir_to_cache manages subprocesses. Keeps cores busy
# during write/patch/update channeldata steps.
def extract_subdirs_to_cache():
def extract_subdirs_to_cache(): # is the 'prepare' step in 'index_prepared_subdir'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_prepared_subdir was renamed, we could say that we have to load metadata into the cache before we can generate-repodata-out-of-it.

# exactly these packages (unless they are un-indexable) will
# be in the output repodata
if self.save_fs_state:
cache.save_fs_state(subdir_path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use this flag OR overriding the save_fs_state() function to be a noop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... in fact we use this flag AND we override the "get changed packages" function to return an empty list. This preserves the list of packages that belong in the index, and prevents us from trying to update the metadata cache for any of them.

@dholth dholth requested review from jjhelmus and jezdez March 7, 2025 19:27
Copy link
Member

@jezdez jezdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to go through it with a finer comb after re-reading the CEP again. Were you able to compare the result of what prefix is generating with the result of this PR by chance? That might give an indiciation if this is going in a similar direction.

import zstandard
from conda.base.context import context

# BAD BAD BAD - conda internals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yaaay


if self.base_url:
# per https://github.com/conda-incubator/ceps/blob/main/cep-15.md
shards_index["info"]["base_url"] = f"{self.base_url.rstrip('/')}/{subdir}/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth using urljoin here?

db.execute(
"ALTER TABLE index_json ADD COLUMN sha256 AS (json_extract(index_json, '$.sha256'))"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering when it's time to port to sqlalchemy and add alembic to conda-index 😬

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That port exists, we have a sqlalchemy version of the schema that originated in conda-index, first used for a queryable metadata database.

conda-index tries to be deliberately low-dependency. But if we add postgres support we will certainly use sqlalchemy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point!

@dholth
Copy link
Contributor Author

dholth commented Mar 13, 2025

I'm looking into a sharded repodata comparison test with what prefix-dev's splitter produces...

@jezdez
Copy link
Member

jezdez commented Mar 13, 2025

Excellent news, thank you @dholth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed [bot] added once the contributor has signed the CLA
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

6 participants