Write sharded repodata #161

dholth · 2024-05-10T20:17:43Z

Description

What would it look like to generate sharded repodata per conda/ceps#75?

Interested in seeing whether we can efficiently generate shards; how repodata patching should work; and whether we could generate shards as the primary artifact and then derive repodata.json from shards [at the same time in "processes a sequence of package names" code]
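To make the "shards as the primary artifact" idea concrete, here is a minimal sketch that groups package records by name into shards and then rebuilds a monolithic mapping from them. The helper names `split_into_shards`, `shard_digest`, and `merge_shards` are hypothetical, and canonical JSON hashed with SHA-256 is only a stand-in for whatever encoding the CEP specifies; none of this is conda-index code.

```python
import hashlib
import json
from collections import defaultdict


def split_into_shards(repodata):
    """Group package records by package name (hypothetical helper)."""
    shards = defaultdict(lambda: {"packages": {}, "packages.conda": {}})
    for group in ("packages", "packages.conda"):
        for filename, record in repodata.get(group, {}).items():
            shards[record["name"]][group][filename] = record
    return dict(shards)


def shard_digest(shard):
    """Content hash of one shard; canonical JSON is an assumption of this sketch."""
    data = json.dumps(shard, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(data).hexdigest()


def merge_shards(shards):
    """Derive a monolithic repodata-like mapping from the shards."""
    merged = {"packages": {}, "packages.conda": {}}
    for name in sorted(shards):
        for group in merged:
            merged[group].update(shards[name][group])
    return merged


repodata = {
    "packages": {"a-1.0-0.tar.bz2": {"name": "a", "version": "1.0"}},
    "packages.conda": {
        "a-1.1-0.conda": {"name": "a", "version": "1.1"},
        "b-2.0-0.conda": {"name": "b", "version": "2.0"},
    },
}
shards = split_into_shards(repodata)
# Index of shard name -> content hash, the kind of lookup a client would fetch first.
index = {name: shard_digest(shard) for name, shard in shards.items()}
```

Because every record lands in exactly one shard, merging the shards reproduces the original package mappings, which is what would let repodata.json be derived from shards rather than the other way around.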

How to test

Check out this repository and https://github.com/dholth/conda-test-data

Decompress conda_test_data/conda-forge/*/.cache/cache.db.zst

Download conda-forge-repodata-patches-<version>.conda

python3 -m conda_index --write-shards --no-write-monolithic --upstream-stage=clone --no-update-cache --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge

Examine output in /tmp/shards

I've begun trying to apply the patches to individual shards. This is slow; we should compare it against applying the many patches to a whole repodata.json.
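The comparison could be set up roughly as follows. This is a simplified sketch: `apply_instructions` is a hypothetical helper that handles only per-filename record updates and a `remove` list, whereas real patch instructions carry more keys, and the sample data is invented for illustration.

```python
import copy


def apply_instructions(data, instructions):
    """Apply record updates and removals to a repodata-like mapping (simplified)."""
    patched = copy.deepcopy(data)
    for group in ("packages", "packages.conda"):
        for filename, changes in instructions.get(group, {}).items():
            # Only touch records that are present in this mapping.
            if filename in patched.get(group, {}):
                patched[group][filename].update(changes)
    for filename in instructions.get("remove", []):
        for group in ("packages", "packages.conda"):
            patched.get(group, {}).pop(filename, None)
    return patched


record_a = {"name": "a", "depends": ["b"]}
record_b = {"name": "b", "depends": []}
whole = {
    "packages": {},
    "packages.conda": {"a-1.0-0.conda": record_a, "b-2.0-0.conda": record_b},
}
shards = {
    "a": {"packages": {}, "packages.conda": {"a-1.0-0.conda": record_a}},
    "b": {"packages": {}, "packages.conda": {"b-2.0-0.conda": record_b}},
}
instructions = {
    "packages.conda": {"a-1.0-0.conda": {"depends": ["b >=2"]}},
    "remove": ["b-2.0-0.conda"],
}

patched_whole = apply_instructions(whole, instructions)
patched_shards = {
    name: apply_instructions(shard, instructions) for name, shard in shards.items()
}
```

In this sketch every shard scans the full instruction set, so the work grows with the number of shards times the number of instructions. Whether that is the actual cause of the slowness noted above is not established here.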

Checklist - did you ...

Add a file to the news directory (using the template) for the next release's release notes?
Add / update necessary tests?
Add / update outdated documentation?


dholth · 2024-07-01T14:14:42Z

conda_index/cli/__init__.py

@@ -91,6 +92,23 @@
        repodata_version=2 which is supported in conda 24.5.0 or later.
        """,
 )
+@click.option(


We could replace this with an "add-only" or "no-remove" option that would keep packages in the index even if they are not found in the filesystem.

These two options are mostly for testing from a backup of the conda-forge database; they may not survive into the main branch, or we could make them easier to use.

dholth · 2024-08-19T14:54:17Z

Now that the CEP is approved, this branch should be completed and become the way to run conda-index.

conda_index/index/shards.py

Combining the sharded CLI into the main CLI

dholth · 2025-03-06T21:09:41Z

conda_index/cli/__init__.py

+    show_default=True,
+)
+@click.option(
+    "--upstream-stage",


The stat table has multiple rows per package, one per stage. If a package exists only in stage=fs or clone, its metadata is cached and a stage=index row is added. This option doesn't explain the mechanism.
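A rough sketch of that mechanism, using an in-memory SQLite table. The schema is a cut-down stand-in for the real stat table, the literal stage name in `INDEXED_STAGE` is an assumption (the comment above calls it stage=index), and `needs_indexing` is a hypothetical helper.

```python
import sqlite3

INDEXED_STAGE = "indexed"  # assumption: literal value used for the index stage

db = sqlite3.connect(":memory:")
# Simplified stand-in for the stat table: one row per (path, stage).
db.execute(
    "CREATE TABLE stat (stage TEXT NOT NULL, path TEXT NOT NULL, mtime REAL, "
    "PRIMARY KEY (path, stage))"
)
db.executemany(
    "INSERT INTO stat (stage, path, mtime) VALUES (?, ?, ?)",
    [
        ("fs", "a-1.0-0.conda", 100.0),  # listed upstream and already indexed
        (INDEXED_STAGE, "a-1.0-0.conda", 100.0),
        ("fs", "b-2.0-0.conda", 200.0),  # listed upstream, not indexed yet
    ],
)


def needs_indexing(db, upstream_stage="fs"):
    """Paths present in the upstream stage with no matching indexed row."""
    rows = db.execute(
        """
        SELECT up.path FROM stat AS up
        LEFT JOIN stat AS done
          ON done.path = up.path AND done.stage = ?
        WHERE up.stage = ?
          AND (done.path IS NULL OR done.mtime != up.mtime)
        ORDER BY up.path
        """,
        (INDEXED_STAGE, upstream_stage),
    )
    return [row[0] for row in rows]


todo = needs_indexing(db)
```

Passing a different `upstream_stage` (for example `"clone"`) changes which rows are treated as the list of packages that belong in the index, without touching the rows for the other stages.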

dholth · 2025-03-06T21:10:57Z

conda_index/index/__init__.py

-def _make_rss(channel_name, channeldata):
-    return rss.get_rss(channel_name, channeldata)
-
-


Anticipating a "no conda dependency" or "optional current_repodata" feature, we move these to another module.

dholth · 2025-03-06T21:11:44Z

conda_index/index/__init__.py

@@ -481,6 +339,9 @@ class ChannelIndex:
    :param channel_url: fsspec URL where package files live. If provided, channel_root will only be used for cache and index output.
    :param fs: ``MinimalFS`` instance to be used with channel_url. Wrap fsspec AbstractFileSystem with ``conda_index.index.fs.FsspecFS(fs)``.
    :param base_url: Add ``base_url/<subdir>`` to repodata.json to be able to host packages separate from repodata.json
+    :param save_fs_state: Pass False to use cached filesystem state instead of ``os.listdir(subdir)``
+    :param write_monolithic: Pass True to write large 'repodata.json' with all packages.


Is repodata.json appropriately called the "monolithic" option?

CEP 16 doesn't say anything about this. I think it might have been beneficial to version the repodata format as well, to be clear about it, but alas the CEP has passed.

"monolithic" seems closest to the content of the written file; alternatively, you could use the fact that it's a JSON file. So maybe "write_monolithic_json"?

dholth · 2025-03-06T21:12:43Z

conda_index/index/__init__.py

@@ -557,7 +439,7 @@ def index(
            # begin non-stop "extract packages into cache";
            # extract_subdir_to_cache manages subprocesses. Keeps cores busy
            # during write/patch/update channeldata steps.
-            def extract_subdirs_to_cache():
+            def extract_subdirs_to_cache():  # is the 'prepare' step in 'index_prepared_subdir'


index_prepared_subdir was renamed; we could say instead that we have to load metadata into the cache before we can generate-repodata-out-of-it.

dholth · 2025-03-06T21:13:16Z

conda_index/index/__init__.py

+                    # exactly these packages (unless they are un-indexable) will
+                    # be in the output repodata
+                    if self.save_fs_state:
+                        cache.save_fs_state(subdir_path)


I think we use this flag OR override the save_fs_state() function to be a no-op

... in fact we use this flag AND we override the "get changed packages" function to return an empty list. This preserves the list of packages that belong in the index, and prevents us from trying to update the metadata cache for any of them.
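The interaction can be sketched with toy classes. Everything here is hypothetical: `Cache`, `FrozenCache`, `changed_packages`, and `packages_to_index` are illustrative names, not the classes or methods in this PR.

```python
class Cache:
    """Hypothetical stand-in for the per-subdir metadata cache."""

    def __init__(self, listed, indexed, on_disk=()):
        self.listed = set(listed)    # recorded filesystem state
        self.indexed = set(indexed)  # packages with cached metadata
        self.on_disk = set(on_disk)  # what a directory listing would return

    def save_fs_state(self):
        # Replace the recorded package list with the directory listing.
        self.listed = set(self.on_disk)

    def changed_packages(self):
        # Packages whose metadata would need to be extracted.
        return sorted(self.listed - self.indexed)


class FrozenCache(Cache):
    """Variant for indexing from a database backup with no package files."""

    def changed_packages(self):
        return []


def packages_to_index(cache, save_fs_state=True):
    """Return (packages that belong in the index, packages to extract)."""
    if save_fs_state:
        cache.save_fs_state()
    return sorted(cache.listed), cache.changed_packages()


# A backup: the package list is recorded, but no files exist on disk.
backup = FrozenCache(listed={"a", "b"}, indexed={"a"})
in_index, to_extract = packages_to_index(backup, save_fs_state=False)
```

With `save_fs_state=True` and an empty directory, the same backup would lose its package list, which is why both the flag and the override are needed in this model.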

jezdez

I'll have to go through it with a finer comb after re-reading the CEP. Were you able to compare the result of what prefix is generating with the result of this PR, by chance? That might give an indication of whether this is going in a similar direction.

jezdez · 2025-03-10T18:16:50Z

conda_index/index/__init__.py

 import zstandard
-from conda.base.context import context
-
-#  BAD BAD BAD - conda internals


jezdez · 2025-03-10T18:19:18Z

conda_index/index/__init__.py

+
+        if self.base_url:
+            # per https://github.com/conda-incubator/ceps/blob/main/cep-15.md
+            shards_index["info"]["base_url"] = f"{self.base_url.rstrip('/')}/{subdir}/"


Is it worth using urljoin here?
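For reference, a small comparison of the two approaches using only the standard library. The base URLs are made-up examples and the two helper names are hypothetical.

```python
from urllib.parse import urljoin


def subdir_url_fstring(base_url, subdir):
    """The approach in the diff: strip trailing slashes, then append."""
    return f"{base_url.rstrip('/')}/{subdir}/"


def subdir_url_urljoin(base_url, subdir):
    """The urljoin alternative: resolve subdir relative to base_url."""
    return urljoin(base_url, f"{subdir}/")


no_slash = "https://example.com/channel"
with_slash = "https://example.com/channel/"

results = {
    "fstring_no_slash": subdir_url_fstring(no_slash, "linux-64"),
    "fstring_with_slash": subdir_url_fstring(with_slash, "linux-64"),
    "urljoin_no_slash": subdir_url_urljoin(no_slash, "linux-64"),
    "urljoin_with_slash": subdir_url_urljoin(with_slash, "linux-64"),
}
```

`urljoin` follows relative-reference resolution, so a base without a trailing slash has its last path segment replaced rather than extended. The f-string form gives the same result whether or not `base_url` ends in a slash.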

conda_index/index/__init__.py

jezdez · 2025-03-10T18:25:00Z

conda_index/index/convert_cache.py

+        db.execute(
+            "ALTER TABLE index_json ADD COLUMN sha256 AS (json_extract(index_json, '$.sha256'))"
+        )
+


Wondering when it's time to port to sqlalchemy and add alembic to conda-index 😬

That port exists, we have a sqlalchemy version of the schema that originated in conda-index, first used for a queryable metadata database.

conda-index tries to be deliberately low-dependency. But if we add postgres support we will certainly use sqlalchemy.

Ah, good point!
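For context on the migration in the diff above, here is a sketch of adding a virtual generated column with plain `sqlite3`. The table is a cut-down stand-in for the real `index_json` table, `add_sha256_column` is a hypothetical helper, and it assumes an SQLite build with generated-column support (3.31 or later) and the JSON functions.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Cut-down stand-in for the cache table: a path plus the package's index.json.
db.execute("CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB)")
db.execute(
    "INSERT INTO index_json VALUES (?, ?)",
    ("a-1.0-0.conda", json.dumps({"name": "a", "sha256": "abc123"})),
)


def add_sha256_column(db):
    """Add the generated column once; safe to call on an already-migrated db."""
    # table_xinfo lists generated columns, which plain table_info omits.
    columns = {row[1] for row in db.execute("PRAGMA table_xinfo(index_json)")}
    if "sha256" not in columns:
        db.execute(
            "ALTER TABLE index_json ADD COLUMN sha256 "
            "AS (json_extract(index_json, '$.sha256'))"
        )


add_sha256_column(db)
add_sha256_column(db)  # second call is a no-op
row = db.execute("SELECT path, sha256 FROM index_json").fetchone()
```

The column is computed from the stored JSON on read, so existing rows gain a value without being rewritten. A migration tool such as alembic would track this step by revision instead of inspecting the schema.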

tests/environment.yml

dholth · 2025-03-13T16:13:14Z

I'm looking into a sharded repodata comparison test with what prefix-dev's splitter produces...

jezdez · 2025-03-13T18:10:16Z

Excellent news, thank you @dholth

dholth added 2 commits May 10, 2024 15:04

begin sharded repodata creation

a2cc811

remove non-overridden method

9001bc8

conda-bot added the cla-signed label ([bot] added once the contributor has signed the CLA) May 10, 2024

dholth added 13 commits May 10, 2024 16:57

write to output_root

ba6759a

move save_fs_state out of extract_subdir_to_cache

11392e5

shards cli capable of generating repodata from cache database and no packages

fd6c9bc

create output subdir if necessary

504a61c

allow expanduser in --patch-generator; use output_path when reading shards for patching

d049dc2

continue applying patches to shards

e979689

add --no-current-repodata option

4f17a15

maintain current_repodata=True as the default

90c92a2

Merge branch 'main' into sharded-repodata

69a1e6c

Merge branch 'main' into sharded-repodata

de75383

skip test_cli; why does it cause test failure?

012f234

add news

9584bb5

begin combine-small-shards experiment

3cac14a

dholth force-pushed the sharded-repodata branch from 62eb37f to 3cac14a Compare June 7, 2024 18:23

dholth added 2 commits June 12, 2024 12:08

include virtual name, sha256 columns in create() as well as migration

45d3a7c

Merge branch 'main' into sharded-repodata

06bbd84

dholth commented Jul 1, 2024


dholth mentioned this pull request Jul 22, 2024

Implement sharded repodata CEP conda/conda#14060

Open

2 tasks

Merge branch 'main' into sharded-repodata

4e4b3fd

travishathaway reviewed Aug 20, 2024

conda_index/index/shards.py (outdated, resolved)

travishathaway reviewed Aug 20, 2024

conda_index/index/shards.py (outdated, resolved)

Merge branch 'main' into sharded-repodata

193b6c8

dholth mentioned this pull request Aug 23, 2024

Support CEP-16 sharded repodata #182

Open

2 tasks

travishathaway and others added 2 commits August 27, 2024 14:26

combining the sharded CLI into the main CLI

04afa01

Merge pull request #2 from travishathaway/merge-sharded-repodata-cli

c01824b

Combining the sharded CLI into the main CLI

dholth changed the title ~~Sharded repodata experiment~~ Write sharded repodata Aug 30, 2024

dholth and others added 16 commits February 27, 2025 16:27

use upload-pages-artifact@v3

e5ed839

Update conda_index/cli/__init__.py

e960913

refactor 'save package to database' as own function

7c11428

merge sharded repodata into base sqlitecache

c8d88d5

update cli

a5c834e

move current_repodata generator to its own file

17f61ae

restore VersionOrder import

5fa858e

update news

6d84ebf

remove _make_rss in favor of rss.get_rss()

c2c7537

write patched, unpatched sharded repodata

0fd2a91

update docstring

38db23d

remove unused function

4b376e3

Merge branch 'main' into sharded-repodata

55b2a06

improve coverage

ec5ebb5

coverage

8d8472d

remove obsolete shards_example

11fabe6

dholth marked this pull request as ready for review March 6, 2025 20:52

add scare word to experimental options; rename function

60fa069

dholth commented Mar 6, 2025


dholth requested review from jjhelmus and jezdez March 7, 2025 19:27

jezdez reviewed Mar 10, 2025


dholth added 2 commits March 10, 2025 16:05

remove some comments

79b39e2

add msgpack-python to recipe

7400343
