Skip to content

pkg: allow arbitrary opam repo layouts #11616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gridbugs
Copy link
Collaborator

@gridbugs gridbugs commented Apr 8, 2025

Opam repos can have fairly arbitrary directory layouts. The only requirement is that all package manifests are under the top-level "packages" directory, and each manifest is in a file named "opam" whose parent directory is named like ".".

Prior to this change dune assumed a layout similar to the official opam-repository, where manifests are at
"packages//./opam".

This generalizes dune's expectations of opam repositories such that arbitrary directory layouts are supported.

Fixes #11615

Opam repos can have fairly arbitrary directory layouts. The only
requirement is that all package manifests are under the top-level
"packages" directory, and each manifest is in a file named "opam" whose
parent directory is named like "<name>.<version>".

Prior to this change dune assumed a layout similar to the official
opam-repository, where manifests are at
"packages/<name>/<name>.<version>/opam".

This generalizes dune's expectations of opam repositories such that
arbitrary directory layouts are supported.

Signed-off-by: Stephen Sherratt <stephen@sherra.tt>
Copy link
Collaborator

@maiste maiste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have noticed few typos, but I mostly have questions regarding the errors handling. My main question is shouldn't we prevent the users from using a corrupted opam repository?

> opam-version: "2.0"
> EOF

Make some packgaes at various depths in the directory hierarchy:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Make some packgaes at various depths in the directory hierarchy:
Make some packages at various depths in the directory hierarchy:

(Package_version.Map.update version_to_entry version ~f:(function
| None -> Some entry
| Some existing_entry ->
User_warning.emit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to fail instead if such cases occur? It means that the repository doesn't respect the structure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a warning is ok when loading but we should definitely refuse to lock a package like this. Especially since which entry is the existing_entry and which one is the new entry depends on Fpath.traverse which doesn't guarantee order, so you might potentially get different results depending e.g. on which FS the repo is on.

Comment on lines +151 to +153
let opam_file_path_relative_to_repo_root =
Path.Local.relative (Path.Local.relative Paths.packages dir) file
in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be computed at line 171 to make it closer to where it is used?

Comment on lines +214 to +217
let entry =
{ Entry.opam_file_path_relative_to_repo_root = path
; rev_store_file_if_loaded_from_git_repo = Some file
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let entry =
{ Entry.opam_file_path_relative_to_repo_root = path
; rev_store_file_if_loaded_from_git_repo = Some file
}
let entry : Entry.t =
{ opam_file_path_relative_to_repo_root = path
; rev_store_file_if_loaded_from_git_repo = Some file
}

match Path.exists (Path.append_local dir opam_file_path) with
| false -> None
| true ->
let files_dir = Opam_file_table.files_dir_path_of_package opam_file_table package in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it call twice the opam_file_path_of_package function? Shouldn't it be simplified to avoid it?

@@ -0,0 +1,105 @@
Test that dune can solve dependencies with opam repos using a layout other than
the one used by the official opam repo. Opam allows arbitrary dependency
hierarchies inside the "package" directory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hierarchies inside the "package" directory.
hierarchies inside the "packages" directory.

- c.3


Now test that dune can handle some degenerate cases.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Now test that dune can handle some degenerate cases.
Now test that dune can handle some problematic cases.

😄

Copy link
Collaborator

@Leonidas-from-XIV Leonidas-from-XIV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the code is fine mostly and conceptually it is the right thing, as it enables us to use the Jane Street OPAM repo and any repo really, which is definitely neat.

However, I wonder if the complexity and cost in loading the whole repo into memory is worth it. Realistically people are going to use opam-repository, our overlays whose layout we control ourselves. So maybe the Jane Street repo layout could be added when inspecting the repo (e.g. how many layers are there between package/**/<name.version>/opam and then forcing the rest of the repo to match instead of allowing any nesting for any opam file), thus changing very little about what we currently do, basically just some additional processing in Opam_repo.Paths.

It is less flexible but I would argue that the added flexibility is not all that useful (OPAM suffers from this already, where it allows a lot of flexibility, sometimes to its detriment) and the price might be quite high.

I don't have a strong opinion on this however, so if people think that the added complexity, processing time and memory usage is worth it, I am okay with that.

module Opam_file_table = struct
module Entry = struct
type t =
{ opam_file_path_relative_to_repo_root : Path.Local.t
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels overly very verbose, maybe worth shortening it a little?

;;
end

module Version_to_entry = struct
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version_map?


let files_dir_path_of_package (t : t) package =
Option.map (opam_file_path_of_package t package) ~f:(fun opam_file_path ->
Path.Local.relative (Path.Local.parent_exn opam_file_path) "files")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So files is always relative to the opam file? If so, maybe worth writing a short comment to explain the layout.

~f:(fun (entry : Entry.t) -> entry.opam_file_path_relative_to_repo_root))
;;

let files_dir_path_of_package (t : t) package =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let files_dir_path_of_package (t : t) package =
let files_dir_of_package (t : t) package =

Path.Local.relative (Path.Local.parent_exn opam_file_path) "files")
;;

let all_package_versions_of_name (t : t) package_name =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let all_package_versions_of_name (t : t) package_name =
let all_package_versions (t : t) package_name =

of_name is kind of implied by the type of the argument, Package_name.t.

(Package_version.Map.update version_to_entry version ~f:(function
| None -> Some entry
| Some existing_entry ->
User_warning.emit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a warning is ok when loading but we should definitely refuse to lock a package like this. Especially since which entry is the existing_entry and which one is the new entry depends on Fpath.traverse which doesn't guarantee order, so you might potentially get different results depending e.g. on which FS the repo is on.

let of_repo_dir_path opam_repo_dir_path ~loc : t =
let repo_name_for_messages = Path.to_absolute_filename opam_repo_dir_path in
try
Fpath.traverse_files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd really want a Fiber version of this, as for large repos or repos on slow shares this will block Dune for a while. On my machine with an SSD just find -name opam in OPAM repository takes half a second.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fiber or not, it doesn't seem like a good trade-off to introduce such an expensive step that would benefit so few users. Can you either make this optional or introduce a lazy loading scheme for the JST repo?

files named "opam". *)
let of_rev_store_at_rev at_rev ~repo_name_for_messages ~loc : t =
Rev_store.At_rev.directory_entries ~recursive:true at_rev Paths.packages
|> Rev_store.File.Set.fold ~init:empty ~f:(fun file acc ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this materialize every opam file from a Git repo into memory? In the case of OPAM-Repository that's 30MB of data, most of which will never be looked at.

- b.2
- c.3

$ rm custom-opam-repo/packages/opam
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this action for? It seems like there should be some kind of test but isn't.


Create an opam file in a directory that doesn't have a version number. There's
no way to refer to this package in dune-project but dune should still be able
to parse the repo with crashing.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to parse the repo with crashing.
to parse the repo without crashing.

Maybe?

let of_repo_dir_path opam_repo_dir_path ~loc : t =
let repo_name_for_messages = Path.to_absolute_filename opam_repo_dir_path in
try
Fpath.traverse_files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fiber or not, it doesn't seem like a good trade-off to introduce such an expensive step that would benefit so few users. Can you either make this optional or introduce a lazy loading scheme for the JST repo?

@gridbugs
Copy link
Collaborator Author

gridbugs commented Apr 9, 2025

Thanks for the PR! I have noticed few typos, but I mostly have questions regarding the errors handling. My main question is shouldn't we prevent the users from using a corrupted opam repository?

If a repo does not follow the convention set by opam-repository, that doesn't mean it is corrupt. The documentation on repo layout in the opam manual specifies that any layout where opam files are contained in directories named after the package is acceptable. The document also suggests a potential future change to the layout of opam-repository where another level of hierarchy is introduced to group packages by common prefixes, e.g. packages/p/pa/package-name/package-name.version.

I agree with the sentiment that in practice we're probably only ever going to see repos formatted like opam-repository or JST, and the complexity/runtime cost of fully general opam repos is too high to be worth it. I briefly thought about introducing a fast-path where if a repo is formatted like opam-repository then we can make some optimizing assumptions, but the only way to be sure that a repo is formatted like opam-repository is to do a full scan of all its opam files which is just as expensive as not having a fast-path to begin with.

So I'm thinking I'll defer solving #11615 until I have more time to work on it, and eventually we can use a solution more specialized to the layouts we see in opam-repository and JST.

Alternatively, @rgrinberg how feasible would it be to ask Jane Street to change the layout of their repos to match opam-repository?

@maiste
Copy link
Collaborator

maiste commented Apr 9, 2025

If a repo does not follow the convention set by opam-repository, that doesn't mean it is corrupt.

It is a misunderstanding: I was referring to repositories with twice the same file or an opam file at the root.

@rgrinberg
Copy link
Member

I'll ask, but I don't really see an issue with implementing an alternative lazy loading scheme. What we're doing is implementing the following function for a single repository:

val list_packages : Package_name.t -> Package_version.t list

So you could implement this function by:

  1. Try to list packages/$package. If this succeeds, we finish our search.
  2. If the above fails because packages/$package doesn't exist, you read can read all of packages and look for $package.*. The result is now the matches in 3
  3. For subsequent lookups, you want the listing of packages/ to be cached. Otherwise it would be really slow for large repos that aren't from git.

With this scheme, the common users always get the optimized experience, while we still support this alternative layout.
For multiple repositories, we are sadly going to hit the 2nd step more often than we'd like. On the other hand, multiple repositories aren't all that common and the performance problem aren't relevant to opam repositories from git.

@gridbugs
Copy link
Collaborator Author

For multiple repositories, we are sadly going to hit the 2nd step more often than we'd like. On the other hand, multiple repositories aren't all that common and the performance problem aren't relevant to opam repositories from git.

What do you mean by "multiple repositories" here?

@gridbugs
Copy link
Collaborator Author

I just noticed that Jane Street's compiler extensions branch uses a mix of different layouts within its packages directory. Most packages are directly under packages but a few (e.g. ocaml-variants) get a subdirectory in the style of opam-repository. @rgrinberg is that what you mean by a "multiple repository"?

@rgrinberg
Copy link
Member

Multiple repositories can be defined in dune's workspace file in order. Refer to tests like multiple-opam-repos.t for examples.

Mixing the two layouts defeats the ability to implement any lazy loading scheme. I'm going to ask about sticking to one scheme. For now, I think you could just make eager loading "opt-in" for some repos.

@gridbugs
Copy link
Collaborator Author

Is lazy-loading the opam repo something we're planning to implement in the future? Unless I'm mistaken it looks like we currently enumerate all the paths in the opam repo when creating a Rev_store.At_rev.t to populate the files: File.Set.t field (here). I timed dune pkg lock a few times with and without this change and didn't see a noticeable difference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dune can only understand opam repos matching the layout of opam-repository
4 participants