Skip to content

experimental client: download functions review #1339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jku opened this issue Apr 8, 2021 · 10 comments
Closed

experimental client: download functions review #1339

jku opened this issue Apr 8, 2021 · 10 comments
Labels
experimental-client Items related to the development of a new client (see milestone/8 and theexperimental-client branch)

Comments

@jku
Copy link
Member

jku commented Apr 8, 2021

Current state of _mirror_target_download() and _mirror_meta_download() in experimental client is a bit bad:

  • they are actually iterators that keep returning unverified file objects to the caller, this seems unusual and surprising: I was expecting them to return a verified file object
  • caller is expected to verify the contents --- this makes some sense as the verification is different for e.g. root metadata and other metadata and targets... but it should be made very clear that the file object is potentially malicious data at this point
  • none of the callers currently iterate these function results (they just raise if first result is invalid) so it's not clear if mirrors actually work
  • NoWorkingMirrorError can't be raised from these iterators like it now happens (on every iteration that fails), it does not make any sense
  • _mirror_target_download type annotation is wrong

I would suggest we revert to a known working design if possible (_mirror_target_download() could just be almost 1:1 the current clients _get_target_file() function but the metadata verification needs something else).

I'll try to come up with something...

@jku jku added the experimental-client Items related to the development of a new client (see milestone/8 and theexperimental-client branch) label Apr 8, 2021
@jku
Copy link
Member Author

jku commented Apr 8, 2021

this is very much related to #1318 -- the tricky file object handling is in these two functions and in the callers

@joshuagl joshuagl added this to the Experimental client milestone Apr 13, 2021
@jku
Copy link
Member Author

jku commented Apr 15, 2021

I think this is a good place to document the discussion from yesterday (@sechkova can fix mistakes): We've come to the conclusion that separating the download/mirror functionality as I suggested does not work better than the "old way".

There's two things interacting here: the need to download a file from possibly multiple mirrors and the validation and verififcation of that file -- we don't know if we need to download from other mirrors until we've fully validated/verified the file from the first mirror. This is especially an issue for metadata which needs to not just be internally validated but also verified by other current metadata... This means that there are really two choices:

  • mirror-downloader can verify the files (maybe via a verifier object it gets or something): this means the updater loop is now slightly easier to read but obfuscates the logic as it's hard to see when/where the verification happens
  • the updater loop knows about all of the candidate downloads (as current updater does)
  • (extra 3rd choice: metadata should not support mirrors. This would make things simpler and I think no-one wants to mirror metadata anyway... but we've not considered this seriously for now)

The second option just seems better: the updater download loops are monsters but it seems taking this specific for-loop out is not helpful. experimental-client sort of does the second option but with iterators that I think are not an improvement -- we've agreed to take a step back to something even more similar to current updater (no iterators, just a list of mirror-urls : code already known to work).

To make the very long updater loops more readable by other means I'm experimenting with #1313 (taking the metadata validity enforcing elsewhere).

@trishankatdatadog
Copy link
Member

If it makes the code simpler, we should just do away with mirrors. No one really uses them anymore.

@jku
Copy link
Member Author

jku commented Apr 15, 2021

If it makes the code simpler, we should just do away with mirrors. No one really uses them anymore.

ha! I just added an edit mentioning that :)

Unfortunately pip does need multiple hosts for targets. It's not quite the concept of mirrors (the hosts don't have the same targets) but currently the mirrors configuration can be used for it with some wrestling

@jku
Copy link
Member Author

jku commented Apr 15, 2021

Still, now that you said it: if we slightly modified how the mirrors work, they would be much easier to work with:

  • if a metadata/target is found on a mirror then other mirrors will not be queried about it, even if the file fails to validate/verify (currently all mirrors are tried in this case)

This change would make the whole mirroring quite a lot simpler. Obviously it would mean a malicious mirror could now deny all service to a client (but as you said, mirrors aren't really used as actual mirrors so this is already the case -- pypi.org will be able to deny service to pip :) )

@sechkova
Copy link
Contributor

Decided to take a step back, implement the mirrors download in a clean but very repetitive way inside Updater and add some additional clean up of temporary objects: #1352.
This cleaner implementation should serve as a base for next improvements: reducing code repetition together with metadata "bundling".

@trishankatdatadog
Copy link
Member

Unfortunately pip does need multiple hosts for targets. It's not quite the concept of mirrors (the hosts don't have the same targets) but currently the mirrors configuration can be used for it with some wrestling

I don't think this is a problem TUF should solve at the risk of added complexity, especially if it's designed for only one use case in mind. Could pip solve that problem on its own?

@jku
Copy link
Member Author

jku commented Apr 26, 2021

I don't think this is a problem TUF should solve at the risk of added complexity, especially if it's designed for only one use case in mind.

I do agree with the beginning of that sentence. Yet, not taking into account real world use cases creates designs that are great for research projects.

I have not other data but would be surprised if multiple download domains is going to be unique to pip.

Could pip solve that problem on its own?

Well, TUF is designed to resolve a target name to a URL... If this design is used then TUF must be able to choose the host(s) is should use.

One alternative that I can see is that we do not use TUF for anything except metadata: in the current implementation that would mean after get_one_valid_targetinfo() pip would just take the target name and figure out what it should download. This does not feel great either.

Another is that every download call (download_target() in current client) accepts an optional URL prefix. This might be a path we could look into

@jku
Copy link
Member Author

jku commented Apr 26, 2021

Could pip solve that problem on its own?

By the way, in case I've not been clear about this: this is not an issue in how pip is designed, it's an issue with how pypi hosts its files: One host (pypi.org) serves metadata and targets. These targets contain URLs that pip parses to find more host-URLs: these hosts only serve target files -- currently the only host is pythonhosted.org but it could be anything.

Apart from parsing target files to find more hosts, this sounds completely reasonable: assuming that metadata and targets are on the same host (or that targets are all on one host) could be a mistake for TUF.

@sechkova
Copy link
Contributor

Resolved with #1373 and the remove of mirrors support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental-client Items related to the development of a new client (see milestone/8 and theexperimental-client branch)
Projects
None yet
Development

No branches or pull requests

4 participants