-
Notifications
You must be signed in to change notification settings - Fork 18k
io/fs, net/http: define interface for automatic ETag serving #60940
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I have a hacky implementation available here: |
Thinking out loud (sorry for the noise), it seems even better to add an optional method to
Updated proposal:First, in io/fs, define // A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
fs.FileInfo
// FileHashes returns content hashes of the file that uniquely
// identifies the file contents.
// The returned hashes should be of the form algorithm-base64,
// and implementations are encouraged to use sha256, sha384, or sha512
// as the algorithms and RawStdEncoding as base64 encoding,
// for interoperability with other systems.
//
// FileHashes must NOT compute any hash of the file during the call.
// That is, it must run in time O(1) not O(length of file).
// If no content hash is already available, FileHashes should
// return nil rather than take the time to compute one.
FileHashes() []string
} Second, in net/http, when serving a File (in func setEtag(w ResponseWriter, fi fs.FileInfo) {
if ch, ok := fi.(FileHashesInfo); ok {
if w.Header().Get("Etag") != "" {
return
}
for _, h := range ch.FileHashes() {
// TODO: skip the hash if unsuitable (define "suitable")
// TODO: should the etag be weak or strong?
w.Header().Set("Etag", `W/"`+h+`"`)
break
}
}
} Third (probably out of scope for this proposal), add the |
This proposal has been added to the active column of the proposals project |
To summarize the proposal above:
I think we can probably improve on a few of these decisions.
|
Thanks for the feedback!
Logically, I would put the Trying to be more concrete, I was able to find 3 implementations of fs.FS in the stdlib:
So the first case dos not really influence the decision. For cases outside of the stdlib, I looked up S3 implementations and found that the hashes were returned when GETting the file or requesting the HEAD (so adding it to the FileInfo would mirror both ways of accessing the hashes, while attaching to
I would drop the
I really like your
My example code only sets the ETAg once. I think this should be sufficient. However to work fine, the implementer should:
PS: do you think that dropping a comment in the previous proposal would be a good idea, to gather more feedback? |
Updated proposal, taking into accounts the comments above: // ContentHashesInfo provides pre-computed hashes of the file contents.
type ContentHashesInfo interface {
FileInfo
// ContentHashes returns pre-computed hashes of the file contents.
//
// ContentHashes must NOT compute any hash of the file during the call.
// That is, it must run in time O(1) not O(length of file).
// If no content hash is already available, ContentHashes should
// return nil rather than take the time to compute one.
//
// The order of the returned hash must be stable.
ContentHashes() []Hash
}
// Hash indicates the hash of a given content.
type Hash struct {
// Algorithm indicates the algorithm used. Implementations are encouraged
// to use package-like name for interoperability with other systems
// (lowercase, without dash: e.g. sha256, sha1, crc32)
Algorithm string
// Sum is the result of the hash, it should not be modified.
Sum []byte
} I have created a new |
I'm still on the fence about FileInfo vs File, but I'm willing to try FileInfo and see how it goes. It seems like we are at:
The remaining question in my reply above is (4), namely what does HTTP do when Hash returns multiple hashes? As far as I can tell it makes no sense to send back multiple ETag headers. |
I would suggest to use the first suitable hash. For instance taking the first one with at least 32 bits (and truncating it to 512 bits): if w.Header().Get("Etag") != "" {
return
}
const minLen, maxLen = 4, 64
for _, h := range ch.ContentHashes() {
buf := h.Sum
if len(buf) < minLen {
// hash should have at least 32 bits
continue
}
if len(buf) > maxLen {
buf = buf[:maxLen]
}
// Strong etag: any encoding middleware should set it to weak.
w.Header().Set("Etag", `"`+base64.RawStdEncoding.EncodeToString(buf)+`"`)
break
} |
Nit: Hash() returns more than one Hash. Hashes()? |
It seems fine for Hash to return []Hash. It doesn't have to be Hashes. Have all remaining concerns about this proposal been addressed? |
Based on the discussion above, this proposal seems like a likely accept. |
No change in consensus, so accepted. 🎉 The proposal details are as follows. In io/fs, we add:
Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum). In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”. |
@oliverpool If you're interested in working on this, feel free to send a patch |
Should there be a package function in fs like // Hashes returns fi.Hash() if fi implements HashFileInfo. Otherwise it returns nil.
func Hashes(fi FileInfo) []Hash {
if hfi, ok := fi.(HashFileInfo); ok {
return hfi.Hash()
}
return nil
} |
Change https://go.dev/cl/562875 mentions this issue: |
@earthboundkid I don't think that there is a need for it. However if usage shows that there is interest in such a package function, it can be added later. |
Pretty sure it was put there specifically to support ETag. It's ready to be used when we are ready to use it. |
I am not fully convinced by this line of reasoning. The goal of having an exported |
I have updated the CL with @neild suggestions:
I am not sure that I would be able to change the hash of I would appreciate any feedback, to move this forward. |
Proposal committee, I'd appreciate your input here. This proposal is accepted, but there are some questions regarding it that need addressing. To summarize (see discussion above for details):
|
Isn't it maybe easier to make a generic So then you could have something like:
This could then work with any source of a file system, like |
I haven't replied here yet because every time I sit down to reply, I don't have any good answers. I will try to figure this out, but it will probably have to be after the Go 1.23 freeze. The intersection of cgo, sha256, boringcrypto, and these hashes makes things more difficult than they should be. |
What if we define a new hash function "goembed1" which is notsha256("goembed1"+content) truncated to 128 bits? |
I also agree with @neild's comment above that we should not add anything that returns a file hash < 128 bits to the standard library. |
@rsc What about my proposal for a general |
Anyone can write fs.WithHash themselves and they are welcome to. The point here was to provide a pre-computed hash that avoids the startup cost. |
I don't have any particularly good answers either. Perhaps we should say that go:embed files should provide a sha256 hash, and that figuring out whatever boringcrypto issues block providing one is a prerequisite for implementing that portion of this proposal? We can add the |
This sounds like a reasonable compromise and a good way forward in the meantime.
The leak would be fixed if the algorithm name is not considered for the ETag (as currently implemented, as suggested in https://go-review.googlesource.com/c/go/+/562875/19..26/src/net/http/fs.go#614). Or do you mean that the caller can recompute notsha256("goembed1"+content) and check the resulting hash against the ETag? (in this case the leak is more or less already present since go:embed files have not modtime when served).
I would rather have the standard library makes use of (standard) interfaces as much as possible. The
The core problem this proposal attempts to address is "automatic serving of ETag headers for static content, using content hashes" (as stated in #43223). I would prefer delaying this only if a clear and timed implementation strategy is worked out. |
Many files served by many web servers don't have a modification time. However, only files served by Go webservers using //go:embed files would have an ETag matching notsha256("goembed1"+content). I don't know that this matters, but it would be an informational leak.
The proposed HashFileInfo documentation states that the hashes provided "uniquely identify" the file contents. While obviously no hash is a literal unique identifier, a 32-bit hash is short enough to be highly susceptible to birthday paradox collisions even outside adversarial scenarios, and cannot serve as a (pseudo-)unique identifier. |
I hadn't seen the CL yet that drops the algorithm: from the Etag. That sounds like a good suggestion, and it eliminates the concerns about leaking that the hash is goembed1. That works for me. |
Is there anything that can be done to make some progress on this issue? Is the current solution good enough? #60940 (comment)
|
It seems that This is disgusting, but it's precedent. Should we do the same for embedded file hashes? (We'd also need to increase the size of the stored hash, since it's only 128 bits right now.) |
Who can take the responsibility to answer this question? I would really appreciate if we could move this issue forward. |
Friendly ping: should we rely on the fact that |
I'll add another friendly ping, this would be a really great feature to have, thanks @oliverpool for driving it this far! |
I'm going to try to summarize where we are on this. The original proposalWe have an accepted proposal: #60940 (comment) The proposal is as follows: Add the following interface to package fs
type HashFileInfo interface {
FileInfo
Hash() []Hash
}
type Hash struct {
Algorithm string
Sum []byte
} The The Problems with embedded file hashesImplementation of this proposal (https://go.dev/cl/562875) encountered difficulty providing a sha256 hash for embedded files, since embedded files contain only a 16-byte hash consisting of a a truncated-and-modified sha256. We had some discussion over what name to use for this hash (since the We made some changes (#60940 (comment)) to the proposal in reaction to this situation, notably dropping the algorithm name from the Etag header. In the time since the last update to this proposal, there have been changes in the compiler's handling of hashes. Previously, it was using the "cmd/internal/notsha256" package to compute hashes. This notsha256 package was created to avoid problems with building with the boringcrypto experiment. The notsha256 package was removed in https://go.dev/cl/610599, and the compiler now uses a lightly-greased sha256 for hashing. (See cmd/internal/hash.) Embedded files still contain a truncated-and-modified sha256. The notable change here is that for obscure reasons involving boringcrypto the compiler used to need to avoid Questions about hash lengthThere was much debate about whether The argument in favor, as I understand it, is that CRC-32 is a hash, and therefore we should return it. The argument against is that a 32-bit hash is far too small to avoid accidental collisions. Questions about the APIThe The I suggested an alternate interface that avoids allocations:
Proposed path forwardI propose that:
Proposal committee: Does this sound reasonable? In particular, should we proceed with the original |
The garbage factory problem is concerning because package fs
type HashFileInfo interface {
FileInfo
Hash() iter.Seq[Hash]
}
type Hash struct {
Algorithm string
Base64 string
} Many users will end up wanting the Sum in base64 anyway, so it saves a step. |
@earthboundkid In @neild 's suggestion, the |
I believe In HTTP, 304 caching is validated based on time or hash values, using the commonly used headers If func setEtag(w http.ResponseWriter, r *http.Request) {
path := "..."
file, err := os.Open(path)
if err != nil {
return err
}
defer file.Close()
stat, _ := file.Stat()
w.Header().Set("Content-Type", "...") // If not set, http.ServeContent will auto-detect
w.Header().Set("Etag", computeEtagFromStat(stat)) // Pass ETag to http.ServeContent
http.ServeContent(w, r, stat.Name(), stat.ModTime(), file)
} In scenarios such as Nginx static file serving, If |
I was talking about the accepted interface: type HashFileInfo interface {
FileInfo
Hash() []Hash
}
type Hash struct {
Algorithm string
Sum []byte
} Which as Neil said is, "either unsafe or a garbage factory" even if the Sum is precomputed. My concern about the |
@eudore The entire point of the original proposal is to automatically serve Etag headers for cases where we have a precomputed hash, in particular when serving from an embedded filesystem. |
My hypothesis is that this doesn't matter, for the reason you give: Either you don't care what the hash is and will take whatever is available, or you are looking for a specific hash. Etag serving doesn't care what the hash is, for example. But maybe there's a case I'm not thinking of that really would want enumeration. net/http is a bit of a garbage factory in general, so a couple of small allocations per file served probably doesn't really matter much, in which case the interface as proposed is fine. |
I would like to point out that https://go.dev/cl/562875 does neither: it iterates over the available hashes and takes the first one suitable (with a length between 128 and 264 bits currently). I believe such logic to not be uncommon. For instance, with SRI you also want to walk though the algorithms to find a suitable one (currently the allowed prefixes are sha256, sha384, and sha512 - arguably in this case requesting 3 hashes could be an acceptable tradeoff).
Especially if it allows skipping sending a complete file (which is the very goal of the net/http change: reply with an empty 304 response instead of re-serving the whole file). Thank you @neild for your persistance! I agree with the path forward you proposed (my comments above do not oppose to it). Feel free to let me know when this "proposed path forward" is considered accepted, so that I can rework my CL. |
Can I go with the path forward proposed by @neild in #60940 (comment) ? Reading the comments since then, I don't see any disagreement or new concerns. The proposal is not perfect, but I believe that it is currently the best solution (for a 4 year old problem...). Adding this should immediately reduce bandwidth usage of all servers serving embedded files. |
Renewal of #43223
Here is a proposal which tries to address the concerns raised in #43223.
Accepted proposal
In io/fs, define
Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).
In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.
Original proposal (fs.File)
First, in io/fs, define
Second, in net/http, when serving a File (in
serveFile
, right beforeserveContent
for instance), if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.Third, add the
ContentHash
method onhttp.ioFile
file (as a proxy to thefs.File
ContentHash
method).Fourth (probably out of scope for this proposal), add the
ContentHash
method onembed.FS
files.This proposal fixes the following objections:
The caller will simply get all available implementations and can filter them out.
The implementers are encouraged to indicate the algorithm used for each hash.
This one does.
This implementation cannot return an error (the implementer choose to panic. Returning
nil
seems better suited).It is currently very cumbersome, since the middleware would need to open the file as well (which means having the exact same logic regarding URL cleanup as the http.FileServer). Here is my attempt: https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag/fileserver.go (even uglier, since I have use
reflect
to retrieve the underlyingfs.File
from thehttp.File
).Could a "github-collaborator" post a message in #43223 to notify the people who engaged in previous proposal of this updated proposal?
The text was updated successfully, but these errors were encountered: