Skip to content

io/fs, net/http: define interface for automatic ETag serving #60940

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
oliverpool opened this issue Jun 22, 2023 · 64 comments
Open

io/fs, net/http: define interface for automatic ETag serving #60940

oliverpool opened this issue Jun 22, 2023 · 64 comments

Comments

@oliverpool
Copy link

oliverpool commented Jun 22, 2023

Renewal of #43223

In the discussion of io/fs and embed, a few people asked for automatic serving of ETag headers for static content, using content hashes.

Here is a proposal which tries to address the concerns raised in #43223.

Accepted proposal

In io/fs, define

// HashFileInfo provides hashes of the file content in constant time.
type HashFileInfo interface {
	FileInfo
	// Hash returns content hashes of the file that uniquely
	// identifies the file contents.
	//
	// Hash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, Hash should
	// return nil rather than take the time to compute one.
	//
	// The order of the returned hashes must be constant (preferred hashes first).
	Hash() []Hash
}
// Hash indicates the hash of a given content.
type Hash struct {
	// Algorithm indicates the algorithm used. Implementations are encouraged
	// to use package-like name for interoperability with other systems
	// (lowercase, without dash: e.g. sha256, sha1, crc32)
	Algorithm string
	// Sum is the result of the hash, it should not be modified by the caller.
	Sum []byte
}

Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).

In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.


Original proposal (fs.File)

First, in io/fs, define

// A ContentHashFile is a file that can return hashes of its content in constant time.
type ContentHashFile interface {
	fs.File

	// ContentHash returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64.
	// Implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as the base64 encoding,
	// for interoperability with other systems (e.g. Subresource Integrity).
	//
	// ContentHash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHash should
	// return nil rather than take the time to compute one.
	ContentHash() []string
}

Second, in net/http, when serving a File (in serveFile, right before serveContent for instance), if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.

func setEtag(w http.ResponseWriter, file File) {
	if ch, ok := file.(fs.ContentHashFile); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.ContentHash() {
			// TODO: skip the hash if unsuitable (space, unicode, symbol)
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third, add the ContentHash method on http.ioFile file (as a proxy to the fs.File ContentHash method).

Fourth (probably out of scope for this proposal), add the ContentHash method on embed.FS files.

This proposal fixes the following objections:

The API as proposed does not let the caller request a particular implementation.

The caller will simply get all available implementations and can filter them out.

The API as proposed does not let the implementation say which hash it used.

The implementers are encouraged to indicate the algorithm used for each hash.

The API as proposed does not let the implementation return multiple hashes.

This one does.

what is expected to happen if the ContentHash returns an error before transport?

This implementation cannot return an error (the implementer choose to panic. Returning nil seems better suited).

Drop this proposal and let third-party code fill this need.

It is currently very cumbersome, since the middleware would need to open the file as well (which means having the exact same logic regarding URL cleanup as the http.FileServer). Here is my attempt: https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag/fileserver.go (even uglier, since I have use reflect to retrieve the underlying fs.File from the http.File).


Could a "github-collaborator" post a message in #43223 to notify the people who engaged in previous proposal of this updated proposal?

@gopherbot gopherbot added this to the Proposal milestone Jun 22, 2023
@oliverpool
Copy link
Author

I have a hacky implementation available here:
https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag

@oliverpool
Copy link
Author

Thinking out loud (sorry for the noise), it seems even better to add an optional method to fs.FileInfo (instead of fs.File):

Updated proposal:

First, in io/fs, define

// A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
	fs.FileInfo

	// FileHashes returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as base64 encoding,
	// for interoperability with other systems.
	//
	// FileHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, FileHashes should
	// return nil rather than take the time to compute one.
	FileHashes() []string
}

Second, in net/http, when serving a File (in serveFile, right before serveContent for instance), if its FileInfo implements FileHashesInfo and the FileHashes method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.

func setEtag(w ResponseWriter, fi fs.FileInfo) {
	if ch, ok := fi.(FileHashesInfo); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.FileHashes() {
			// TODO: skip the hash if unsuitable (define "suitable")
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third (probably out of scope for this proposal), add the FileHashes method on embed.FS.*file (which implements the FileInfo interface).

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Jun 23, 2023
@rsc rsc moved this from Incoming to Active in Proposals Jul 12, 2023
@rsc
Copy link
Contributor

rsc commented Jul 12, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Jul 21, 2023

To summarize the proposal above:

  1. Add a new extension method to FileInfo, not File.
  2. To avoid argument about which hash to use, the method returns a slice of hashes.
  3. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.
  4. The web server sets multiple ETag headers, one for each hash.

I think we can probably improve on a few of these decisions.

  1. I am not sure about attaching the method to FileInfo. It is extremely unlikely that any FileInfo implementation would already have the hashes in its state. Instead, it would have to call back to some accessor method that uses the File handle. If the File has been closed, this may not be available anymore at all. It seems clearer to keep the method an extension of File than an extension of FileInfo. It will probably be less work for a File implementation to add a new exported method on File than to thread a new method on FileInfo back to a new unexported method on File.

     type HashFile struct {
         File
         Hash() ([]Hash, error)
     }
    

    seems fine to me.

  2. Nothing to improve here.

  3. That's a nice format, but it means you have to pick it apart with string manipulation to get the algorithm. I suggest instead having a

     package fs
     
     type Hash struct {
         Algorithm string
         Sum []byte
     }
    
     func (h Hash) String() string { ... }
    

    where the String method returns that standard CSP form. Clients who want the string can easily get it; clients who want the algorithm can easily get it; and clients who want a different form can easily compute it.

  4. I can't find anything that says it is legal to send back multiple ETag headers, and I can't see what that means if you want to send back an If-Match header - which one do you use? Instead I think we should let the fs decide what the preferred hash is and put that one first. Then the web server just uses the first hash as the ETag.

@oliverpool
Copy link
Author

Thanks for the feedback!

  1. Add a new extension method to FileInfo, not File.

I am not sure about attaching the method to FileInfo

Logically, I would put the ContentHash "near" the ModTime (hence my proposition to augment FileInfo).

Trying to be more concrete, I was able to find 3 implementations of fs.FS in the stdlib:

  • embed.FS: file is both the File and the FileInfo, so no difference here (for now at least)
  • os.DirFS: very unlikely to provide the pre-computed content hash. Must be wrapped to provide this feature. As you point out, it is probably a little bit more work, but I found it quite doable to augment the Stat method: https://git.sr.ht/~oliverpool/exp/tree/fileinfo/item/httpetag/embed.go
  • zip.Reader: CRC32 is stored in FileHeader (which is wrapped to provide the fs.FileInfo)

So the first case dos not really influence the decision.
The second case is in favor of File.
And the zip case favors a bit the FileInfo attachment.

For cases outside of the stdlib, I looked up S3 implementations and found that the hashes were returned when GETting the file or requesting the HEAD (so adding it to the FileInfo would mirror both ways of accessing the hashes, while attaching to File would prevent exposing it when doing a HEAD request).

Hash() ([]Hash, error)

I would drop the error (since the hashes should not be computed and likely retrieved along the other properties).

  1. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.

I really like your struct proposition, because it also simplifies the ETag logic: just encode the raw byte with an encoding producing the right characters!

  1. The web server sets multiple ETag headers, one for each hash.

My example code only sets the ETAg once. I think this should be sufficient. However to work fine, the implementer should:

  1. Always send the hashes in the same order (otherwise the ETag will unexpectedly change between requests)
  2. Send the "preferred" hashes first ("preferred" should be defined more precisely, maybe "strongest" in the cryptographic sense ?)

PS: do you think that dropping a comment in the previous proposal would be a good idea, to gather more feedback?

@oliverpool
Copy link
Author

Updated proposal, taking into accounts the comments above:

// ContentHashesInfo provides pre-computed hashes of the file contents.
type ContentHashesInfo interface {
	FileInfo

	// ContentHashes returns pre-computed hashes of the file contents.
	//
	// ContentHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHashes should
	// return nil rather than take the time to compute one.
	//
	// The order of the returned hash must be stable.
	ContentHashes() []Hash
}

// Hash indicates the hash of a given content.
type Hash struct {
	// Algorithm indicates the algorithm used. Implementations are encouraged
	// to use package-like name for interoperability with other systems
	// (lowercase, without dash: e.g. sha256, sha1, crc32)
	Algorithm string
	// Sum is the result of the hash, it should not be modified.
	Sum []byte
}

I have created a new fileinfo_struct branch in my demo code.

@rsc
Copy link
Contributor

rsc commented Sep 20, 2023

I'm still on the fence about FileInfo vs File, but I'm willing to try FileInfo and see how it goes. It seems like we are at:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

The remaining question in my reply above is (4), namely what does HTTP do when Hash returns multiple hashes? As far as I can tell it makes no sense to send back multiple ETag headers.

@oliverpool
Copy link
Author

oliverpool commented Sep 21, 2023

what does HTTP do when Hash returns multiple hashes?

I would suggest to use the first suitable hash. For instance taking the first one with at least 32 bits (and truncating it to 512 bits):

		if w.Header().Get("Etag") != "" {
			return
		}
		const minLen, maxLen = 4, 64
		for _, h := range ch.ContentHashes() {
			buf := h.Sum
			if len(buf) < minLen {
				// hash should have at least 32 bits
				continue
			}
			if len(buf) > maxLen {
				buf = buf[:maxLen]
			}
			// Strong etag: any encoding middleware should set it to weak.
			w.Header().Set("Etag", `"`+base64.RawStdEncoding.EncodeToString(buf)+`"`)
			break
		}

@willfaught
Copy link
Contributor

Nit: Hash() returns more than one Hash. Hashes()?

@rsc
Copy link
Contributor

rsc commented Oct 4, 2023

It seems fine for Hash to return []Hash. It doesn't have to be Hashes.
Using the first Hash as the Etag seems fine too.

Have all remaining concerns about this proposal been addressed?

@rsc
Copy link
Contributor

rsc commented Oct 11, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals Oct 11, 2023
@rsc
Copy link
Contributor

rsc commented Oct 26, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The proposal details are as follows.

In io/fs, we add:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).

In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.

@rsc rsc moved this from Likely Accept to Accepted in Proposals Oct 26, 2023
@rsc rsc changed the title proposal: io/fs, net/http: define interface for automatic ETag serving io/fs, net/http: define interface for automatic ETag serving Oct 26, 2023
@rsc rsc modified the milestones: Proposal, Backlog Oct 26, 2023
@mauri870
Copy link
Member

mauri870 commented Nov 2, 2023

@oliverpool If you're interested in working on this, feel free to send a patch

@earthboundkid
Copy link
Contributor

Should there be a package function in fs like

// Hashes returns fi.Hash() if fi implements HashFileInfo. Otherwise it returns nil.
func Hashes(fi FileInfo) []Hash {
    if hfi, ok := fi.(HashFileInfo); ok {
       return hfi.Hash()
    }
    return nil
}

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/562875 mentions this issue: fs: add HashFileInfo interface to enable ETag serving in net/http

@oliverpool
Copy link
Author

Should there be a package function

@earthboundkid I don't think that there is a need for it. However if usage shows that there is interest in such a package function, it can be added later.

@ianlancetaylor
Copy link
Contributor

@neild

I can't figure out why there's a hash in there; the embed package doesn't seem to use it.

Pretty sure it was put there specifically to support ETag. It's ready to be used when we are ready to use it.

@oliverpool
Copy link
Author

I propose that rather than baking this limitation into net/http, we avoid adding any implementations of HashFileInfo to the standard library that return hashes less than 128 bits. This means not implementing HashFileInfo in archive/zip.

I am not fully convinced by this line of reasoning. The goal of having an exported HashFileInfo interface is to allow anyone to implement it. So if we want net/http to behave correctly with 3rd party libraries, it must perform some checks on the hashes (we can't forbid 3rd-party libraries to exposes crc32 hashes or weaker). Having this check, I see no reason not to expose the crc32 hash in archive/zip (allowing its access through an interface is the very goal of an interface).

@oliverpool
Copy link
Author

I have updated the CL with @neild suggestions:

  • http: the algorithm name is not part of the ETag anymore
  • http: only hashes with a length between 128 and 264 bits are considered suitable
  • embed: the algorithm name is made to change on every release: go1.22.2/cmd/internal/notsha256[:16] (until the proper sha256 hash is exposed)

I am not sure that I would be able to change the hash of embed to a full sha256 and I would prefer if this could be done in a future CL.

I would appreciate any feedback, to move this forward.

@neild
Copy link
Contributor

neild commented Apr 8, 2024

Proposal committee, I'd appreciate your input here. This proposal is accepted, but there are some questions regarding it that need addressing. To summarize (see discussion above for details):

  1. The accepted proposal states that package embed will provide sha256 hashes. Embedded files currently contain an unexposed 128-bit hash. This hash is generated with cmd/internal/notsha256, which is explicitly not sha256 to avoid issues with the bootstrap process when GOEXPERIMENT=boringcrypto (see https://go.dev/cl/402594). Should embed use this hash or use sha256 as in the proposal? (Using sha256 will either require generating the hash at runtime, or figuring out another way around the bootstrap issue.)
  2. If embed should use the 128-bit notsha256 hash, what should the hash algorithm name be? (The HashFileInfo interface doesn't allow for unnamed, arbitrary hashes.)
  3. If embed should use the 128-bit notsha256 hash, should we apply grease to it by, for example, hashing the Go release into the value?
  4. The HashFileInfo interface requires allocating a slice and copying the hash value(s) on every call, if it is to avoid providing the caller with something that aliases private memory. Do we want to consider an allocation-free API instead (I suggested one above in io/fs, net/http: define interface for automatic ETag serving #60940 (comment))?

@mitar
Copy link
Contributor

mitar commented Apr 8, 2024

Isn't it maybe easier to make a generic fs wrapper with which you can wrap embed's fs which computes sha256 (or any other hash you really want)?

So then you could have something like:

//go:embed dir
var dirFiles embed.FS

var dirFilesWithHash = fs.WithHash(dirFiles, myHash)

This could then work with any source of a file system, like fstest.MapFS and so on.

@rsc
Copy link
Contributor

rsc commented May 8, 2024

I haven't replied here yet because every time I sit down to reply, I don't have any good answers. I will try to figure this out, but it will probably have to be after the Go 1.23 freeze. The intersection of cgo, sha256, boringcrypto, and these hashes makes things more difficult than they should be.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

What if we define a new hash function "goembed1" which is notsha256("goembed1"+content) truncated to 128 bits?
We add the new interfaces and APIs as described in the accept message, but instead of providing a sha256 hash, we provide a goembed1 hash. This would leak the fact that the server is serving from a go:embed file system. I don't know if that matters.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

I also agree with @neild's comment above that we should not add anything that returns a file hash < 128 bits to the standard library.

@mitar
Copy link
Contributor

mitar commented May 9, 2024

@rsc What about my proposal for a general fs.WithHash function? And we simply leave the internal embed hash internal. That seems to make make everything more composable, developer can control exactly which hash they want to use, and also testing is possible with non-embed but still fs types.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

Anyone can write fs.WithHash themselves and they are welcome to. The point here was to provide a pre-computed hash that avoids the startup cost.

@neild
Copy link
Contributor

neild commented May 9, 2024

I don't have any particularly good answers either.

Perhaps we should say that go:embed files should provide a sha256 hash, and that figuring out whatever boringcrypto issues block providing one is a prerequisite for implementing that portion of this proposal?

We can add the fs.HashFileInfo interface even if embedded files don't implement it at first.

@oliverpool
Copy link
Author

notsha256("goembed1"+content) truncated to 128 bits

This sounds like a reasonable compromise and a good way forward in the meantime.

This would leak the fact that the server is serving from a go:embed file system. I don't know if that matters.

The leak would be fixed if the algorithm name is not considered for the ETag (as currently implemented, as suggested in https://go-review.googlesource.com/c/go/+/562875/19..26/src/net/http/fs.go#614). Or do you mean that the caller can recompute notsha256("goembed1"+content) and check the resulting hash against the ETag? (in this case the leak is more or less already present since go:embed files have not modtime when served).

we should not add anything that returns a file hash < 128 bits to the standard library

I would rather have the standard library makes use of (standard) interfaces as much as possible. The archive/zip package currently exposes the CRC32 hash in a specific way. I see no harm into exposing it as part of the interface. fs.HashFileInfo should not be considered a secure primitive (MD5 and SHA-1 produce 128+ bits and should be avoided whenever possible anyway).

We can add the fs.HashFileInfo interface even if embedded files don't implement it at first.

The core problem this proposal attempts to address is "automatic serving of ETag headers for static content, using content hashes" (as stated in #43223). I would prefer delaying this only if a clear and timed implementation strategy is worked out.

@neild
Copy link
Contributor

neild commented May 10, 2024

Or do you mean that the caller can recompute notsha256("goembed1"+content) and check the resulting hash against the ETag? (in this case the leak is more or less already present since go:embed files have not modtime when served).

Many files served by many web servers don't have a modification time. However, only files served by Go webservers using //go:embed files would have an ETag matching notsha256("goembed1"+content).

I don't know that this matters, but it would be an informational leak.

The archive/zip package currently exposes the CRC32 hash in a specific way. I see no harm into exposing it as part of the interface.

The proposed HashFileInfo documentation states that the hashes provided "uniquely identify" the file contents. While obviously no hash is a literal unique identifier, a 32-bit hash is short enough to be highly susceptible to birthday paradox collisions even outside adversarial scenarios, and cannot serve as a (pseudo-)unique identifier.

@rsc
Copy link
Contributor

rsc commented May 15, 2024

I hadn't seen the CL yet that drops the algorithm: from the Etag. That sounds like a good suggestion, and it eliminates the concerns about leaking that the hash is goembed1. That works for me.

@oliverpool
Copy link
Author

Is there anything that can be done to make some progress on this issue?

Is the current solution good enough? #60940 (comment)

  • http: the algorithm name is not part of the ETag anymore
  • http: only hashes with a length between 128 and 264 bits are considered suitable
  • embed: the algorithm name is made to change on every release: go1.22.2/cmd/internal/notsha256[:16] (until the proper sha256 hash is exposed - reminder: the name is ignored by the http handler)

@neild
Copy link
Contributor

neild commented Jun 4, 2024

It seems that cmd/internal/codesign is using notsha256 and inverting the result to produce an actually-sha256 hash.

This is disgusting, but it's precedent. Should we do the same for embedded file hashes? (We'd also need to increase the size of the stored hash, since it's only 128 bits right now.)

@oliverpool
Copy link
Author

Should we do the same for embedded file hashes?

Who can take the responsibility to answer this question?

I would really appreciate if we could move this issue forward.

@oliverpool
Copy link
Author

Friendly ping: should we rely on the fact that notsha256 is an inverted sha256, like cmd/internal/codesign already does?

@daddz
Copy link

daddz commented Nov 7, 2024

I'll add another friendly ping, this would be a really great feature to have, thanks @oliverpool for driving it this far!

@neild
Copy link
Contributor

neild commented Feb 14, 2025

I'm going to try to summarize where we are on this.

The original proposal

We have an accepted proposal: #60940 (comment)

The proposal is as follows:

Add the following interface to io/fs, defining a way for a filesystem to provide a precomputed hash or hashes for files:

package fs

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

The net/http package will use this interface to provide an Etag header containing h.Algorithm + base64(h.Sum).

The embed package will implement HashFileInfo, returning a single hash with Algorithm "sha256".

Problems with embedded file hashes

Implementation of this proposal (https://go.dev/cl/562875) encountered difficulty providing a sha256 hash for embedded files, since embedded files contain only a 16-byte hash consisting of a a truncated-and-modified sha256. We had some discussion over what name to use for this hash (since the Hash struct requires an algorithm name).

We made some changes (#60940 (comment)) to the proposal in reaction to this situation, notably dropping the algorithm name from the Etag header.

In the time since the last update to this proposal, there have been changes in the compiler's handling of hashes. Previously, it was using the "cmd/internal/notsha256" package to compute hashes. This notsha256 package was created to avoid problems with building with the boringcrypto experiment. The notsha256 package was removed in https://go.dev/cl/610599, and the compiler now uses a lightly-greased sha256 for hashing. (See cmd/internal/hash.) Embedded files still contain a truncated-and-modified sha256.

The notable change here is that for obscure reasons involving boringcrypto the compiler used to need to avoid crypto/sha256, but those reasons no longer apply and it's now happily using it.

Questions about hash length

There was much debate about whether archive/zip should implement HashFileInfo, returning the CRC-32 hash contained in zip files.

The argument in favor, as I understand it, is that CRC-32 is a hash, and therefore we should return it.

The argument against is that a 32-bit hash is far too small to avoid accidental collisions.

Questions about the API

The HashFileInfo interface is either unsafe, or a garbage factory.

The Hash method returns a []Hash, and Hash contains a []byte. If the implementer of HashFileInfo returns a new []Hash and []byte on each call, it requires at least two allocations. If it returns the same []Hash on each call, the caller can accidentally modify it and change the hash.

I suggested an alternate interface that avoids allocations:

type HashFileInfo interface {
  FileInfo

  // Hash returns a hash of the content of the file.
  //
  // If algorithm is "", Hash returns a hash using an undefined algorighm.
  // Otherwise, Hash returns a hash using the named algorithm if available.
  // Hash returns "" if no hash is available.
  //
  // Hash does not compute any hash of the file during the call.
  // That is, it runs in time O(1), not O(length of file).
  Hash(algorithm Algorithm) string
}

Proposed path forward

I propose that:

  1. We stick with the original HashFileInfo interface, even if it does generate some garbage.
  2. We modify embed files to contain a sha-256 hash, rather than the 128-bit hash they contain today.
  3. The embed package implements HashFileInfo, returning a sha-256 hash.
  4. We make a general commitment to not implement HashFileInfo in the standard library for hashes smaller than 128 bits. The archive/zip package does not implement HashFileInfo.
  5. The net/http package serves Etag headers using the hash when available. It uses the base64-encoded hash as the tag value. It does not include the algorithm.

Proposal committee: Does this sound reasonable? In particular, should we proceed with the original HashFileInfo or attempt to avoid allocations on retrieving hash values? (This is one part that's going to be very hard to change in the future, so I'd like to get it right.)

@earthboundkid
Copy link
Contributor

The garbage factory problem is concerning because http.FileServerFS can't assume that an fs.FS is immutable, so it will need to look up the hash for every request. What if we tweak it to

package fs

type HashFileInfo interface {
    FileInfo
    Hash() iter.Seq[Hash]
}

type Hash struct {
    Algorithm string
    Base64 string
}

Many users will end up wanting the Sum in base64 anyway, so it saves a step.

@ianlancetaylor
Copy link
Contributor

@earthboundkid In @neild 's suggestion, the Hash method will return a precomputed string for any given file. While it's true that net/http will have to call the Hash method each time, no garbage will be generated.

@eudore
Copy link

eudore commented Feb 15, 2025

I believe net/http/fs.go is not directly related to the new io/fs.Hash interface, so http.ServeContent should not auto add an ETag. By default, it should rely on the Last-Modified header for 304 caching. If users need a custom ETag, they should set it manually.

In HTTP, 304 caching is validated based on time or hash values, using the commonly used headers Last-Modified and ETag.
For most static files, Last-Modified alone is sufficient to enable 304 caching, and ETag is not necessary.
For embed files, since the default Last-Modified value is zero, it can be set to the app startup time and sync across multiple deploy using an env variable.

If ETag is required, the hash can be computed and set before return the response via net/http:

func setEtag(w http.ResponseWriter, r *http.Request) {
	path := "..."

	file, err := os.Open(path)
	if err != nil {
		return err
	}
	defer file.Close()

	stat, _ := file.Stat()
	w.Header().Set("Content-Type", "...") // If not set, http.ServeContent will auto-detect
	w.Header().Set("Etag", computeEtagFromStat(stat)) // Pass ETag to http.ServeContent
	http.ServeContent(w, r, stat.Name(), stat.ModTime(), file)
}

In scenarios such as Nginx static file serving, ETag can be computed and used for caching validation. However, as a basic library, fs.go does not necessarily need to provide ETag.
I use http.ServeContent instead of writing directly with io.Copy(w, file), primarily to take advantage of its Content-Type detect, 304 handling, and Range response capabilities. The absence of ETag does not affect 304 caching.

If fs.go adds ETag calculation, which function should it be placed in: http.ServeFile or http.ServeContent.

@earthboundkid
Copy link
Contributor

@earthboundkid In @neild 's suggestion, the Hash method will return a precomputed string for any given file. While it's true that net/http will have to call the Hash method each time, no garbage will be generated.

I was talking about the accepted interface:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

Which as Neil said is, "either unsafe or a garbage factory" even if the Sum is precomputed.

My concern about the Hash(algorithm Algorithm) string interface is you can't enumerate what's available, but maybe that doesn't matter in practice, since you either want a particular exact algorithm or you don't care and will take whatever is available.

@neild
Copy link
Contributor

neild commented Feb 16, 2025

@eudore The entire point of the original proposal is to automatically serve Etag headers for cases where we have a precomputed hash, in particular when serving from an embedded filesystem.

@neild
Copy link
Contributor

neild commented Feb 16, 2025

My concern about the Hash(algorithm Algorithm) string interface is you can't enumerate what's available, but maybe that doesn't matter in practice, since you either want a particular exact algorithm or you don't care and will take whatever is available.

My hypothesis is that this doesn't matter, for the reason you give: Either you don't care what the hash is and will take whatever is available, or you are looking for a specific hash. Etag serving doesn't care what the hash is, for example.

But maybe there's a case I'm not thinking of that really would want enumeration.

net/http is a bit of a garbage factory in general, so a couple of small allocations per file served probably doesn't really matter much, in which case the interface as proposed is fine.

@oliverpool
Copy link
Author

Either you don't care what the hash is and will take whatever is available, or you are looking for a specific hash.

I would like to point out that https://go.dev/cl/562875 does neither: it iterates over the available hashes and takes the first one suitable (with a length between 128 and 264 bits currently).

I believe such logic to not be uncommon. For instance, with SRI you also want to walk though the algorithms to find a suitable one (currently the allowed prefixes are sha256, sha384, and sha512 - arguably in this case requesting 3 hashes could be an acceptable tradeoff).

a couple of small allocations per file served probably doesn't really matter much

Especially if it allows skipping sending a complete file (which is the very goal of the net/http change: reply with an empty 304 response instead of re-serving the whole file).


Thank you @neild for your persistance! I agree with the path forward you proposed (my comments above do not oppose to it).

Feel free to let me know when this "proposed path forward" is considered accepted, so that I can rework my CL.

@oliverpool
Copy link
Author

Can I go with the path forward proposed by @neild in #60940 (comment) ?

Reading the comments since then, I don't see any disagreement or new concerns. The proposal is not perfect, but I believe that it is currently the best solution (for a 4 year old problem...).

Adding this should immediately reduce bandwidth usage of all servers serving embedded files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests