Skip to content

Add option to interpret request headers as Latin1 #18255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 14, 2020

Conversation

halter73
Copy link
Member

@halter73 halter73 commented Jan 10, 2020

TODONE: Add tests.

Fixes #17399

@halter73 halter73 requested a review from Tratcher January 10, 2020 03:57
@ghost ghost added the area-servers label Jan 10, 2020
{
_context = context;

ServerOptions = ServiceContext.ServerOptions;
HttpRequestHeaders = new HttpRequestHeaders(reuseHeaderValues: !ServerOptions.DisableStringReuse);
_useLatin1 = isHttp1 && ServerOptions.Latin1RequestHeaders;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why just HTTP1? HTTP/2 would be affected the same way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we talked about it in the meeting with @lodejard, @mjsabby and @timmydo, they said this change was for legacy clients and that requiring UTF-8 for HTTP/2 clients was fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that come up specifically? I don't recall...

Taking a quick look at the HTTP/2 and HPACK specs it doesn't appear that a character set encoding is specified for the header values - all I can see is a reference to them as octets.

I would say the server shouldn't take the transport version into account. Especially because intermediate gateways between client and server could shift the connection between HTTP/1.* and HTTP/2 --- the header values would be the same end-to-end but the http version is only the same hop-by-hop.

// 64-bit: Loop longs by default
while (input <= end - sizeof(long))
{
output[0] = (char)input[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use Bmi2.X64.ParallelBitDeposit if supported. Same for the 32-bit path.
See

if (Bmi2.X64.IsSupported)
{
// BMI2 will work regardless of the processor's endianness.
((ulong*)output)[0] = Bmi2.X64.ParallelBitDeposit((ulong)value, 0x00FF00FF_00FF00FFul);
((ulong*)output)[1] = Bmi2.X64.ParallelBitDeposit((ulong)(value >> 32), 0x00FF00FF_00FF00FFul);
}
else
{
output[0] = (char)input[0];
output[1] = (char)input[1];
output[2] = (char)input[2];
output[3] = (char)input[3];
output[4] = (char)input[4];
output[5] = (char)input[5];
output[6] = (char)input[6];
output[7] = (char)input[7];
}

do
{
var vector = Unsafe.AsRef<Vector<sbyte>>(input);
Vector.Widen(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is hot, you could add explicit paths for AVX2, SSE2, because

  • Vector operates on the widest op-set available, i.e. AVX, so some iterations need to be done sequentially instead of parallel as SSE would allow
  • Vector-methods may result in somewhat not ideal codegen.

For reference see

public static unsafe bool TryGetAsciiString(byte* input, char* output, int count)
from #17556

Except CheckBytesInAsciiRange both methods are similar. So they could be combined to one, and with some tricks let the JIT generate the proper code for both flavors.
I could do this as part of my PR, if this one gets merged first (also add the BMI-path, comment above).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't expect this method to be hot. This is mostly here so Bing can accept non-UTF8/ASCII encoded headers from legacy clients. This is opt-in via an undocumented IConfiguration flag. Even when opted-in, this method will only be hit to decode non-ASCII request headers.

I'll look into merging #17556 to master soon. I'll convert GetLatin1String to use your TryGetAsciiString optimizations when I do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is if we don't just end up using Encoding.Latin1 in .NET 5 which we very well might.

@gfoidl
Copy link
Member

gfoidl commented Jan 10, 2020

From #17399 (comment)

  • header-value octets 0x00-0x1F and 0x7F are unprintable USASCII and not expected to map to unicode char in any request header value strings

    • some unprintable ASCII is already consumed as http protocol delimiters

    • other unprintable ASCII will cause the request to be rejected as malformed

For the latter point I don't see any check in this PR. Will this case be handled at other parts in the stack? Should it be handled early in StringUtilities.GetLatin1String?

{
resultString = HeaderValueEncoding.GetString(buffer, span.Length);
StringUtilities.GetLatin1String(buffer, output, span.Length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should take this opportunity to refactor this code not to create a string instance and mutate its contents (see line 145 above).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get to that in 5.0. We'll want to limit changes for the 3.x patch.

@@ -115,6 +115,93 @@ public static unsafe bool TryGetAsciiString(byte* input, char* output, int count
return isValid;
}

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public static unsafe void GetLatin1String(byte* input, char* output, int count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'd rather this logic not be in aspnetcore. If this is pri-0 for you, let the API review team know that having a Encoding.Latin1 static property (see proposal at https://github.com/dotnet/corefx/issues/42686) is a necessity for you, and keep the optimized logic wholly within the runtime itself.

@GrabYourPitchforks
Copy link
Member

Ah, didn't realize this was the 3.x branch. Ignore my comments about waiting for runtime changes then. :)

@halter73
Copy link
Member Author

For the latter point I don't see any check in this PR. Will this case be handled at other parts in the stack? Should it be handled early in StringUtilities.GetLatin1String?

The HttpParser checks there are no carriage returns, and GetAsciiOrUTF8StringNonNullCharacters already checks there are no nulls. The previous version of GetAsciiOrUTF8StringNonNullCharacters did not reject other unprintable ASCII characters, and I decided not to add that extra validation as part of this patch.

if (IntPtr.Size == 8) // Use Intrinsic switch for branch elimination
{
// 64-bit: Loop longs by default
while (input <= end - sizeof(long))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clause should be end - input >= sizeof(long).

do
{
// If Vector not-accelerated or remaining less than vector size
if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clause should be end - input <= Vector<sbyte>.Count.

input += sizeof(long);
output += sizeof(long);
}
if (input <= end - sizeof(int))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end - input >= sizeof(int)

}
else
{
// 32-bit: Loop ints by default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for this branch. Just do loops of stride 8 on all platforms, and ignore the IntPtr.Size value.

output += sizeof(int);
}
}
if (input <= end - sizeof(short))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end - input >= sizeof(short)

// do/while as entry condition already checked
do
{
var vector = Unsafe.AsRef<Vector<sbyte>>(input);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe.ReadUnaligned, not Unsafe.AsRef.

var vector = Unsafe.AsRef<Vector<sbyte>>(input);
Vector.Widen(
vector,
out Unsafe.AsRef<Vector<short>>(output),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out these results to a local, then Unsafe.WriteUnaligned the local to the destination address.


input += Vector<sbyte>.Count;
output += Vector<sbyte>.Count;
} while (input <= end - Vector<sbyte>.Count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end - input >= Vector<sbyte>.Count

@GrabYourPitchforks
Copy link
Member

FWIW, my comments about using the pattern end - start >= length rather than start <= end - length is that it future-proofs the method against inputs like start = null, count = 0 being passed in the future. I realize that this isn't possible with the current PR, but utility methods like this have a tendency to be copied to other libraries or to gain more callers in the future, and those callers might not know about the implied "input cannot be null" assumption within this method. Rewriting the bounds checks as suggested here will future-proof the code against this possibility so that we don't inadvertently cause a buffer overrun in the future.

@halter73
Copy link
Member Author

@GrabYourPitchforks GetLatin1String is a copy of TryGetAsciiString. I plan on leaving it as-is in this patch assuming there's nothing actually broken with this implementation as it's being used today.

@gfoidl was looking at optimizing TryGetAsciiString with #17556. I was planning on merging #17556 into master soon and then changing the GetLatin1String implementation to match. Can you review #17556? Thanks.

@GrabYourPitchforks
Copy link
Member

@halter73 Then at the very least Debug.Assert at the start of the method that neither input is null. This will make clear that the method isn't resilient to this case.

@lodejard
Copy link
Contributor

@GrabYourPitchforks GetLatin1String is a copy of TryGetAsciiString. I plan on leaving it as-is in this patch assuming there's nothing actually broken with this implementation as it's being used today.

I like that approach, having latin1 as a parity copy of ascii and optimizing them together or becoming direct calls to core API when appropriate.

Adding to that - for partity it would be better to have TryGetLatin1String with its own copy of CheckBytesInLatin1Range -- it will still need isValid to be false for 0x00-0x1f.

If TryGetLatin1String only was called by a value.GetLatin1StringNonNullCharacters() (when useLatin1 is true) then there is no fallback logic needed at all and it will have the same performance as the ascii+utf8 case.

It would also not change the existing ascii+utf8 extension method's structure at all, which could be a small plus because that's the code path used by all existing customers.

@halter73
Copy link
Member Author

@lodejard I added GetLatin1StringNonNullCharacters like you suggested.

Adding to that - for partity it would be better to have TryGetLatin1String with its own copy of CheckBytesInLatin1Range -- it will still need isValid to be false for 0x00-0x1f.

TryGetLatin1String now calls into CheckBytesNotNull. CheckBytesNotNull doesn't verify there are no bytes in the 0x01-0x1f range like CheckBytesInLatin1Range would, because TryGetAsciiString also doesn't verify it.

HttpParser.TakeHeader rejects carriage returns inside a header, but that's it. While I agree it would be better to reject these characters, I don't want to start rejecting requests that would have been accepted in the default ASCII fast path as part of a patch.

@halter73 halter73 marked this pull request as ready for review January 11, 2020 01:57
@lodejard
Copy link
Contributor

@halter73 That makes sense - no reason to put them out of sync for that. Thanks!

Did you see the other remark about http1? I forgot to tag you so not sure if it was missed...

@halter73
Copy link
Member Author

Did you see the other remark about http1? I forgot to tag you so not sure if it was missed...

I saw that. I now support latin1 encoding for http/2 request headers as well. I added tests for both HTTP/1 and HTTP/2.

new KeyValuePair<string, string>(HeaderNames.Method, "GET"),
new KeyValuePair<string, string>(HeaderNames.Path, "/"),
new KeyValuePair<string, string>(HeaderNames.Scheme, "http"),
// The HPackEncoder will encode £ as 0xA3 aka Latin1 encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting that the encoder didn't need to be modified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I though I was going to have to use a custom HPackEncoder, but it turns out that it's just casting chars to bytes. The response header's are normally validated by Kestrel's HttpResponseHeaders dictionary.

@jkotalik
Copy link
Contributor

@lodejard I chatted with @halter73 and I asked what can we do to let you dogfood this change to make sure it meets your expectations? Can you easily try a nightly build in your current setup?

@analogrelay
Copy link
Contributor

TODO: Add tests.

TODONE? Or still TODO?

@halter73
Copy link
Member Author

TODONE

@@ -115,6 +117,111 @@ public static unsafe bool TryGetAsciiString(byte* input, char* output, int count
return isValid;
}

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public static unsafe bool TryGetLatin1String(byte* input, char* output, int count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using unsafe code here instead of Span and Span? To match TryGetAsciiString?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'm trying to keep this as close to TryGetAsciiString as possible since it's conceptually quite similar.

@analogrelay analogrelay added the Servicing-consider Shiproom approval is required for the issue label Jan 15, 2020
@analogrelay analogrelay added this to the 3.1.x milestone Jan 15, 2020
@leecow leecow added Servicing-approved Shiproom has approved the issue and removed Servicing-consider Shiproom approval is required for the issue labels Feb 4, 2020
@leecow leecow modified the milestones: 3.1.x, 3.1.3 Feb 4, 2020
@leecow
Copy link
Member

leecow commented Feb 4, 2020

[Tactics] - ensure Bing validates.

@ghost
Copy link

ghost commented Feb 6, 2020

Hello @halter73!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost
Copy link

ghost commented Feb 6, 2020

Apologies, while this PR appears ready to be merged, it looks like release/3.1 is a protected branch and I have not been granted permission to perform the merge.

@halter73
Copy link
Member Author

halter73 commented Feb 6, 2020

Bing confirmed this change works for them. Can you please merge this @anurse?

@analogrelay
Copy link
Contributor

Branches aren't open for March, so no :). I'll do so as soon as they open though!

@analogrelay
Copy link
Contributor

Slightly nervous about the passing build being a month ago, but I'll merge it ;)

@analogrelay analogrelay merged commit 7fa6d19 into release/3.1 Feb 14, 2020
@analogrelay analogrelay deleted the halter73/17399-3.1 branch February 14, 2020 17:32
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions feature-kestrel Servicing-approved Shiproom has approved the issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants