EUC-KR encoding/decoding support #62

Closed
r12a opened this issue Jun 20, 2016 · 10 comments
@r12a
Collaborator

r12a commented Jun 20, 2016

Results for a series of tests for EUC-KR encoding/decoding can be found at
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#euckr

The tests can be run from that page (select the link in the left-most column) or obtained from the WPT repo. There is a PR at
web-platform-tests/wpt#3201

The tests check whether:

  1. the browser produces the expected byte sequences for all characters in the euc-kr encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
  2. the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the euc-kr encoding. (tests for two ranges)
  3. same two types of test when writing characters to an href value
  4. the browser decodes all characters as expected from a file generated by encoding all pointers in the euc-kr encoding per the encoder steps in the specification.
  5. the browser decodes characters that are not recognised from the euc-kr encoding as replacement characters.
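As a rough illustration of what checks 1, 2, 4 and 5 look for, here is a minimal Python sketch. It is not the test code itself. It assumes Python's `cp949` codec is a close enough stand-in for the Encoding Standard's euc-kr, and that `urllib.parse.quote_plus` approximates form serialization (its set of unescaped characters differs slightly from the URL spec). The helper names `form_escape` and `decode_bytes` are hypothetical.

```python
from urllib.parse import quote_plus

# Assumption: "cp949" approximates the Encoding Standard's euc-kr.
ENCODING = "cp949"


def form_escape(text: str) -> str:
    """Percent-encode text roughly as a form-submitted query value."""
    # Characters missing from the encoding become decimal references
    # (&#N;) first; the reference itself is then percent-escaped.
    return quote_plus(text, encoding=ENCODING, errors="xmlcharrefreplace")


def decode_bytes(data: bytes) -> str:
    """Decode bytes, turning unrecognised sequences into U+FFFD."""
    return data.decode(ENCODING, errors="replace")


print(form_escape("가"))            # U+AC00, present in the encoding: %B0%A1
print(form_escape("ก"))            # U+0E01, not present: %26%233585%3B
print(decode_bytes(b"\xb0\xa1"))   # round-trips back to 가
```

A byte such as 0xFF is not a valid lead byte in this codec, so `decode_bytes` yields a replacement character for it, which is the behaviour check 5 looks for.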

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

[Screenshot: results table, 2016-06-20]

Notes:

  • Edge fails all href encode tests because characters are not converted to percent-escapes in the href attribute.
  • Firefox fails all href encode tests for characters not in the encoding because it converts characters to percent-escaped Unicode values instead.
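To make the Firefox note concrete, the sketch below contrasts the two outputs for a character outside the encoding. It assumes "percent-escaped Unicode values" means percent-escaped UTF-8 bytes, which the note does not state explicitly. It also assumes Python's `cp949` codec stands in for euc-kr. Both function names are hypothetical.

```python
from urllib.parse import quote


def href_escape_expected(ch: str, encoding: str = "cp949") -> str:
    """What the tests expect: a percent-escaped decimal reference."""
    return quote(ch, safe="", encoding=encoding, errors="xmlcharrefreplace")


def href_escape_utf8(ch: str) -> str:
    """Assumed reading of the Firefox result: percent-escaped UTF-8."""
    return quote(ch, safe="", encoding="utf-8")


thai_ko_kai = "\u0e01"  # not part of euc-kr
print(href_escape_expected(thai_ko_kai))  # %26%233585%3B
print(href_escape_utf8(thai_ko_kai))      # %E0%B8%81
```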

Can we please investigate the failures to ascertain whether:

  1. the browser needs to be changed
  2. the spec needs to be changed
  3. the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

@jungshik

https://bugs.chromium.org/p/chromium/issues/detail?id=626396

Chrome's failure in form submission (note that Chromium passes href test 100%) with 28 (mostly Cf characters : https://goo.gl/HKf47P ) is likely to be caused by Blink's handling of those characters even before they reach the EUC-KR encoder. The encoder does not see them at all, which is why there's empty output.

@jungshik

jungshik commented Sep 16, 2016

As for Edge's behavior, Edge must be interpreting EUC-KR label strictly (that is, interpreting it as NOT being able to encode 8,822 [1] Hangul syllables that are NOT a part of the original KS X 1001 when it was KS C 5601). Edge is lenient in the decoding direction, though.

@r12a, have you tried using the label 'ks_c_5601-1987' instead? It'll be interesting to see how Edge treats that label. MS IE used that label to refer to Windows-949 (they should not!) even though KS C 5601-1987 does not have any provision to encode 8,822 Hangul syllables in the way Windows-949 encodes them.

Firefox used to have an even stricter interpretation. KS X 1001 (formerly KS C 5601) has a provision to encode 8,822 Hangul syllables with 8-byte sequences, and Firefox used to encode them that way with EUC-KR. It does not do that anymore, I guess.

[1] 8,822 = 11,172 (# of all possible Hangul syllables in modern orthography) - 2,350 (encoded in KS X 1001).
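The arithmetic in [1] can be checked with a short Python sketch that counts how each precomposed Hangul syllable (U+AC00 to U+D7A3) encodes. It assumes CPython's bundled `euc_kr` and `cp949` codecs. To my knowledge the `euc_kr` codec still emits the 8-byte KS X 1001 sequences for syllables outside the 2,350, but that may vary by version, so the code also counts syllables that fail to encode. The helper name `encoded_lengths` is hypothetical.

```python
from collections import Counter

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables


def encoded_lengths(codec: str) -> Counter:
    """Count syllables by encoded byte length under the given codec."""
    counts = Counter()
    for cp in range(HANGUL_FIRST, HANGUL_LAST + 1):
        try:
            counts[len(chr(cp).encode(codec))] += 1
        except UnicodeEncodeError:
            # The codec has no representation for this syllable.
            counts["unencodable"] += 1
    return counts


strict = encoded_lengths("euc_kr")   # strict KS X 1001 interpretation
extended = encoded_lengths("cp949")  # Windows-949 style extension

print(HANGUL_LAST - HANGUL_FIRST + 1)  # 11172 syllables in total
print(strict[2])                       # 2350 encoded as two bytes
print(sum(strict.values()) - strict[2])  # 8822 that are not two bytes
print(extended[2])                     # 11172: all two bytes under cp949
```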

@r12a
Collaborator Author

r12a commented Sep 16, 2016

the alias label tests are here:
https://www.w3.org/International/tests/repo/results/encoding-dbl-byte-labels

At https://www.w3.org/International/tests/repo/results/encoding-dbl-byte-labels.en#euckr i tried out the ks_c_5601-1987 test, and it passed for all 17,048 characters checked, so your hypothesis may well be correct.

@jungshik

@r12a, thanks for testing. Sigh...

@annevk annevk added the tests label Nov 16, 2016
@r12a
Collaborator Author

r12a commented Jun 15, 2017

Today and yesterday i updated the results at https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#euckr for Firefox, Firefox Nightly, Chrome, and Canary. The latest summary is:

[Screenshot: updated results summary, 2017-06-15]

@hsivonen
Member

Thank you. The EUC-KR tests LGTM for merging into WPT. /cc @domenic

@domenic
Member

domenic commented Jun 15, 2017

Let's close this as web-platform-tests/wpt#6258 is ready to merge.

@domenic domenic closed this as completed Jun 15, 2017
@domenic
Member

domenic commented Jun 16, 2017

Reopening per #61 (comment)

@annevk
Member

annevk commented Oct 17, 2018

Now that Firefox passes all these tests and a year has passed, I'm happy to consider this done. A new issue would also be less noisy at this point, were one warranted.

@annevk annevk closed this as completed Oct 17, 2018