ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth #106
Unicode Normalization Form C by Mark Pilgrim:
Source: https://htmlpreview.github.io/?https://raw.githubusercontent.com/dpk/diveintomark/master/archive-html/0f4e974c28574ab9eb0b895d2f5cef97c2f5163a.html. Even has a comment from Henri!
Code inspection shows we need to start at U+FF61: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597. Fix coming up.
9702b0e to 44decb1
This doesn't match Gecko and Blink for at least U+FF9F and U+FF9E. Those normalize to U+309A and U+3099, respectively (at least according to the Unicode.org PDF code tables and Python's implementation). I'd prefer to have a new index for this stuff in the JSON file instead of relying on the definition of normalization by reference, when Python's implementation, at least, doesn't yield the expected results.
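For anyone who wants to reproduce the Python side of that observation, here is a minimal sketch using the standard library's `unicodedata.normalize`. The variable names are only illustrative, and what Gecko and Blink actually emit is not checked here.

```python
import unicodedata

# NFKC of the two halfwidth sound marks, one code point at a time.
voiced = unicodedata.normalize("NFKC", "\uFF9E")       # HALFWIDTH KATAKANA VOICED SOUND MARK
semi_voiced = unicodedata.normalize("NFKC", "\uFF9F")  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

for source, result in (("U+FF9E", voiced), ("U+FF9F", semi_voiced)):
    # Both results are combining marks, not the spacing marks U+309B / U+309C.
    print(source, "->", " ".join(f"U+{ord(ch):04X}" for ch in result))
```

Both results are combining characters (U+3099 and U+309A), which is the mismatch described above.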
Yeah, it does seem like for those two code points there is a mismatch. We could also special case those two code points, but I won't object to a new index, I suppose.
Assuming you're not okay with special casing the code points, shall we call this "index ISO-2022-JP halfwidth Katakana"? I guess I'll explain in the note about the index that it's effectively the NFKC transformation of the input code points, with the exception of those two code points you mention and what those end up mapping to. That way it can also be implemented in that manner.
I'd very much prefer a new index over special case plus reference to NFKC, because the latter risks inheriting bugs from whatever NFKC implementation the implementor uses to generate the table. An index where the pointer is the halfwidth code point minus 0xFF61 and the value is the code point to be looked up in index-jis0208 would be fine. That is, I think the new index doesn't need to flatten the logically subsequent index-jis0208 pointer lookup.
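To make the proposed shape concrete, here is a rough sketch of such an index in Python. It is not the patch itself. The names `FIRST`, `LAST`, `OVERRIDES`, `build_index`, and `to_fullwidth` are hypothetical, the table is generated from NFKC purely for illustration (a real index would be a reviewed static list, for exactly the reason given above), and the two override values are assumed to be the spacing marks U+309B and U+309C mentioned elsewhere in this thread.

```python
import unicodedata

FIRST = 0xFF61  # first halfwidth code point covered by the index
LAST = 0xFF9F   # last halfwidth code point covered by the index

# Assumption: the two sound marks map to spacing marks rather than
# to the combining marks that NFKC yields.
OVERRIDES = {0xFF9E: 0x309B, 0xFF9F: 0x309C}


def build_index():
    """Return a list where position (code point - FIRST) holds the fullwidth code point."""
    index = []
    for cp in range(FIRST, LAST + 1):
        if cp in OVERRIDES:
            index.append(OVERRIDES[cp])
            continue
        nfkc = unicodedata.normalize("NFKC", chr(cp))
        assert len(nfkc) == 1  # every entry in this range maps to a single code point
        index.append(ord(nfkc))
    return index


HALFWIDTH_INDEX = build_index()


def to_fullwidth(cp):
    """Map one halfwidth code point via pointer = cp - 0xFF61; pass others through."""
    if FIRST <= cp <= LAST:
        return HALFWIDTH_INDEX[cp - FIRST]
    return cp
```

The value returned by `to_fullwidth` would then be looked up in index-jis0208 as a separate step, so this table does not flatten that second lookup.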
Ping @jungshik - I'm going to assume we'd prefer a new index, at the very least for ease of testing. (Also, Mark Pilgrim appreciated the shout-out.)
Can the index deal with composing? For example, will U+FF76 U+FF9E ("ガ") be converted to U+30AC ("ガ") instead of U+30AB U+309B ("カ゛")? At least Gecko will convert it to U+30AC. I don't know about Blink or Edge. This is probably the reason why NFKC uses combining characters even though no Japanese legacy encodings have them. But if U+3099 or U+309A is left after conversion, it should be converted to U+309B or U+309C, respectively.
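The difference between composing and not composing can be seen with Python's `unicodedata.normalize`. This is only an illustration of NFKC applied to the whole sequence versus one code point at a time, with made-up variable names; it does not model any browser's encoder.

```python
import unicodedata

pair = "\uFF76\uFF9E"  # halfwidth KA followed by the halfwidth voiced sound mark

# NFKC over the whole string: decomposes, then recomposes into one code point.
whole = unicodedata.normalize("NFKC", pair)

# NFKC applied to each code point separately: nothing gets composed.
per_code_point = "".join(unicodedata.normalize("NFKC", ch) for ch in pair)

print([f"U+{ord(ch):04X}" for ch in whole])
print([f"U+{ord(ch):04X}" for ch in per_code_point])
```

The first result is the single code point U+30AC; the second is U+30AB followed by the combining mark U+3099, which a per-code-point lookup cannot merge.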
An index would not be able to do that. Neither Chromium nor WebKit has such behavior, due to ICU lacking it: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597 (contrary to what I initially thought, this code is not actually used by Firefox). Edge does not have this behavior either, which I checked with https://dump.testsuite.org/encoding/iso-2022-jp/encode.html, as Henri's tool doesn't work in Edge. The code for Firefox is at http://searchfox.org/mozilla-central/source/intl/uconv/ucvja/nsUnicodeToISO2022JP.cpp. Neither code references NFKC (and the way I used NFKC wouldn't do the combining either). I would suggest we don't keep the combining behavior given these results.
I think we shouldn't make the encoder consider more than one code point to decide what to output, because:
@hsivonen I agree with (most of) your points. I wouldn't care much about what we do with ISO-2022-JP (it's retained among the pesky ISO-2022-derived legacy encodings only because it is 'relatively' widespread, but that does not change the fact that it's extremely rare these days, and calling it a fringe encoding is justified). That encoding (along with other encodings) should not be used any more, as everybody here agrees. Reinterpreting their behavior or improving it in light of Unicode is not worth much effort. @inexorabletash I'd argue for using an index.
Note that this PR's patch uses the index approach now. Review appreciated.
Hmm, reading https://en.wikipedia.org/wiki/Katakana suggests I should lowercase katakana when it's not the first word.
What is "lowercase katakana"? I searched the page for "lower" and "first", but I couldn't find such a statement. And Katakana is a caseless script AFAIK.
I meant an editorial (non-normative) change from "Katakana" to "katakana". Everything normative in the PR is how I think it should be.
LGTM. Thank you.
IE and Edge don't have the combining behavior in URL query but do have it in form submission.
Fixes #105.