
ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth #106


Merged
annevk merged 4 commits into master from annevk/iso-2022-jp-halfwidth-katakana
May 8, 2017

Conversation

annevk
Member

@annevk annevk commented May 5, 2017

Fixes #105.


Preview | Diff

@annevk
Member Author

annevk commented May 5, 2017

Unicode Normalization Form C by Mark Pilgrim:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. So I ran over and said, “Stop! Don’t do it!”

“I can’t help it,” he cried. “I’ve lost my will to live.”

“What do you do for a living?” I asked.

He said, “I create web services specifications.”

“Me too!” I said. “Do you use REST web services or SOAP web services?”

He said, “REST web services.”

“Me too!” I said. “Do you use text-based XML or binary XML?”

He said, “Text-based XML.”

“Me too!” I said. “Do you use XML 1.0 or XML 1.1?”

He said, “XML 1.0.”

“Me too!” I said. “Do you use UTF-8 or UTF-16?”

He said, “UTF-8.”

“Me too!” I said. “Do you use Unicode Normalization Form C or Unicode Normalization Form KC?”

He said, “Unicode Normalization Form KC.”

“Die, heretic scum!” I shouted, and I pushed him over the edge.

(with apologies to Emo Philips)

@annevk
Member Author

annevk commented May 5, 2017

@annevk
Member Author

annevk commented May 5, 2017

Code inspection shows we need to start at U+FF61: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597. Fix coming up.

@annevk annevk force-pushed the annevk/iso-2022-jp-halfwidth-katakana branch from 9702b0e to 44decb1 Compare May 5, 2017 13:08
@annevk annevk requested a review from hsivonen May 5, 2017 13:22
@hsivonen
Member

hsivonen commented May 5, 2017

This doesn't match Gecko and Blink for at least U+FF9F and U+FF9E. Those normalize to U+309A and U+3099 (at least according to the Unicode.org PDF code tables and Python's unicodedata), but neither of those is in JIS X 0208. The non-combining variants are in JIS X 0208 and are used by Gecko and Blink: U+309C and U+309B.

I'd prefer to have a new index for this stuff in the JSON file instead of relying on the definition of normalization by reference, when Python's impl at least doesn't yield the expected results.
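
The mismatch described above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `nfkc_code_points` is hypothetical and only exists for this illustration.

```python
import unicodedata


def nfkc_code_points(text):
    """Return the NFKC form of `text` as a list of U+XXXX labels."""
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# Halfwidth voiced sound mark: NFKC yields the combining mark U+3099,
# not the spacing mark U+309B that JIS X 0208 contains.
print(nfkc_code_points("\uFF9E"))  # ['U+3099']

# Halfwidth semi-voiced sound mark: NFKC yields combining U+309A,
# not the spacing mark U+309C.
print(nfkc_code_points("\uFF9F"))  # ['U+309A']
```

Both results are combining characters, which is why a table derived purely from NFKC would not match what Gecko and Blink emit for these two code points.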

@annevk
Member Author

annevk commented May 5, 2017

Yeah, it does seem like for those two code points there is a mismatch. We could also special case those two code points, but I won't object to a new index I suppose.

@annevk
Member Author

annevk commented May 5, 2017

Assuming you're not okay with special casing the code points, shall we call this "index ISO-2022-JP halfwidth Katakana"? I guess I'll explain in the note about the index that it's effectively the NFKC transformation of the input code points, with the exception of those two code points you mention and what those end up mapping to. That way it can also be implemented in that manner.
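
The "NFKC with two exceptions" description could be sketched roughly as follows. This is an illustration only, not the generator used for the standard: the names `FIRST`, `LAST`, `OVERRIDES`, and `build_index` are hypothetical, and the range end U+FF9F is an assumption based on the two code points discussed in this thread.

```python
import unicodedata

FIRST = 0xFF61  # first halfwidth code point covered (per the thread)
LAST = 0xFF9F   # assumed last code point: halfwidth semi-voiced sound mark

# The two code points whose NFKC form is a combining character that
# JIS X 0208 lacks; they map to the spacing variants instead.
OVERRIDES = {0xFF9E: 0x309B, 0xFF9F: 0x309C}


def build_index():
    """Build a list where position = code point - FIRST."""
    index = []
    for code_point in range(FIRST, LAST + 1):
        if code_point in OVERRIDES:
            index.append(OVERRIDES[code_point])
            continue
        normalized = unicodedata.normalize("NFKC", chr(code_point))
        # Each of these code points normalizes to exactly one code point.
        assert len(normalized) == 1
        index.append(ord(normalized))
    return index


index = build_index()
print(len(index))                        # 63
print(hex(index[0xFF76 - FIRST]))        # 0x30ab
```

A table generated this way depends on the NFKC implementation used, which is the risk raised in the reply that follows.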

@hsivonen
Member

hsivonen commented May 5, 2017

I'd very much prefer a new index over special case plus reference to NFKC, because the latter risks inheriting bugs from whatever NFKC implementation the implementor uses to generate the table.

An index where the pointer is half-width code point minus 0xFF61 and the value is the code point to be looked up in index-jis0208 would be fine. That is, I think the new index doesn't need to flatten the logically subsequent index-jis0208 pointer lookup.
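
A literal table with that shape might look like the sketch below. The names `FIRST`, `TABLE`, and `to_fullwidth` are hypothetical; the values are the fullwidth counterparts of U+FF61 through U+FF9F, with the two sound marks mapped to the spacing variants as discussed above. The subsequent index-jis0208 lookup is deliberately left out.

```python
FIRST = 0xFF61  # pointer 0 corresponds to this code point

# Pointer = halfwidth code point - FIRST; value = fullwidth code point.
TABLE = [
    0x3002, 0x300C, 0x300D, 0x3001, 0x30FB, 0x30F2, 0x30A1, 0x30A3,
    0x30A5, 0x30A7, 0x30A9, 0x30E3, 0x30E5, 0x30E7, 0x30C3, 0x30FC,
    0x30A2, 0x30A4, 0x30A6, 0x30A8, 0x30AA, 0x30AB, 0x30AD, 0x30AF,
    0x30B1, 0x30B3, 0x30B5, 0x30B7, 0x30B9, 0x30BB, 0x30BD, 0x30BF,
    0x30C1, 0x30C4, 0x30C6, 0x30C8, 0x30CA, 0x30CB, 0x30CC, 0x30CD,
    0x30CE, 0x30CF, 0x30D2, 0x30D5, 0x30D8, 0x30DB, 0x30DE, 0x30DF,
    0x30E0, 0x30E1, 0x30E2, 0x30E4, 0x30E6, 0x30E8, 0x30E9, 0x30EA,
    0x30EB, 0x30EC, 0x30ED, 0x30EF, 0x30F3, 0x309B, 0x309C,
]


def to_fullwidth(code_point):
    """Map a halfwidth code point through TABLE; pass others through.

    The caller would then look the result up in index-jis0208.
    """
    pointer = code_point - FIRST
    if 0 <= pointer < len(TABLE):
        return TABLE[pointer]
    return code_point


print(hex(to_fullwidth(0xFF76)))  # 0x30ab
print(hex(to_fullwidth(0xFF9E)))  # 0x309b
```

Keeping the table literal means its contents can be reviewed and tested directly, without depending on any particular normalization library.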

@inexorabletash
Member

Ping @jungshik - I'm going to assume we'd prefer new index, at the very least for ease of testing.

(Also, Mark Pilgrim appreciated the shout-out.)

@vyv03354
Collaborator

vyv03354 commented May 6, 2017

Can the index deal with composing? For example, will U+FF76 U+FF9E ("ｶﾞ") be converted to U+30AC ("ガ") instead of U+30AB U+309B ("カ゛")? At least Gecko will convert it to U+30AC. I don't know about Blink or Edge.

Probably this is the reason why NFKC uses combining characters even though no Japanese legacy encoding has them. But if U+3099 or U+309A is left after conversion, it should be converted to U+309B or U+309C, respectively.
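
The composing approach with a fallback for leftover marks could be sketched as below. This is not Gecko's implementation; `LEFTOVER_MARKS` and `compose_halfwidth` are hypothetical names. Note that NFKC also rewrites many unrelated characters, so a real encoder would have to restrict it to halfwidth katakana input.

```python
import unicodedata

# Combining marks that JIS X 0208 lacks, mapped to their spacing variants.
LEFTOVER_MARKS = str.maketrans({"\u3099": "\u309B", "\u309A": "\u309C"})


def compose_halfwidth(text):
    """Normalize with NFKC, then replace combining marks left uncomposed."""
    composed = unicodedata.normalize("NFKC", text)
    return composed.translate(LEFTOVER_MARKS)


# U+FF76 U+FF9E composes into the single code point U+30AC.
print([hex(ord(c)) for c in compose_halfwidth("\uFF76\uFF9E")])
# ['0x30ac']

# U+FF9D has no precomposed voiced form, so the mark is left over
# and falls back to the spacing mark U+309B.
print([hex(ord(c)) for c in compose_halfwidth("\uFF9D\uFF9E")])
# ['0x30f3', '0x309b']
```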

@annevk
Member Author

annevk commented May 6, 2017

Can the index deal with composing?

An index would not be able to do that. Neither Chromium nor WebKit has such behavior, because ICU lacks it: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597 (this code is not actually used by Firefox as I initially thought).

Edge does not have this behavior either, which I checked with https://dump.testsuite.org/encoding/iso-2022-jp/encode.html as Henri's tool doesn't work in Edge.

The code for Firefox is at http://searchfox.org/mozilla-central/source/intl/uconv/ucvja/nsUnicodeToISO2022JP.cpp which has gBasicMapping taking care of the "index" (I suspect this is where I got "basics" from in the web-platform-tests regression Henri found) and ConvertHankaku that deals with the modifiers.

Neither code references NFKC (and the way I used NFKC wouldn't do the combining either).

I would suggest we don't keep the combining behavior given these results.
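
The difference between normalizing one code point at a time and normalizing the whole input can be seen in a small sketch. The function names are hypothetical, and the per-code-point variant is only an assumption about how NFKC was applied in the earlier revision of the patch.

```python
import unicodedata


def nfkc_per_code_point(text):
    """Normalize each code point on its own; nothing can compose."""
    return "".join(unicodedata.normalize("NFKC", ch) for ch in text)


def nfkc_whole_string(text):
    """Normalize the whole string; adjacent code points may compose."""
    return unicodedata.normalize("NFKC", text)


halfwidth_ga = "\uFF76\uFF9E"

print([hex(ord(c)) for c in nfkc_per_code_point(halfwidth_ga)])
# ['0x30ab', '0x3099']

print([hex(ord(c)) for c in nfkc_whole_string(halfwidth_ga)])
# ['0x30ac']
```

Only the whole-string form looks at more than one code point at a time, which is the behavior under discussion here.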

@hsivonen
Member

hsivonen commented May 6, 2017

I think we shouldn't make the encoder consider more than one code point to decide what to output, because:

  1. Chromium, WebKit and Edge don't have that behavior.
  2. ISO-2022-JP is a fringe encoding on the Web, so we shouldn't try to improve it from what most browsers do and it's better to have cross-browser consistent behavior than for Firefox to retain an arguably elegant special behavior. (I'm aware of the email situation)
  3. The encoder side of ISO-2022-JP is even more fringe on the Web, so we shouldn't try to improve it from what most browsers do and it's better to have cross-browser consistent behavior than for Firefox to retain an arguably elegant special behavior (and IMO email clients should follow Gmail's and Apple Mail's lead and always send UTF-8).
  4. Half-width Katakana is a fringe feature in today's non-terminal environments that browsers exist in. So much so that ISO-2022-JP can't represent it. It's not like many users will be inputting half-width Katakana.
  5. No other encoding considers more than one code point when deciding what to output. (Admittedly "no other encoding" isn't a strong reason, because ISO-2022-JP already has "no other encoding" behaviors that complicate the API surface for encoders.)

@jungshik

jungshik commented May 6, 2017

@hsivonen I agree with (most of) your points. I don't much care what we do with ISO-2022-JP (it's retained among the pesky ISO-2022-derived legacy encodings only because it is 'relatively' widespread, but that does not change the fact that it's extremely rare these days, and calling it a fringe encoding is justified). That encoding (along with other legacy encodings) should not be used any more, as everybody here agrees. Reinterpreting its behavior or improving it in light of Unicode is not worth much effort.

@inexorabletash I'd argue for using index.

@annevk
Member Author

annevk commented May 7, 2017

Note that this PR's patch uses the index approach now. Review appreciated.

@annevk
Member Author

annevk commented May 7, 2017

Hmm, reading https://en.wikipedia.org/wiki/Katakana suggests I should lowercase katakana when it's not the first word.

@vyv03354
Collaborator

vyv03354 commented May 7, 2017

Hmm, reading https://en.wikipedia.org/wiki/Katakana suggests I should lowercase katakana when it's not the first word.

What is "lowercase katakana"? I searched the page for "lower" and "first", but I couldn't find such a statement. And Katakana is a caseless script AFAIK.

@annevk
Member Author

annevk commented May 7, 2017

I meant an editorial (non-normative) change from "Katakana" to "katakana". Everything normative in the PR is how I think it should be.

Member

@hsivonen hsivonen left a comment


LGTM. Thank you.

@annevk annevk merged commit 5a09856 into master May 8, 2017
@annevk annevk deleted the annevk/iso-2022-jp-halfwidth-katakana branch May 8, 2017 09:00
@hsivonen
Member

Chromium, WebKit and Edge don't have that behavior.

IE and Edge don't have the combining behavior in URL query but do have it in form submission.


5 participants