ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth #106

annevk · 2017-05-05T11:10:46Z

Fixes #105.

annevk · 2017-05-05T11:17:55Z

Unicode Normalization Form C by Mark Pilgrim:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. So I ran over and said, “Stop! Don’t do it!”

“I can’t help it,” he cried. “I’ve lost my will to live.”

“What do you do for a living?” I asked.

He said, “I create web services specifications.”

“Me too!” I said. “Do you use REST web services or SOAP web services?”

He said, “REST web services.”

“Me too!” I said. “Do you use text-based XML or binary XML?”

He said, “Text-based XML.”

“Me too!” I said. “Do you use XML 1.0 or XML 1.1?”

He said, “XML 1.0.”

“Me too!” I said. “Do you use UTF-8 or UTF-16?”

He said, “UTF-8.”

“Me too!” I said. “Do you use Unicode Normalization Form C or Unicode Normalization Form KC?”

He said, “Unicode Normalization Form KC.”

“Die, heretic scum!” I shouted, and I pushed him over the edge.

(with apologies to Emo Philips)

annevk · 2017-05-05T11:26:25Z

Source: https://htmlpreview.github.io/?https://raw.githubusercontent.com/dpk/diveintomark/master/archive-html/0f4e974c28574ab9eb0b895d2f5cef97c2f5163a.html. Even has a comment from Henri!

annevk · 2017-05-05T11:42:47Z

Code inspection shows we need to start at U+FF61: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597. Fix coming up.

Fixes #105.

hsivonen · 2017-05-05T15:58:06Z

This doesn't match Gecko and Blink for at least U+FF9F and U+FF9E. Those normalize to U+309A and U+3099 (at least according to the Unicode.org PDF code tables and Python's unicodedata), but neither of those is in JIS X 0208. The non-combining variants are in JIS X 0208 and are used by Gecko and Blink: U+309C and U+309B.

I'd prefer to have a new index for this stuff in the JSON file instead of relying on the definition of normalization by reference, when Python's impl at least doesn't yield the expected results.
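
For reference, the mismatch can be reproduced with Python's `unicodedata`. This is only a minimal sketch: the helper name `nfkc_of` is made up for illustration and is not part of the patch or of any spec text.

```python
import unicodedata


def nfkc_of(code_point: int) -> str:
    """Return the NFKC form of one code point, formatted as 'U+XXXX ...'."""
    normalized = unicodedata.normalize("NFKC", chr(code_point))
    return " ".join(f"U+{ord(c):04X}" for c in normalized)


# The two halfwidth sound marks discussed above.
for cp in (0xFF9E, 0xFF9F):
    print(f"U+{cp:04X} -> {nfkc_of(cp)}")
```

With Python's data this yields the combining marks U+3099 and U+309A rather than the spacing U+309B and U+309C that Gecko and Blink use.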

annevk · 2017-05-05T17:10:44Z

Yeah, it does seem like for those two code points there is a mismatch. We could also special case those two code points, but I won't object to a new index I suppose.

annevk · 2017-05-05T18:21:45Z

Assuming you're not okay with special casing the code points, shall we call this "index ISO-2022-JP halfwidth Katakana"? I guess I'll explain in the note about the index that it's effectively the NFKC transformation of the input code points, with the exception of those two code points you mention and what those end up mapping to. That way it can also be implemented in that manner.

hsivonen · 2017-05-05T18:36:28Z

I'd very much prefer a new index over special case plus reference to NFKC, because the latter risks inheriting bugs from whatever NFKC implementation the implementor uses to generate the table.

An index where the pointer is half-width code point minus 0xFF61 and the value is the code point to be looked up in index-jis0208 would be fine. That is, I think the new index doesn't need to flatten the logically subsequent index-jis0208 pointer lookup.
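
A rough sketch of that shape of index, under some assumptions: the names `KATAKANA_INDEX`, `build_index` and `to_fullwidth` are hypothetical, the covered range is taken to be U+FF61 through U+FF9F, and the table is generated from NFKC plus the two overrides purely to keep the example short. A real index would presumably ship literal values, precisely so that it does not depend on anyone's NFKC implementation.

```python
import unicodedata

FIRST = 0xFF61  # assumed first covered code point (pointer 0)
LAST = 0xFF9F   # assumed last covered code point

# The two code points where NFKC gives a combining mark instead of the
# spacing mark that is in JIS X 0208.
OVERRIDES = {0xFF9E: 0x309B, 0xFF9F: 0x309C}


def build_index() -> list:
    """Build a list where position (code point - FIRST) holds the fullwidth value."""
    index = []
    for cp in range(FIRST, LAST + 1):
        if cp in OVERRIDES:
            index.append(OVERRIDES[cp])
        else:
            normalized = unicodedata.normalize("NFKC", chr(cp))
            assert len(normalized) == 1  # every entry maps to one code point
            index.append(ord(normalized))
    return index


KATAKANA_INDEX = build_index()


def to_fullwidth(code_point: int) -> int:
    """Map a halfwidth code point via the index; pass anything else through."""
    if FIRST <= code_point <= LAST:
        return KATAKANA_INDEX[code_point - FIRST]
    return code_point
```

The value returned by `to_fullwidth` is what would then be looked up in index-jis0208; that second lookup is deliberately left out here, matching the "no flattening" suggestion.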

inexorabletash · 2017-05-05T20:58:07Z

Ping @jungshik - I'm going to assume we'd prefer new index, at the very least for ease of testing.

(Also, Mark Pilgrim appreciated the shout-out.)

vyv03354 · 2017-05-06T01:18:45Z

Can the index deal with composing? For example, will U+FF76 U+FF9E ("ｶﾞ") be converted to U+30AC ("ガ") instead of U+30AB U+309B ("カ゛")? At least Gecko will convert it to U+30AC. I don't know about Blink or Edge.

This is probably why NFKC uses combining characters even though no Japanese legacy encoding has them. But if U+3099 or U+309A is left after conversion, it should be converted to U+309B or U+309C, respectively.
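
To make the two possible outcomes concrete, here is a small sketch comparing NFKC applied to the whole string (which composes) with a strictly per-code-point mapping (which cannot). The function names are illustrative only, and the per-code-point variant uses NFKC plus the two overrides as a stand-in for an index.

```python
import unicodedata

HALFWIDTH_GA = "\uFF76\uFF9E"  # halfwidth KA followed by halfwidth voiced sound mark


def whole_string(text: str) -> str:
    """Normalize the full string at once, so base + mark can compose."""
    return unicodedata.normalize("NFKC", text)


def per_code_point(text: str) -> str:
    """Map each code point independently, as a one-to-one index would."""
    out = []
    for ch in text:
        if ch == "\uFF9E":
            out.append("\u309B")  # spacing voiced sound mark
        elif ch == "\uFF9F":
            out.append("\u309C")  # spacing semi-voiced sound mark
        else:
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)


def as_code_points(text: str) -> str:
    """Format a string as 'U+XXXX U+XXXX ...' for display."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


print(as_code_points(whole_string(HALFWIDTH_GA)))
print(as_code_points(per_code_point(HALFWIDTH_GA)))
```

The first call gives the single precomposed U+30AC; the second gives U+30AB followed by U+309B.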

annevk · 2017-05-06T05:24:24Z

> Can the index deal with composing?

An index would not be able to do that. Neither Chromium nor WebKit have such behavior, due to ICU lacking it: http://searchfox.org/mozilla-central/source/intl/icu/source/common/ucnv2022.cpp#1597 (this code is not actually used by Firefox as I initially thought).

Edge does not have this behavior either, which I checked with https://dump.testsuite.org/encoding/iso-2022-jp/encode.html as Henri's tool doesn't work in Edge.

The code for Firefox is at http://searchfox.org/mozilla-central/source/intl/uconv/ucvja/nsUnicodeToISO2022JP.cpp which has gBasicMapping taking care of the "index" (I suspect this is where I got "basics" from in the web-platform-tests regression Henri found) and ConvertHankaku that deals with the modifiers.

Neither code references NFKC (and the way I used NFKC wouldn't do the combining either).

I would suggest we don't keep the combining behavior given these results.

hsivonen · 2017-05-06T07:58:39Z

I think we shouldn't make the encoder consider more than one code point to decide what to output, because:

Chromium, WebKit and Edge don't have that behavior.
ISO-2022-JP is a fringe encoding on the Web, so we shouldn't try to improve it from what most browsers do and it's better to have cross-browser consistent behavior than for Firefox to retain an arguably elegant special behavior. (I'm aware of the email situation)
The encoder side of ISO-2022-JP is even more fringe on the Web, so we shouldn't try to improve it from what most browsers do and it's better to have cross-browser consistent behavior than for Firefox to retain an arguably elegant special behavior (and IMO email clients should follow Gmail's and Apple Mail's lead and always send UTF-8).
Half-width Katakana is a fringe feature in today's non-terminal environments that browsers exist in. So much so that ISO-2022-JP can't represent it. It's not like many users will be inputting half-width Katakana.
No other encoding considers more than one code point when deciding what to output. (Admittedly "no other encoding" isn't a strong reason, because ISO-2022-JP already has "no other encoding" behaviors that complicate the API surface for encoders.)

jungshik · 2017-05-06T22:21:21Z

@hsivonen I agree with (most of) your points. I don't care much about what we do with ISO-2022-JP (it's retained among the pesky ISO-2022-derived legacy encodings only because it is 'relatively' widespread, but that does not change the fact that it's extremely rare these days, and calling it a fringe encoding is justified). That encoding (along with other encodings) should not be used any more, as everybody here agrees. Reinterpreting their behavior or improving it in light of Unicode is not worth much effort.

@inexorabletash I'd argue for using an index.

annevk · 2017-05-07T04:41:43Z

Note that this PR's patch uses the index approach now. Review appreciated.

annevk · 2017-05-07T05:43:42Z

Hmm, reading https://en.wikipedia.org/wiki/Katakana suggests I should lowercase katakana when it's not the first word.

vyv03354 · 2017-05-07T08:00:23Z

> Hmm, reading https://en.wikipedia.org/wiki/Katakana suggests I should lowercase katakana when it's not the first word.

What is "lowercase katakana"? I searched the page for "lower" and "first", but I couldn't find such a statement. And Katakana is a caseless script AFAIK.

annevk · 2017-05-07T08:20:15Z

I meant an editorial (non-normative) change from "Katakana" to "katakana". Everything normative in the PR is how I think it should be.

hsivonen

LGTM. Thank you.

hsivonen · 2017-05-11T07:51:07Z

> Chromium, WebKit and Edge don't have that behavior.

IE and Edge don't have the combining behavior in URL query but do have it in form submission.

ISO-2022-JP encoder: convert halfwidth Katakana to fullwidth

44decb1

Fixes #105.

annevk force-pushed the annevk/iso-2022-jp-halfwidth-katakana branch from 9702b0e to 44decb1 Compare May 5, 2017 13:08

annevk requested a review from hsivonen May 5, 2017 13:22

annevk added 2 commits May 6, 2017 15:36

use an index

3e65457

nits

f372aac

lowercase katakana except at start of sentence

1921f8d

hsivonen approved these changes May 8, 2017

annevk merged commit 5a09856 into master May 8, 2017

annevk deleted the annevk/iso-2022-jp-halfwidth-katakana branch May 8, 2017 09:00

hsivonen mentioned this pull request Jun 15, 2017

ISO 2022-jp encoding/decoding support #60

Closed