- https://www.w3.org/International/articles/idn-and-iri/ (most useful)
- http://xahlee.info/js/url_encoding_unicode.html (quite old but useful)
Internationalisation (i18n) is achieved differently for the domain and path parts of a URL
- Domain: Uses IDN (punycode)
- Steps
- Browser converts the typed text from whatever encoding it arrived in to Unicode
- Browser normalises the Unicode
- Browser Punycode-encodes each non-ASCII label in the domain (the result is prefixed with xn--)
- Browser makes a DNS query for the Punycode-encoded domain
- Browser gets IP address
- Request continues as normal
- Path: Uses IRI (Internationalised resource identifiers)
- Steps
- Browser converts the text from whatever encoding it arrived in to Unicode
- Browser normalises the Unicode
- Browser encodes the Unicode codepoints as UTF-8
- Browser percent-encodes the UTF-8 encoded bytes
- Browser sends the request
- Server percent-decodes the path and interprets the resulting bytes as UTF-8 to recover the Unicode code points
- Server converts Unicode to whatever the underlying filesystem uses
- Server finds the resource and returns it to the browser
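The path steps above can be sketched in Ruby using only the core library. This is a rough illustration, not what any browser or server actually runs: `percent_encode_path` and `percent_decode_path` are hypothetical helper names, the example path is made up, and real browsers follow URL parsing rules that differ in detail (for example, which ASCII characters also get escaped).

```ruby
# Hypothetical helper: the browser side of the path steps.
def percent_encode_path(path)
  path
    .encode("UTF-8")          # convert whatever encoding the text is in to Unicode (stored as UTF-8)
    .unicode_normalize(:nfc)  # normalise the Unicode (NFC is assumed here)
    .gsub(/[^\x00-\x7F]/) do |char|
      # percent-encode every UTF-8 byte of each non-ASCII character
      char.bytes.map { |byte| format("%%%02X", byte) }.join
    end
end

# Hypothetical helper: the server side of the path steps.
def percent_decode_path(encoded)
  # turn each %XX back into a raw byte, then interpret the bytes as UTF-8
  encoded.b.gsub(/%(\h\h)/) { $1.hex.chr }.force_encoding("UTF-8")
end

path    = "/wāhi/kōrero" # made-up example path
encoded = percent_encode_path(path)
decoded = percent_decode_path(encoded)

puts encoded
puts decoded
```

The server would then still need to map `decoded` to whatever encoding its filesystem uses, which this sketch does not attempt.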
Easiest: https://www.punycoder.com/
Ruby:
# First:
#
# $ gem install simpleidn
#
require "simpleidn"
punycode_domain = "xn--mllerriis-l8a.com"
unicode_domain = SimpleIDN.to_unicode(punycode_domain)
# => "møllerriis.com"
SimpleIDN.to_ascii(unicode_domain)
# => "xn--mllerriis-l8a.com"
# Māori version
http://māori-example.rabidapp.nz
# * when pasted into the browser address bar, the browser will actually send
# the Punycode version to the server
# Punycode version
http://xn--mori-example-7mb.rabidapp.nz
# * when pasted into the browser address bar, the browser will convert it to
# Unicode for display
# "Englishized" version (you should register this too, so that those who
# cannot easily type non-ASCII characters can still reach the site)
http://maori-example.rabidapp.nz
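To see what the Punycode step does without a gem or a website, here is a minimal sketch of the RFC 3492 encoding procedure in plain Ruby. It is an illustration only: `punycode_encode`, `adapt`, `digit` and `to_ascii_sketch` are names chosen for this note, and the domain handling is simplified to NFC normalisation plus lowercasing, which is far less than full IDNA processing. For real use, prefer a library such as simpleidn.

```ruby
# Punycode parameters from RFC 3492
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Bias adaptation function from RFC 3492
def adapt(delta, numpoints, first)
  delta = first ? delta / DAMP : delta / 2
  delta += delta / numpoints
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
end

# Map a digit value to its character: 0..25 => a..z, 26..35 => 0..9
def digit(value)
  (value < 26 ? 97 + value : 22 + value).chr
end

# Encode a single label (without the xn-- prefix)
def punycode_encode(label)
  cps = label.codepoints
  basic = cps.select { |c| c < 128 }
  out = basic.pack("U*")
  out << "-" unless basic.empty?
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  h = b = basic.length
  while h < cps.length
    m = cps.select { |c| c >= n }.min # next code point to insert
    delta += (m - n) * (h + 1)
    n = m
    cps.each do |c|
      delta += 1 if c < n
      next unless c == n
      # emit delta as a variable-length integer
      q = delta
      k = BASE
      loop do
        t = if k <= bias then TMIN elsif k >= bias + TMAX then TMAX else k - bias end
        break if q < t
        out << digit(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      out << digit(q)
      bias = adapt(delta, h + 1, h == b)
      delta = 0
      h += 1
    end
    delta += 1
    n += 1
  end
  out
end

# Simplified stand-in for the browser's domain step: encode each non-ASCII label
def to_ascii_sketch(domain)
  domain.unicode_normalize(:nfc).downcase.split(".").map do |label|
    label.ascii_only? ? label : "xn--" + punycode_encode(label)
  end.join(".")
end

puts to_ascii_sketch("møllerriis.com")
puts to_ascii_sketch("māori-example.rabidapp.nz")
```

Labels that are already ASCII (such as `rabidapp` and `nz`) pass through untouched, which is why only the first label of each example changes.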
Consider this unusual URL:
https://arstechnica.com/science/2019/05/the-oceans-absorbed-extra-co₂-in-the-2000s/
- Note that the non-ASCII character is in the path, not the domain (this matters because the two parts are encoded differently)
- It has a "subscript two" Unicode character which is NOT percent-encoded
- Unicode "subscript two"
- https://www.fileformat.info/info/unicode/char/2082/index.htm
- is code point U+2082
- is encoded in UTF-8 as the bytes 0xE2 0x82 0x82
- Browsers actually send the percent-encoded form in the request but may show the Unicode form in the URL bar
- Modern browsers seem to show the Unicode form in the URL bar
- Firefox and Chrome
- when you paste the URL above into the URL bar, Firefox shows the
original URL, but copying it from the URL bar yields the percent-encoded form:
https://arstechnica.com/science/2019/05/the-oceans-absorbed-extra-co%E2%82%82-in-the-2000s/
- Safari
- Copies the Unicode version, not the percent-encoded version, e.g. the URL copied from Safari will be
https://arstechnica.com/science/2019/05/the-oceans-absorbed-extra-co₂-in-the-2000s/
- Browsers automatically convert URIs that contain Unicode characters
- the details vary somewhat between browsers
- a URL with Unicode characters in its path is automatically converted to the percent-encoded form
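The numbers quoted for the "subscript two" character, and the claim that the two copied forms are the same URL, can be checked with a few lines of Ruby. This is just a quick sanity check; the variable names are my own.

```ruby
subscript_two = "₂"

# Code point and UTF-8 bytes of the character
code_point   = format("U+%04X", subscript_two.ord)
utf8_bytes   = subscript_two.bytes.map { |b| format("%02X", b) }.join(" ")
percent_form = subscript_two.bytes.map { |b| format("%%%02X", b) }.join

puts code_point   # U+2082
puts utf8_bytes   # E2 82 82
puts percent_form # %E2%82%82

# The form Safari copies versus the form Firefox and Chrome copy
unicode_url = "https://arstechnica.com/science/2019/05/the-oceans-absorbed-extra-co₂-in-the-2000s/"
encoded_url = "https://arstechnica.com/science/2019/05/the-oceans-absorbed-extra-co%E2%82%82-in-the-2000s/"

# Percent-encoding the one non-ASCII character turns one form into the other
same = unicode_url.sub(subscript_two, percent_form) == encoded_url
puts same # true
```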