From 91c95a18e6aa0f5f687bab55a43fa0290afb6f1c Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Sun, 9 May 2021 20:26:02 -0700
Subject: [PATCH] Expand on Unicode identifiers.

---
 src/identifiers.md | 62 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/src/identifiers.md b/src/identifiers.md
index e57be8407..c6361c4db 100644
--- a/src/identifiers.md
+++ b/src/identifiers.md
@@ -2,8 +2,8 @@
 
 > **Lexer:**\
 > IDENTIFIER_OR_KEYWORD :\
->       XID_start XID_continue\*\
->    | `_` XID_continue+
+>       XID_Start XID_Continue\*\
+>    | `_` XID_Continue+
 >
 > RAW_IDENTIFIER : `r#` IDENTIFIER_OR_KEYWORD *Except `crate`, `self`, `super`, `Self`*
 >
@@ -12,29 +12,59 @@
 > IDENTIFIER :\
 > NON_KEYWORD_IDENTIFIER | RAW_IDENTIFIER
 
-An identifier is any nonempty Unicode string of the following form:
+
+Identifiers follow the specification in [Unicode Standard Annex #31][UAX31] for Unicode version 13.0, with the additions described below. Some examples of identifiers:
 
-Either
+* `foo`
+* `_identifier`
+* `r#true`
+* `Москва`
+* `東京`
 
-* The first character has property [`XID_start`].
-* The remaining characters have property [`XID_continue`].
+The profile used from UAX #31 is:
 
-Or
+* Start := [`XID_Start`], plus the underscore character (U+005F)
+* Continue := [`XID_Continue`]
+* Medial := empty
 
-* The first character is `_`.
-* The identifier is more than one character. `_` alone is not an identifier.
-* The remaining characters have property [`XID_continue`].
+> **Note**: Identifiers starting with an underscore are typically used to indicate an identifier that is intentionally unused, and will silence the unused warning in `rustc`.
 
-> **Note**: [`XID_start`] and [`XID_continue`] as character properties cover the
-> character ranges used to form the more familiar C and Java language-family
-> identifiers.
+Identifiers may not be a [strict] or [reserved] keyword without the `r#` prefix described below in [raw identifiers](#raw-identifiers).
+
+Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers.
+
+Identifiers are restricted to the ASCII subset of [`XID_Start`] and [`XID_Continue`] in the following situations:
+
+* [`extern crate`] declarations
+* External crate names referenced in a [path]
+* [Module] names loaded from the filesystem without a [`path` attribute]
+* [`no_mangle`] attributed items
+* Item names in [external blocks]
+
+## Normalization
+
+Identifiers are normalized using Normalization Form C (NFC) as defined in [Unicode Standard Annex #15][UAX15]. Two identifiers are equal if their NFC forms are equal.
+
+[Procedural][proc-macro] and [declarative][mbe] macros receive normalized identifiers in their input.
+
+## Raw identifiers
 
 A raw identifier is like a normal identifier, but prefixed by `r#`. (Note that
 the `r#` prefix is not included as part of the actual identifier.)
 Unlike a normal identifier, a raw identifier may be any strict or reserved
 keyword except the ones listed above for `RAW_IDENTIFIER`.
 
-[strict]: keywords.md#strict-keywords
+[`extern crate`]: items/extern-crates.md
+[`no_mangle`]: abi.md#the-no_mangle-attribute
+[`path` attribute]: items/modules.md#the-path-attribute
+[`XID_Continue`]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AXID_Continue%3A%5D&abb=on&g=&i=
+[`XID_Start`]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AXID_Start%3A%5D&abb=on&g=&i=
+[external blocks]: items/external-blocks.md
+[mbe]: macros-by-example.md
+[module]: items/modules.md
+[path]: paths.md
+[proc-macro]: procedural-macros.md
 [reserved]: keywords.md#reserved-keywords
-[`XID_start`]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AXID_Start%3A%5D&abb=on&g=&i=
-[`XID_continue`]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AXID_Continue%3A%5D&abb=on&g=&i=
+[strict]: keywords.md#strict-keywords
+[UAX15]: https://www.unicode.org/reports/tr15/tr15-50.html
+[UAX31]: https://www.unicode.org/reports/tr31/tr31-33.html
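
The lexer rule and the ASCII restriction in this patch can be illustrated with a small sketch. It is not part of the patch and is not how `rustc` lexes identifiers. It covers only the ASCII subset of `XID_Start` (A-Z, a-z) and `XID_Continue` (letters, digits, `_`), because the full Unicode tables and NFC normalization are not in the standard library. All function names below are hypothetical.

```rust
/// ASCII subset of XID_Start: the letters A-Z and a-z.
fn is_ascii_xid_start(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// ASCII subset of XID_Continue: letters, digits, and the underscore.
fn is_ascii_xid_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// IDENTIFIER_OR_KEYWORD restricted to ASCII (hypothetical helper):
///     XID_Start XID_Continue*  |  `_` XID_Continue+
fn is_ascii_identifier_or_keyword(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // First alternative: a start character, then zero or more continues.
        Some(first) if is_ascii_xid_start(first) => chars.all(is_ascii_xid_continue),
        // Second alternative: `_` followed by at least one continue character,
        // so `_` on its own is rejected.
        Some('_') => {
            let rest = chars.as_str();
            !rest.is_empty() && rest.chars().all(is_ascii_xid_continue)
        }
        _ => false,
    }
}

/// RAW_IDENTIFIER restricted to ASCII (hypothetical helper):
///     `r#` IDENTIFIER_OR_KEYWORD, except `crate`, `self`, `super`, `Self`.
/// Returns the identifier without the `r#` prefix, since the prefix is not
/// part of the actual identifier.
fn parse_ascii_raw_identifier(s: &str) -> Option<&str> {
    let name = s.strip_prefix("r#")?;
    let excluded = ["crate", "self", "super", "Self"];
    if is_ascii_identifier_or_keyword(name) && !excluded.contains(&name) {
        Some(name)
    } else {
        None
    }
}
```

Under these assumptions, `foo` and `_identifier` are accepted, `_` alone is rejected, and `r#true` yields the identifier `true`. `Москва` and `東京` are rejected here only because the sketch stops at ASCII, which mirrors the restricted situations listed in the patch (for example `extern crate` declarations); in general they are valid identifiers.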