-
Notifications
You must be signed in to change notification settings - Fork 0
language-rust
lexer rejects Unicode symbols that rustc
accepts
#3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Some assorted notes that I took while investigating this:
|
The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken for Unicode characters with sufficiently large codepoints, as the previous implementation incorrectly attempted to port UTF-16–encoded codepoints over to `alex`, which is UTF-8–encoded. Rather than try to fix the previous implementation (which was based on old `rustc` code that is no longer used), this ports the lexer to a new implementation that is based on the Rust `unicode-xid` crate (which is how modern versions of `rustc` lex Unicode characters). Specifically: * This adapts `unicode-xid`'s lexer generation script to generate an `alex`-based lexer instead of a Rust-based one. * The new lexer is generated to support codepoints from Unicode 15.1.0. (It is unclear which exact Unicode version the previous lexer targeted, but given that it was last updated in 2016, it was likely quite an old version.) * I have verified that the new lexer can lex exotic Unicode characters such as `𝑂` and `𐌝` by adding them as regression tests. Fixes #3.
Per the Rust Reference, Rust permits any identifier that meets the specification in Unicode Standard Annex #31 for Unicode version 15.0. For example,
rustc
accepts the following program:language-rust
, on the other hand, fails to lex this program:My guess is that this part of the lexer needs to be updated to support Unicode 15.0.
The text was updated successfully, but these errors were encountered: