language-rust lexer rejects Unicode symbols that rustc accepts #3


Closed
RyanGlScott opened this issue Aug 23, 2024 · 1 comment · Fixed by #4
Assignees
Labels
bug Something isn't working

Comments

@RyanGlScott

Per the Rust Reference, Rust permits any identifier that meets the specification in Unicode Standard Annex #31 for Unicode version 15.0. For example, rustc accepts the following program:

// test.rs
fn main() {
    let 𝑂_𝑂 = ();
    𝑂_𝑂
}

language-rust, on the other hand, fails to lex this program:

-- Main.hs
module Main (main) where

import Language.Rust.Data.InputStream
import Language.Rust.Parser
import Language.Rust.Syntax

main :: IO ()
main = do
  is <- readInputStream "test.rs"
  print $ parse @(SourceFile Span) is
$ runghc Main.hs
Left (parse failure at 3:9 (lexical error))

My guess is that this part of the lexer needs to be updated to support Unicode 15.0.
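As a quick sanity check (a sketch using only the Rust standard library; `char::is_alphabetic` tracks the Unicode `Alphabetic` property, which for letters closely approximates UAX #31's XID_Start set), the identifier character above is an ordinary Unicode letter that lies beyond the Basic Multilingual Plane:

```rust
// Sanity check: the identifier characters from the examples are plain
// Unicode letters (per the `Alphabetic` property, a close approximation
// of XID_Start for letters), and their codepoints exceed 0xFFFF.
fn main() {
    for c in ['𝑂', '𐌝'] {
        println!(
            "U+{:X}: alphabetic = {}, beyond BMP = {}",
            c as u32,
            c.is_alphabetic(),
            c as u32 > 0xFFFF
        );
    }
}
```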

@RyanGlScott RyanGlScott added the bug Something isn't working label Aug 23, 2024
@RyanGlScott
Author

Some assorted notes that I took while investigating this:

  • language-rust's lexer implementation is based on the work in Model lexer: Fix remaining issues rust-lang/rust#24620, which uses an ANTLR-based Unicode lexer.

  • It's hard to tell what version of Unicode this was based on, but this comment suggests it is around Unicode 4.0 or so.

  • language-rust copied the ANTLR-based lexer tables directly into its own lexer implementation. However, I'm not entirely convinced that it did so correctly: the ANTLR-based tables encode Unicode characters using UTF-16, whereas language-rust's lexer is generated with alex, which encodes Unicode characters using UTF-8. For sufficiently small codepoints these encodings coincide, but for larger codepoints they do not.

    As a specific example where this goes wrong, consider the 𐌝 character, which has codepoint 0x1031D. In UTF-16, this is encoded as the surrogate pair (0xD800, 0xDF1D), which should be covered by this line in language-rust's lexer. Despite this, language-rust is unable to lex this program:

    // test.rs
    fn main() {
        let 𐌝 = ();
        𐌝
    }
    $ runghc Main.hs 
    Left (parse failure at 3:9 (lexical error))
    

    As such, I think language-rust's lexer is broken for any Unicode character that requires surrogate pairs to encode in UTF-16—that is, any character whose codepoint exceeds the value 0xFFFF.

  • Modern versions of rustc no longer use the ANTLR-based lexer linked above, but instead use a completely different lexer implementation based on these tables (which, in turn, are derived from the data on the official Unicode website). Notably, these tables are not UTF-16–encoded, so they would be much easier to translate to an alex-based lexer.

    I propose that we rewrite language-rust's lexer to be in terms of the data from the Unicode website, similarly to how rustc's modern lexer works. The rustc lexer implementation generates its tables using this script, so perhaps we can adapt this script to generate alex code. Scripting this would also make it much more straightforward to upgrade Unicode versions in the future. (Currently, the script uses Unicode 15.1.0.)
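The UTF-16/UTF-8 mismatch described above can be observed directly with the Rust standard library's own encoders: for a codepoint beyond the BMP, the UTF-16 code units and the UTF-8 bytes share no structure, so lexer tables written against one encoding cannot simply be reused for the other.

```rust
// Demonstrate why UTF-16-based lexer tables cannot be reused verbatim in a
// UTF-8-based lexer: the two encodings of a supplementary-plane codepoint
// have completely different shapes.
fn main() {
    let c = '𐌝'; // U+1031D, the character from the example above

    let utf16: Vec<u16> = c.encode_utf16(&mut [0u16; 2]).to_vec();
    let mut buf = [0u8; 4];
    let utf8: Vec<u8> = c.encode_utf8(&mut buf).as_bytes().to_vec();

    println!("UTF-16 code units: {:04X?}", utf16); // surrogate pair
    println!("UTF-8  bytes:      {:02X?}", utf8);  // four-byte sequence
}
```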

RyanGlScott added a commit that referenced this issue Aug 28, 2024
The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken
for Unicode characters with sufficiently large codepoints, as the previous
implementation incorrectly attempted to port UTF-16–encoded codepoints over to
`alex`, which is UTF-8–encoded. Rather than try to fix the previous
implementation (which was based on old `rustc` code that is no longer used),
this ports the lexer to a new implementation that is based on the Rust
`unicode-xid` crate (which is how modern versions of `rustc` lex Unicode
characters). Specifically:

* This adapts `unicode-xid`'s lexer generation script to generate an
  `alex`-based lexer instead of a Rust-based one.

* The new lexer is generated to support codepoints from Unicode 15.1.0.
  (It is unclear which exact Unicode version the previous lexer targeted, but
  given that it was last updated in 2016, it was likely quite an old version.)

* I have verified that the new lexer can lex exotic Unicode characters such as
  `𝑂` and `𐌝` by adding them as regression tests.

Fixes #3.
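To illustrate the generation step described above (a hypothetical sketch, not the actual unicode-xid script): the Unicode data files list ranges in the DerivedCoreProperties.txt format (`XXXX[..YYYY] ; PropName # comment`), and each range maps onto an alex character-set escape, so generating an alex macro is essentially a line-rewriting exercise. The function names and the sample lines below are illustrative only.

```rust
// Hypothetical sketch of table generation: extract the ranges carrying a
// given Unicode property and render them in alex character-set syntax.
fn ranges_for(prop: &str, lines: &[&str]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    for line in lines {
        let line = line.split('#').next().unwrap(); // drop the comment
        let mut fields = line.split(';');
        let (Some(range), Some(p)) = (fields.next(), fields.next()) else {
            continue;
        };
        if p.trim() != prop {
            continue;
        }
        let mut ends = range.trim().split("..");
        let lo = u32::from_str_radix(ends.next().unwrap(), 16).unwrap();
        let hi = match ends.next() {
            Some(h) => u32::from_str_radix(h, 16).unwrap(),
            None => lo, // a single codepoint, not a range
        };
        out.push((lo, hi));
    }
    out
}

// Render the ranges as an alex character-set macro, e.g. "$xid_start = [\x41-\x5A]".
fn to_alex(name: &str, ranges: &[(u32, u32)]) -> String {
    let body: String = ranges
        .iter()
        .map(|&(lo, hi)| {
            if lo == hi {
                format!("\\x{:X}", lo)
            } else {
                format!("\\x{:X}-\\x{:X}", lo, hi)
            }
        })
        .collect();
    format!("${} = [{}]", name, body)
}

fn main() {
    // Sample lines in DerivedCoreProperties.txt format (illustrative data).
    let sample = [
        "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..Z",
        "005F          ; XID_Continue # Pc  LOW LINE",
        "10300..1032F  ; XID_Start # Lo  [48] OLD ITALIC LETTER A..",
    ];
    println!("{}", to_alex("xid_start", &ranges_for("XID_Start", &sample)));
}
```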
@RyanGlScott RyanGlScott self-assigned this Aug 29, 2024