language-rust lexer rejects Unicode symbols that rustc accepts #3


Closed
RyanGlScott opened this issue Aug 23, 2024 · 1 comment · Fixed by #4
Assignees
Labels
bug Something isn't working

Comments

@RyanGlScott

Per the Rust Reference, Rust permits any identifier that meets the specification in Unicode Standard Annex #31 for Unicode version 15.0. For example, rustc accepts the following program:

// test.rs
fn main() {
    let 𝑂_𝑂 = ();
    𝑂_𝑂
}

language-rust, on the other hand, fails to lex this program:

-- Main.hs
module Main (main) where

import Language.Rust.Data.InputStream
import Language.Rust.Parser
import Language.Rust.Syntax

main :: IO ()
main = do
  is <- readInputStream "test.rs"
  print $ parse @(SourceFile Span) is
$ runghc Main.hs
Left (parse failure at 3:9 (lexical error))

My guess is that this part of the lexer needs to be updated to support Unicode 15.0.
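As a quick sanity check (a sketch using only the Rust standard library; `char::is_alphabetic` tracks the Unicode `Alphabetic` property, which for letters closely approximates UAX #31's XID_Start set), the identifier character above is an ordinary Unicode letter that lies beyond the Basic Multilingual Plane:

```rust
// Sanity check: the identifier characters from the examples are plain
// Unicode letters (per the `Alphabetic` property, a close approximation
// of XID_Start for letters), and their codepoints exceed 0xFFFF.
fn main() {
    for c in ['𝑂', '𐌝'] {
        println!(
            "U+{:X}: alphabetic = {}, beyond BMP = {}",
            c as u32,
            c.is_alphabetic(),
            c as u32 > 0xFFFF
        );
    }
}
```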

@RyanGlScott RyanGlScott added the bug Something isn't working label Aug 23, 2024
@RyanGlScott
Author

Some assorted notes that I took while investigating this:

  • language-rust's lexer implementation is based on the work in Model lexer: Fix remaining issues rust-lang/rust#24620, which uses an ANTLR-based Unicode lexer.

  • It's hard to tell what version of Unicode this was based on, but this comment suggests it is around Unicode 4.0 or so.

  • language-rust copied the ANTLR-based lexer tables directly into its own lexer implementation. However, I'm not entirely convinced that it did so correctly: the ANTLR-based tables encode Unicode characters using UTF-16, whereas language-rust's lexer is generated with alex, which encodes Unicode characters using UTF-8. For sufficiently small codepoints these encodings coincide, but for larger codepoints they do not.

    As a specific example where this goes wrong, consider the 𐌝 character, which has codepoint 0x1031D. In UTF-16, this is encoded as the surrogate pair (0xD800, 0xDF1D), which should be covered by this line in language-rust's lexer. Despite this, language-rust is unable to lex this program:

    // test.rs
    fn main() {
        let 𐌝 = ();
        𐌝
    }
    $ runghc Main.hs 
    Left (parse failure at 3:9 (lexical error))
    

    As such, I think language-rust's lexer is broken for any Unicode character that requires surrogate pairs to encode in UTF-16—that is, any character whose codepoint exceeds the value 0xFFFF.

  • Modern versions of rustc no longer use the ANTLR-based lexer linked above, but instead use a completely different lexer implementation based on these tables (which, in turn, are derived from the data on the official Unicode website). Notably, these tables are not UTF-16–encoded, so they would be much easier to translate to an alex-based lexer.

    I propose that we rewrite language-rust's lexer to be in terms of the data from the Unicode website, similarly to how rustc's modern lexer works. The rustc lexer implementation generates its tables using this script, so perhaps we can adapt this script to generate alex code. Scripting this would also make it much more straightforward to upgrade Unicode versions in the future. (Currently, the script uses Unicode 15.1.0.)
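The UTF-16/UTF-8 mismatch described above can be observed directly with the Rust standard library's own encoders: for a codepoint beyond the BMP, the UTF-16 code units and the UTF-8 bytes share no structure, so lexer tables written against one encoding cannot simply be reused for the other.

```rust
// Demonstrate why UTF-16-based lexer tables cannot be reused verbatim in a
// UTF-8-based lexer: the two encodings of a supplementary-plane codepoint
// have completely different shapes.
fn main() {
    let c = '𐌝'; // U+1031D, the character from the example above

    let utf16: Vec<u16> = c.encode_utf16(&mut [0u16; 2]).to_vec();
    let mut buf = [0u8; 4];
    let utf8: Vec<u8> = c.encode_utf8(&mut buf).as_bytes().to_vec();

    println!("UTF-16 code units: {:04X?}", utf16); // surrogate pair
    println!("UTF-8  bytes:      {:02X?}", utf8);  // four-byte sequence
}
```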

RyanGlScott added a commit that referenced this issue Aug 28, 2024
The previous lexer implementation in `Language.Rust.Parser.Lexer` was broken
for Unicode characters with sufficiently large codepoints, as the previous
implementation incorrectly attempted to port UTF-16–encoded codepoints over to
`alex`, which is UTF-8–encoded. Rather than try to fix the previous
implementation (which was based on old `rustc` code that is no longer used),
this ports the lexer to a new implementation that is based on the Rust
`unicode-xid` crate (which is how modern versions of `rustc` lex Unicode
characters). Specifically:

* This adapts `unicode-xid`'s lexer generation script to generate an
  `alex`-based lexer instead of a Rust-based one.

* The new lexer is generated to support codepoints from Unicode 15.1.0.
  (It is unclear which exact Unicode version the previous lexer targeted, but
  given that it was last updated in 2016, it was likely quite an old version.)

* I have verified that the new lexer can lex exotic Unicode characters such as
  `𝑂` and `𐌝` by adding them as regression tests.

Fixes #3.
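To illustrate the generation step described above (a hypothetical sketch, not the actual unicode-xid script): the Unicode data files list ranges in the DerivedCoreProperties.txt format (`XXXX[..YYYY] ; PropName # comment`), and each range maps onto an alex character-set escape, so generating an alex macro is essentially a line-rewriting exercise. The function names and the sample lines below are illustrative only.

```rust
// Hypothetical sketch of table generation: extract the ranges carrying a
// given Unicode property and render them in alex character-set syntax.
fn ranges_for(prop: &str, lines: &[&str]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    for line in lines {
        let line = line.split('#').next().unwrap(); // drop the comment
        let mut fields = line.split(';');
        let (Some(range), Some(p)) = (fields.next(), fields.next()) else {
            continue;
        };
        if p.trim() != prop {
            continue;
        }
        let mut ends = range.trim().split("..");
        let lo = u32::from_str_radix(ends.next().unwrap(), 16).unwrap();
        let hi = match ends.next() {
            Some(h) => u32::from_str_radix(h, 16).unwrap(),
            None => lo, // a single codepoint, not a range
        };
        out.push((lo, hi));
    }
    out
}

// Render the ranges as an alex character-set macro, e.g. "$xid_start = [\x41-\x5A]".
fn to_alex(name: &str, ranges: &[(u32, u32)]) -> String {
    let body: String = ranges
        .iter()
        .map(|&(lo, hi)| {
            if lo == hi {
                format!("\\x{:X}", lo)
            } else {
                format!("\\x{:X}-\\x{:X}", lo, hi)
            }
        })
        .collect();
    format!("${} = [{}]", name, body)
}

fn main() {
    // Sample lines in DerivedCoreProperties.txt format (illustrative data).
    let sample = [
        "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..Z",
        "005F          ; XID_Continue # Pc  LOW LINE",
        "10300..1032F  ; XID_Start # Lo  [48] OLD ITALIC LETTER A..",
    ];
    println!("{}", to_alex("xid_start", &ranges_for("XID_Start", &sample)));
}
```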
@RyanGlScott RyanGlScott self-assigned this Aug 29, 2024