rewrite the regex crate #978

BurntSushi · 2023-04-18T18:52:44Z

This effectively replaces the existing regex crate internals with public APIs (that anyone can use) from regex-automata.

I'll fill our this PR a little bit more, but I don't intend to merge this for a bit until current master has been released. I have this up in a PR primarily to make sure CI is working as intended, and in particular, to ensure that the 1.8 release is actually sufficient preparation for the 1.9 release. (It would be bad to put out 1.8 and then realize I need another breaking change in regex-syntax for example in order to wire everything up.)

For those who want to browse the regex-automata API, you can do so here (top-level docs haven't been written yet): https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/

lopopolo · 2023-04-21T16:01:49Z

@BurntSushi I'm not sure if this is something you're interested in, but using git subtree you can import regex-automata and all of its history into the regex repository: https://manpages.ubuntu.com/manpages/bionic/man1/git-subtree.1.html

BurntSushi · 2023-04-21T16:03:30Z

@lopopolo Aye thanks for that idea. I'm going to pass for now primarily because the regex-automata I'm importing here is basically a total rewrite. It doesn't really have much meaningful history going back to 0.1. It's a nearly totally different thing.

We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar

As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)

Closes #961

It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.

This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.

Basically, whenever a counted repetition is applied to a sub-expression that can only ever match the empty string, the counted repetition can be reduced to 1. We can achieve that optimization very easily via the Hir::repetition smart constructor. This is somewhat important to do because otherwise one can write something like \B{10000}. The higher level infrastructure is somewhat dumb about this and will happily try to match \B over and over again. We should probably improve the higher level aspects of this (because this is not the only case that can cause the same assertions being repeatedly evaluated at the same position), but this fixes the most obvious ones at the HIR level.

This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821

The fuzzer sometimes runs into situations where it builds regexes that can take a while to execute, such as `\B{10000}`. They fit within the default size limit, but the search times aren't great. But it's not a bug. So try to decrease the size limit a bit to try and prevent timeouts. We might consider trying to optimize cases like `\B{10000}`. A naive optimization would be to remove any redundant conditional epsilon transitions within a single epsilon closure, but that can be tricky to do a priori. The case of `\B{100000}` is probably easy to detect, but they can be arbitrarily complex. Another way to attack this would be to modify, say, the PikeVM to only compute whether a conditional epsilon transition should be followed once per haystack position. Right now, I think it is re-computing them even though it doesn't have to.

This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848

It turns out that the way we were dealing with quit states in the DFA was not quite right. Basically, if we entered a quit state and a match had been found, then we were returning the match instead of the error. But the match might not be the correct leftmost-first match, and so, we really shouldn't return it. Otherwise a regex like '\B.*' could match much less than it should. This was caught by a differential fuzzer developed in #848.

This adds a regression test for a bug found in the *old* regex crate that isn't present with the regex-automata rewrite. I discovered this while doing differential fuzzing. I didn't do a root cause analysis of the bug, but my guess is a literal optimization problem.

This makes it a little easier to introspect what the regex crate is doing. Just pass RUST_LOG=debug to get a general sense of things, and RUST_LOG=trace to get a lot more output.

... and add some more fuzz testing based on it. Closes #991

This commit grew into a monster. I ran out of energy trying to split everything up. For the most part, this commit is about polishing and writing docs.

I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656

BurntSushi force-pushed the ag/regex-automata branch 22 times, most recently from b534e18 to 0eb9aee Compare April 21, 2023 00:48

BurntSushi force-pushed the ag/regex-automata branch from 0eb9aee to 4aa13d0 Compare April 21, 2023 16:02

BurntSushi force-pushed the ag/regex-automata branch 5 times, most recently from d12ba51 to 93f3f6e Compare April 27, 2023 13:52

BurntSushi and others added 13 commits June 5, 2023 14:03

bench: move the old recordings to 'record' directory

e9ffe3f

We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar

bench: remove the old harness

3cc7a3a

As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)

api: introduce new regex-lite crate

75366f9

Closes #961

fuzz: improve fuzz testing

7cf6102

It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.

syntax: add new 'arbitrary' crate feature

03eb47f

This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.

fuzz: use structured fuzzer input

05b50be

This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821

fuzz: add syntactic structurally aware fuzzers

bc1f1af

This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848

fuzz: add logging support to the fuzzer

4d4f011

This makes it a little easier to introspect what the regex crate is doing. Just pass RUST_LOG=debug to get a general sense of things, and RUST_LOG=trace to get a lot more output.

fuzz: improve Arbitrary impl for Unicode classes

e34b914

... and add some more fuzz testing based on it. Closes #991

BurntSushi force-pushed the ag/regex-automata branch 3 times, most recently from 893c7cd to b4a8948 Compare June 20, 2023 01:36

BurntSushi force-pushed the ag/regex-automata branch 7 times, most recently from b34bf39 to 73e45ea Compare July 4, 2023 22:27

BurntSushi added 2 commits July 4, 2023 20:45

*: lots of polish

e96bf8f

This commit grew into a monster. I ran out of energy trying to split everything up. For the most part, this commit is about polishing and writing docs.

BurntSushi force-pushed the ag/regex-automata branch from 73e45ea to 8513751 Compare July 5, 2023 01:11

BurntSushi merged commit aa64e6d into master Jul 5, 2023

utkarshgupta137 mentioned this pull request Jul 5, 2023

Won't compile after regex 1.9.0 alexliesenfeld/httpmock#84

Closed

BurntSushi deleted the ag/regex-automata branch July 5, 2023 12:54

addisoncrump mentioned this pull request Jul 12, 2023

fuzz_regex_match was not rolled back before merge #1037

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rewrite the regex crate #978

rewrite the regex crate #978

BurntSushi commented Apr 18, 2023 •

edited

Loading

lopopolo commented Apr 21, 2023

BurntSushi commented Apr 21, 2023

rewrite the regex crate #978

rewrite the regex crate #978

Conversation

BurntSushi commented Apr 18, 2023 • edited Loading

lopopolo commented Apr 21, 2023

BurntSushi commented Apr 21, 2023

BurntSushi commented Apr 18, 2023 •

edited

Loading