-
Notifications
You must be signed in to change notification settings - Fork 461
rewrite the regex crate #978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b534e18
to
0eb9aee
Compare
@BurntSushi I'm not sure if this is something you're interested in, but using |
0eb9aee
to
4aa13d0
Compare
@lopopolo Aye thanks for that idea. I'm going to pass for now primarily because the |
d12ba51
to
93f3f6e
Compare
We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)
It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.
This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.
Basically, whenever a counted repetition is applied to a sub-expression that can only ever match the empty string, the counted repetition can be reduced to 1. We can achieve that optimization very easily via the Hir::repetition smart constructor. This is somewhat important to do because otherwise one can write something like \B{10000}. The higher level infrastructure is somewhat dumb about this and will happily try to match \B over and over again. We should probably improve the higher level aspects of this (because this is not the only case that can cause the same assertions being repeatedly evaluated at the same position), but this fixes the most obvious ones at the HIR level.
This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821
The fuzzer sometimes runs into situations where it builds regexes that can take a while to execute, such as `\B{10000}`. They fit within the default size limit, but the search times aren't great. But it's not a bug. So try to decrease the size limit a bit to try and prevent timeouts. We might consider trying to optimize cases like `\B{10000}`. A naive optimization would be to remove any redundant conditional epsilon transitions within a single epsilon closure, but that can be tricky to do a priori. The case of `\B{100000}` is probably easy to detect, but they can be arbitrarily complex. Another way to attack this would be to modify, say, the PikeVM to only compute whether a conditional epsilon transition should be followed once per haystack position. Right now, I think it is re-computing them even though it doesn't have to.
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848
It turns out that the way we were dealing with quit states in the DFA was not quite right. Basically, if we entered a quit state and a match had been found, then we were returning the match instead of the error. But the match might not be the correct leftmost-first match, and so, we really shouldn't return it. Otherwise a regex like '\B.*' could match much less than it should. This was caught by a differential fuzzer developed in #848.
This adds a regression test for a bug found in the *old* regex crate that isn't present with the regex-automata rewrite. I discovered this while doing differential fuzzing. I didn't do a root cause analysis of the bug, but my guess is a literal optimization problem.
This makes it a little easier to introspect what the regex crate is doing. Just pass RUST_LOG=debug to get a general sense of things, and RUST_LOG=trace to get a lot more output.
... and add some more fuzz testing based on it. Closes #991
893c7cd
to
b4a8948
Compare
b34bf39
to
73e45ea
Compare
This commit grew into a monster. I ran out of energy trying to split everything up. For the most part, this commit is about polishing and writing docs.
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
73e45ea
to
8513751
Compare
This effectively replaces the existing
regex
crate internals with public APIs (that anyone can use) fromregex-automata
.I'll fill our this PR a little bit more, but I don't intend to merge this for a bit until current
master
has been released. I have this up in a PR primarily to make sure CI is working as intended, and in particular, to ensure that the 1.8 release is actually sufficient preparation for the 1.9 release. (It would be bad to put out 1.8 and then realize I need another breaking change inregex-syntax
for example in order to wire everything up.)For those who want to browse the
regex-automata
API, you can do so here (top-level docs haven't been written yet): https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/