
Add Fuzz targets #262

Open · wants to merge 22 commits into base: main
Conversation

@pcwizz commented Feb 26, 2025

Adds a selection of fuzz targets for jaq core. Lexer types are made public to enable direct fuzzing of the parser.

@01mf02 (Owner) commented Mar 4, 2025

Thanks a lot for your work, @pcwizz!

```rust
use libfuzzer_sys::fuzz_target;

fuzz_target!(|code: &str| {
    if code.contains("\"") || code.contains("main") {
```
01mf02 (Owner): What is the purpose of this, especially the "main" part?

pcwizz (Author): There are some shallow crashes on inputs with multiple mains, for instance. This bit just filters the inputs so I could look for deeper crashes.

01mf02 (Owner): Thanks for the explanation. However, in jq, having multiple mains is perfectly valid: for example, def main: 1; def main: 2; main is OK and returns 2. Also, I find it regrettable that filtering out any quotes means none of the string parsing gets fuzzed.

Suppose we removed the if code.contains ... line. How can we evaluate the effectiveness of that change? In other words, how can we know whether this line helps or hinders the search for deeper crashes? Do you measure some kind of code coverage?

pcwizz (Author): You can generate coverage reports: https://rust-fuzz.github.io/book/cargo-fuzz/coverage.html#generate-code-coverage-data

This should tell you which code paths are being hit.

I was quite tempted to remove those filters myself before committing, but decided instead to commit the target as I had used it. My general rule of thumb: if I get into a cycle of hitting the same crash a couple of times, I adjust the target to avoid it.

@01mf02 (Owner) commented Mar 4, 2025

I have now implemented Arbitrary manually for token types. This should prevent the generation of many tokens that the lexer would never output, and which the parser is therefore not supposed to handle. I will also document these invariants in the type definitions of Token and StrPart.

I noticed that the fuzzer now prints much less output. I would be very grateful if you check whether this is OK or whether I screwed up the Arbitrary implementation.

Also, I had a bit of a problem generating a non-empty &str (in Token::arbitrary). I now just fail when a generated &str is empty, but this is a bit suboptimal, because it wastes computation power. If you have an idea how to do this better, I'm all ears!
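One stdlib-only way to sidestep the problem is to make non-emptiness hold by construction instead of failing. The sketch below is not jaq's code: `take_str` is a hypothetical stand-in for the arbitrary crate's `Unstructured`, and the `"a"` fallback is an arbitrary choice, but it contrasts rejecting an empty draw with repairing it:

```rust
// Sketch (not jaq's code): two ways to get a non-empty &str from raw fuzz
// bytes. `take_str` is a hypothetical stand-in for arbitrary's `Unstructured`.

/// Take up to 8 bytes from the front of `data` and keep the longest
/// valid UTF-8 prefix of them.
fn take_str<'a>(data: &mut &'a [u8]) -> &'a str {
    let s = *data;
    let n = s.len().min(8);
    let (head, rest) = s.split_at(n);
    *data = rest;
    match std::str::from_utf8(head) {
        Ok(s) => s,
        Err(e) => std::str::from_utf8(&head[..e.valid_up_to()]).unwrap(),
    }
}

/// Variant 1: reject empty strings. This is what failing in
/// `Token::arbitrary` amounts to: the whole input is wasted.
fn nonempty_reject<'a>(data: &mut &'a [u8]) -> Option<&'a str> {
    let s = take_str(data);
    if s.is_empty() { None } else { Some(s) }
}

/// Variant 2: non-empty by construction. Substitute a fixed one-character
/// string instead of failing, so no input is ever discarded.
fn nonempty_fallback<'a>(data: &mut &'a [u8]) -> &'a str {
    let s = take_str(data);
    if s.is_empty() { "a" } else { s }
}
```

The fallback skews the distribution slightly toward the substitute string, but it keeps every fuzzer input productive rather than discarding it.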

@pcwizz (Author) commented Mar 4, 2025

> I have now implemented Arbitrary manually for token types. This should prevent the generation of many tokens that the lexer would never output, and which the parser is therefore not supposed to handle. I will also document these invariants in the type definitions of Token and StrPart.
>
> I noticed that the fuzzer now prints much less output. I would be very grateful if you check whether this is OK or whether I screwed up the Arbitrary implementation.
>
> Also, I had a bit of a problem generating a non-empty &str (in Token::arbitrary). I now just fail when a generated &str is empty, but this is a bit suboptimal, because it wastes computation power. If you have an idea how to do this better, I'm all ears!

Sure, I will take a look.

@01mf02 (Owner) commented Mar 4, 2025

Inspired by your comment:

> You can add your own inputs here (e.g. valid jaq programs) as a starting corpus to bootstrap your fuzzing efforts deeper in jaq's core.

I just wrote a script that makes a nice initial fuzzing corpus.
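A corpus-seeding script of this kind can be very small. The sketch below is not the script from this PR: the `examples/*.jq` glob and the `parse` target name are assumptions, but it shows the general shape of copying known-valid programs into the directory cargo-fuzz reads seeds from:

```shell
#!/bin/sh
# Hypothetical sketch: seed a cargo-fuzz corpus from example jq programs.
# The paths and target name are assumptions, not the ones used in this PR.
set -eu

corpus=fuzz/corpus/parse
mkdir -p "$corpus"

i=0
for f in examples/*.jq; do
    [ -e "$f" ] || continue     # skip if the glob matched nothing
    # cargo-fuzz does not care about file names, only contents
    cp "$f" "$corpus/seed_$i"
    i=$((i + 1))
done
```

Checking the resulting corpus into the repository (or regenerating it deterministically) keeps the fuzzing setup reproducible.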

pcwizz (Author): I think as long as the Parser's interface documentation states that illegal tokens will result in panics, it is reasonable to narrow the fuzzing inputs to what should be valid tokens.

Implementing Arbitrary directly in the module certainly means a lot less boilerplate. If you are happy with including the arbitrary dependency, then this is a nice thing to do.

```diff
@@ -1,5 +1,7 @@
 //! Combined file loading, lexing, and parsing for multiple modules.

+#[cfg(feature = "arbitrary")]
```
pcwizz (Author): Aha, nice! That will also solve the dependency question.

@pcwizz (Author) commented Mar 6, 2025

> I have now implemented Arbitrary manually for token types. This should prevent the generation of many tokens that the lexer would never output, and which the parser is therefore not supposed to handle. I will also document these invariants in the type definitions of Token and StrPart.
>
> I noticed that the fuzzer now prints much less output. I would be very grateful if you check whether this is OK or whether I screwed up the Arbitrary implementation.
>
> Also, I had a bit of a problem generating a non-empty &str (in Token::arbitrary). I now just fail when a generated &str is empty, but this is a bit suboptimal, because it wastes computation power. If you have an idea how to do this better, I'm all ears!

If you can discard an input as quickly as possible, the fuzzing engine will decide it is uninteresting relatively quickly. To some extent, wasting some time generating invalid inputs is always going to be part of the game. A good starting corpus will save you some of this waste on startup. The most important thing is that it is reproducible.
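The "discard as quickly as possible" advice can be made concrete: run the cheapest byte-level checks before any expensive work. A stdlib-only sketch of that ordering, where `expensive_parse` is a hypothetical placeholder for jaq's lexer and parser:

```rust
// Sketch: cheap prefilter before expensive work, so uninteresting inputs
// are discarded as fast as possible. `expensive_parse` is a placeholder
// standing in for jaq's lexing and parsing.

fn expensive_parse(code: &str) -> usize {
    // placeholder "work": count whitespace-separated tokens
    code.split_whitespace().count()
}

fn fuzz_one(data: &[u8]) -> Option<usize> {
    // cheapest checks first: UTF-8 validity, then byte-level filters
    let code = std::str::from_utf8(data).ok()?;
    if code.contains('"') || code.contains("main") {
        return None; // known shallow-crash territory: bail out early
    }
    Some(expensive_parse(code))
}
```

Returning early like this (before allocation or parsing) is what lets the engine classify a filtered input as uninteresting almost for free.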

2 participants