-
-
Notifications
You must be signed in to change notification settings - Fork 344
Correctly process overlapping unblamed hunks #1983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly process overlapping unblamed hunks #1983
Conversation
Previously, `process_changes` assumed that `UnblamedHunk`s would never overlap. This constraint has been lifted. As a consequence, the high-level assumption that two different lines from a *Blamed File* can never map to the same line in a *Source File* has been lifted as well. There is not really a way to enforce this constraint in the blame algorithm alone as it heavily depends on the algorithm used for diffing.
This change also reflects the way `blamed_file` and `source_file` are used throughout the code.
This is fantastic, thanks so much :)! It took me a moment to understand, but I realise now how the slider problem could ambiguously assign ownership of a hunk to multiple commits at the same time. In that case, I assume the algorithm picks the first (i.e. earliest) one it encounters? In any case, I will take a closer look soon to get this merged quickly. PS: In theory, the biggest remaining correctness issue seems to be the lack of rename tracking. The rename-tracking part is easy to enable as part of the diff, it's just the algorithm which then has to be able to follow possibly multiple names through the graph at the same time. For instance, |
What could happen was a situation like the following which is an excerpt of the algorithm’s internal state when running
As you can see, the lines So it is not that a hunk has more than one suspect, but that two hunks have overlapping or even identical suspects. The previous version of Does that help? And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again, I think this absolutely good to go!
I did minor refactors, and hope they are an improvement im your eyes as well.
Does that help?
Even though I think it should and probably would if I wasn't so schwer von Begriff, I admit to not have a model in my head that would explain the difference between how both algorithms function.
However, when looking at the new version of the processing loop it is clear how changes
can now be re-used, something that previously wasn't possible.
And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of gix blame! 😄
That's true, it's probably not correctness in the sense that this PR is correctness, it's more like a shortcoming.
If one can think of the graph traversal having 'cursors' that move along multiple branches in the graph in parallel, then if each cursor would have a filename (the name of the blamed file as it is known currently), then maybe this isn't even this hard.
Somehow I feel that I should implement it just to finally learn (or get a feeling) of how all this truly works.
gix-blame/src/lib.rs
Outdated
//! * **Blamed File** | ||
//! - The file as it exists in `HEAD`. | ||
//! - the initial state with all lines that we need to associate with a *Blamed File*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! - the initial state with all lines that we need to associate with a *Blamed File*. | |
//! - the initial state with all lines that we need to associate with a *Source File*. |
That would make more sense to me, even though the previous version of this was also self-referential.
I made the change assuming it's correct, but if not let's fix it in a follow-up.
This PR finally, I hope!, fixes the last remaining larger issue in the blame algorithm.
Previously,
process_changes
assumed thatUnblamedHunk
s would never overlap and produced incorrect results if they did. This constraint has been lifted.As a consequence, the high-level assumption that two different lines from a Blamed File can never map to the same line in a Source File has been lifted as well. There is not really a way to enforce this constraint in the blame algorithm alone as it heavily depends on the algorithm used for diffing.
Running
gix blame CREDITS
in the Linux kernel, I was not able to detect any meaningful changes between this PR andmain
in terms of performance.process_changes
barely shows up in the profile (at about 0.1 % in both cases), so the impact seems to be negligible.Fixes #1847.