Skip to content

Correctly process overlapping unblamed hunks #1983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

cruessler
Copy link
Contributor

@cruessler cruessler commented May 3, 2025

  • Provide more context in assertion
  • Correctly process overlapping unblamed hunks
  • Use Blamed File and Source File more consistently

This PR finally, I hope!, fixes the last remaining larger issue in the blame algorithm.

Previously, process_changes assumed that UnblamedHunks would never overlap and produced incorrect results if they did. This constraint has been lifted.

As a consequence, the high-level assumption that two different lines from a Blamed File can never map to the same line in a Source File has been lifted as well. There is not really a way to enforce this constraint in the blame algorithm alone as it heavily depends on the algorithm used for diffing.

Running gix blame CREDITS in the Linux kernel, I was not able to detect any meaningful changes between this PR and main in terms of performance. process_changes barely shows up in the profile (at about 0.1 % in both cases), so the impact seems to be negligible.

Fixes #1847.

cruessler added 3 commits May 3, 2025 12:27
Previously, `process_changes` assumed that `UnblamedHunk`s would never
overlap. This constraint has been lifted.

As a consequence, the high-level assumption that two different lines
from a *Blamed File* can never map to the same line in a *Source File*
has been lifted as well. There is not really a way to enforce this
constraint in the blame algorithm alone as it heavily depends on the
algorithm used for diffing.
This change also reflects the way `blamed_file` and `source_file` are
used throughout the code.
@Byron
Copy link
Member

Byron commented May 3, 2025

This is fantastic, thanks so much :)!

It took me a moment to understand, but I realise now how the slider problem could ambiguously assign ownership of a hunk to multiple commits at the same time. In that case, I assume the algorithm picks the first (i.e. earliest) one it encounters?

In any case, I will take a closer look soon to get this merged quickly.

PS: In theory, the biggest remaining correctness issue seems to be the lack of rename tracking. The rename-tracking part is easy to enable as part of the diff, it's just the algorithm which then has to be able to follow possibly multiple names through the graph at the same time. For instance, main has file, but it was created from merging A and B and both renamed a -> file and b -> file while resolving merge conflicts manually. Then to find all hunks one would probably have to look for file a in branch A and file b in branch B.
So that part is probably less easy.

@cruessler
Copy link
Contributor Author

cruessler commented May 3, 2025

What could happen was a situation like the following which is an excerpt of the algorithm’s internal state when running cargo run -- blame Cargo.lock -L 3620,3660 in gitoxide.

hunks_to_blame = [
  UnblamedHunk {
    range_in_blamed_file: 3624..3627,
    suspects: [
      (Sha1(ca5b80174b73cc9ac162b3f33b5d3721ef936cb1), 942..945)
    ]
  },
  UnblamedHunk {
    range_in_blamed_file: 3643..3646,
    suspects: [
      (Sha1(ca5b80174b73cc9ac162b3f33b5d3721ef936cb1), 942..945)
    ]
  }
];

changes = [
  AddedOrReplaced(933..943, 0)
];

As you can see, the lines 942..945 appear twice in suspects, but differ with respect to range_in_blamed_file. These lines contain something like ]\n\n[package]\n which appears more than once in the file and which can be traced to more than one source. This situation can happen with merge commits where hunks take different paths to arrive at the same suspect.

So it is not that a hunk has more than one suspect, but that two hunks have overlapping or even identical suspects.

The previous version of process_changes would first process the first UnblamedHunk and then produce an incorrect result for the second one. In this example, both have to be matched against AddedOrReplaced(933..943, 0), not just the first one. The new version of process_changes matches each UnblamedHunk individually against all changes, so whether or not there are any overlaps does not matter (also, whether the hunks are sorted does not matter either).

Does that help?

And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of gix blame! 😄

all just minor tweaks.
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, I think this absolutely good to go!
I did minor refactors, and hope they are an improvement im your eyes as well.

Does that help?

Even though I think it should and probably would if I wasn't so schwer von Begriff, I admit to not have a model in my head that would explain the difference between how both algorithms function.

However, when looking at the new version of the processing loop it is clear how changes can now be re-used, something that previously wasn't possible.

And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of gix blame! 😄

That's true, it's probably not correctness in the sense that this PR is correctness, it's more like a shortcoming.
If one can think of the graph traversal having 'cursors' that move along multiple branches in the graph in parallel, then if each cursor would have a filename (the name of the blamed file as it is known currently), then maybe this isn't even this hard.
Somehow I feel that I should implement it just to finally learn (or get a feeling) of how all this truly works.

//! * **Blamed File**
//! - The file as it exists in `HEAD`.
//! - the initial state with all lines that we need to associate with a *Blamed File*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! - the initial state with all lines that we need to associate with a *Blamed File*.
//! - the initial state with all lines that we need to associate with a *Source File*.

That would make more sense to me, even though the previous version of this was also self-referential.
I made the change assuming it's correct, but if not let's fix it in a follow-up.

@Byron Byron enabled auto-merge May 4, 2025 19:06
@Byron Byron merged commit 83e1b73 into GitoxideLabs:main May 4, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Debug Assertions Triggered In gix-blame
2 participants