Correctly process overlapping unblamed hunks #1983

cruessler · 2025-05-03T11:26:48Z

Provide more context in assertion
Correctly process overlapping unblamed hunks
Use Blamed File and Source File more consistently

This PR finally, I hope!, fixes the last remaining larger issue in the blame algorithm.

Previously, process_changes assumed that UnblamedHunks would never overlap and produced incorrect results if they did. This constraint has been lifted.

As a consequence, the high-level assumption that two different lines from a Blamed File can never map to the same line in a Source File has been lifted as well. There is not really a way to enforce this constraint in the blame algorithm alone as it heavily depends on the algorithm used for diffing.

Running gix blame CREDITS in the Linux kernel, I was not able to detect any meaningful changes between this PR and main in terms of performance. process_changes barely shows up in the profile (at about 0.1 % in both cases), so the impact seems to be negligible.

Fixes #1847.

Previously, `process_changes` assumed that `UnblamedHunk`s would never overlap. This constraint has been lifted. As a consequence, the high-level assumption that two different lines from a *Blamed File* can never map to the same line in a *Source File* has been lifted as well. There is not really a way to enforce this constraint in the blame algorithm alone as it heavily depends on the algorithm used for diffing.

This change also reflects the way `blamed_file` and `source_file` are used throughout the code.

Byron · 2025-05-03T12:54:05Z

This is fantastic, thanks so much :)!

It took me a moment to understand, but I realise now how the slider problem could ambiguously assign ownership of a hunk to multiple commits at the same time. In that case, I assume the algorithm picks the first (i.e. earliest) one it encounters?

In any case, I will take a closer look soon to get this merged quickly.

PS: In theory, the biggest remaining correctness issue seems to be the lack of rename tracking. The rename-tracking part is easy to enable as part of the diff, it's just the algorithm which then has to be able to follow possibly multiple names through the graph at the same time. For instance, main has file, but it was created from merging A and B and both renamed a -> file and b -> file while resolving merge conflicts manually. Then to find all hunks one would probably have to look for file a in branch A and file b in branch B.
So that part is probably less easy.

cruessler · 2025-05-03T14:22:04Z

What could happen was a situation like the following which is an excerpt of the algorithm’s internal state when running cargo run -- blame Cargo.lock -L 3620,3660 in gitoxide.

hunks_to_blame = [
  UnblamedHunk {
    range_in_blamed_file: 3624..3627,
    suspects: [
      (Sha1(ca5b80174b73cc9ac162b3f33b5d3721ef936cb1), 942..945)
    ]
  },
  UnblamedHunk {
    range_in_blamed_file: 3643..3646,
    suspects: [
      (Sha1(ca5b80174b73cc9ac162b3f33b5d3721ef936cb1), 942..945)
    ]
  }
];

changes = [
  AddedOrReplaced(933..943, 0)
];

As you can see, the lines 942..945 appear twice in suspects, but differ with respect to range_in_blamed_file. These lines contain something like ]\n\n[package]\n which appears more than once in the file and which can be traced to more than one source. This situation can happen with merge commits where hunks take different paths to arrive at the same suspect.

So it is not that a hunk has more than one suspect, but that two hunks have overlapping or even identical suspects.

The previous version of process_changes would first process the first UnblamedHunk and then produce an incorrect result for the second one. In this example, both have to be matched against AddedOrReplaced(933..943, 0), not just the first one. The new version of process_changes matches each UnblamedHunk individually against all changes, so whether or not there are any overlaps does not matter (also, whether the hunks are sorted does not matter either).

Does that help?

And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of gix blame! 😄

all just minor tweaks.

Byron

Thanks again, I think this absolutely good to go!
I did minor refactors, and hope they are an improvement im your eyes as well.

Does that help?

Even though I think it should and probably would if I wasn't so schwer von Begriff, I admit to not have a model in my head that would explain the difference between how both algorithms function.

However, when looking at the new version of the processing loop it is clear how changes can now be re-used, something that previously wasn't possible.

And thanks for bringing up rename tracking! I hadn’t filed that under correctness issues, but you’re right, this should definitely be part of gix blame! 😄

That's true, it's probably not correctness in the sense that this PR is correctness, it's more like a shortcoming.
If one can think of the graph traversal having 'cursors' that move along multiple branches in the graph in parallel, then if each cursor would have a filename (the name of the blamed file as it is known currently), then maybe this isn't even this hard.
Somehow I feel that I should implement it just to finally learn (or get a feeling) of how all this truly works.

Byron · 2025-05-04T18:37:26Z

gix-blame/src/lib.rs

 //! * **Blamed File**
+//!    - The file as it exists in `HEAD`.
+//!    - the initial state with all lines that we need to associate with a *Blamed File*.


Suggested change

//! - the initial state with all lines that we need to associate with a *Blamed File*.

//! - the initial state with all lines that we need to associate with a *Source File*.

That would make more sense to me, even though the previous version of this was also self-referential.
I made the change assuming it's correct, but if not let's fix it in a follow-up.

cruessler added 3 commits May 3, 2025 12:27

Provide more context in assertion

d46766a

Use *Blamed File* and *Source File* more consistently

2f6786b

This change also reflects the way `blamed_file` and `source_file` are used throughout the code.

Thanks clippy

ee6f5cc

refactor

b2121bc

all just minor tweaks.

Byron approved these changes May 4, 2025

View reviewed changes

Byron enabled auto-merge May 4, 2025 19:06

Byron merged commit 83e1b73 into GitoxideLabs:main May 4, 2025
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correctly process overlapping unblamed hunks #1983

Correctly process overlapping unblamed hunks #1983

cruessler commented May 3, 2025 •

edited

Loading

Byron commented May 3, 2025 •

edited

Loading

cruessler commented May 3, 2025 •

edited

Loading

Byron left a comment

Byron May 4, 2025

	//! - the initial state with all lines that we need to associate with a Blamed File.
	//! - the initial state with all lines that we need to associate with a Source File.

Correctly process overlapping unblamed hunks #1983

Correctly process overlapping unblamed hunks #1983

Conversation

cruessler commented May 3, 2025 • edited Loading

Byron commented May 3, 2025 • edited Loading

cruessler commented May 3, 2025 • edited Loading

Byron left a comment

Choose a reason for hiding this comment

Byron May 4, 2025

Choose a reason for hiding this comment

cruessler commented May 3, 2025 •

edited

Loading

Byron commented May 3, 2025 •

edited

Loading

cruessler commented May 3, 2025 •

edited

Loading