Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle interaction between gang blocks, copies, and FDT. #17123

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

pcd1193182
Copy link
Contributor

This change is built on top of #17073, because the changes to add the gang_copies property interact tightly with this change, in addition to the gang blocks test group.

Motivation and Context

See #17070. tl;dr: If you have ganging and multiple different values of the copies property while you use dedup, you can sometimes end up with BPs with a mix of gang and non-gang DVAs. That is illegal in ZFS: in debug mode, it'll trip an assertion when you're syncing the write out. In non-debug mode, it will make it to disk, but cause silent and confusing errors later when BP_IS_GANG is only partially correct.

Note that this change doesn't address cases 2 and 4 from #17070; those are trickier to deal with, and are also less problematic for users. Occasionally not getting as many copies as you asked for is a much less severe bug than silently invalid blkptrs. It also doesn't address the nopwrite case.

Description

This change forces updates of existing FDT entries to only add more DVAs of the same type; if there are already gang DVAs in the FDT, we require that the new allocations also be gangs. if there are non-gang DVAs in the FDT, then this allocation may not be gangs. If it would gang, we have to redo the whole write as a non-dedup write.

This change also contains a new test that covers the relevant cases. I verified that without the patch, it kernel panics when running the test.

How Has This Been Tested?

The added test, plus manual tests/

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not deeply familiar with ganging code, but I have doubts that zp_must_gang is viable. As I understand, to provide full redundancy we must not just provide some ganged pointer, but allocate all the gang children of exactly the same sizes and extend all existing gang header copies with those DVAs, which we can not do without rewriting them also. I think the only thing we can realistically do is prevent addition of copies writing and so dedup in general in that case.

@pcd1193182
Copy link
Contributor Author

As we discussed in the issue, the goal here isn't really to provide 100% of the redundancy the user asked for. That's challenging when you need to read in the entire gang tree to even determine what redundancy level is currently provided. The goal of this PR is just to prevent the creation of mixed gang/non-gang BPs, which are illegal in ZFS. That can be done even if the two gang trees have different depths/layouts.

@amotin
Copy link
Member

amotin commented Mar 10, 2025

That can be done even if the two gang trees have different depths/layouts.

No, I think it can not. You can not have in one block pointer 3 DVAs with non-identical content. zio_gang_tree_assemble() will read only one of them, that in your case won't include all the children. So some remaining will never be read or freed.

@pcd1193182
Copy link
Contributor Author

Hm, I see what you mean. It partially works for reads; if first level of gang headers is corrupt on the earlier DVAs, it'll read in the ones for the later DVAs, but not if anything else is wrong. And frees will free all the gang headers at the first level (because it has to free all the DVAs of each BP it knows about) but it won't know about the child BPs in any gang headers but the first one.

I think we could probably write additional copies of the gang header we already have in the BP? But I don't know if that's actually better than just not deduplicating the write.

@amotin
Copy link
Member

amotin commented Mar 10, 2025

I think we could probably write additional copies of the gang header we already have in the BP?

I don't think anything not overwriting the existing gang headers can be correct. And I don't think we can safely overwrite them. Not mentioning we'd have to allocate new block in exactly the same chunks, which would be quite a pain.

But I don't know if that's actually better than just not deduplicating the write.

That is why I proposed not deduplicating it. Not perfect, but a simple choice for a pretty rare scenario.

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Mar 11, 2025
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Mar 11, 2025
Paul Dagnelie added 5 commits March 13, 2025 15:27
The redundant_metadata setting in ZFS allows users to trade resilience for
performance and space savings. This applies to all data and metadata blocks in
zfs, with one exception: gang blocks. Gang blocks currently just take the
copies property of the IO being ganged and, if it's 1, sets it to 2. This
means that we always make at least two copies of a gang header, which is good
for resilience. However, if the users care more about performance than
resilience, their gang blocks will be even more of a penalty than usual.

We add logic to calculate the number of gang headers copies directly, and
store it as a separate IO property. This is stored in the IO properties and
not calculated when we decide to gang because by that point we may not have
easy access to the relevant information about what kind of block is being
stored. We also check the redundant_metadata property when doing so, and use
that to decide whether to store an extra copy of the gang headers, compared to
the underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change forces updates of existing FDT entries to only add more DVAs
of the same type; if there are already gang DVAs in the FDT, we require
that the new allocations also be gangs. if there are non-gang DVAs in
the FDT, then this allocation may not be gangs. If it would gang, we
have to redo the whole write as a non-dedup write.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
@pcd1193182
Copy link
Contributor Author

I ran into an exciting one with this last commit. The bug presents as follows:

  • One write comes in with copies = 1, we start an IO for it (ZIO#1)
  • Another write comes in with copies = 2, we start another IO and set that to be the dde_lead_zio (ZIO#2) (See zio_ddt_write)
  • For whatever reason (in this case, ZIO#1 gangs, which means we can't safely add to the DVAs), ZIO#2 fails
  • ZIO#1 finishes, copies its dde_phys into dde_orig_phys (See zio_ddt_child_write_done)
  • ZIO#1 adds references to dde_phys (zio_ddt_child_write_done)
  • ZIO#2 calls its done func now that its child is done, it rolls the dde_phys back to the dde_orig_phys , without the references (zio_ddt_child_write_done)
  • Now the refcount is zero, so we free it immediately, even though ZIO#1 finished successfully and the block should stick around

I'm pretty sure this bug dates back to the start of FDT, and there may be more refcount-loss bugs like this lurking; I'm going to look into that next. The fix in this case is to add the references to dde_phys before we copy it into the orig_phys.

@pcd1193182 pcd1193182 added the Status: Code Review Needed Ready for review and testing label Mar 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants