
Implement allocation size ranges and use for gang leaves #17111

Open · wants to merge 6 commits into base: master

Conversation

pcd1193182 (Contributor):

Motivation and Context

This PR pulls out the dynamic allocation size changes from #17004 into their own PR. It also removes the de-ganging logic, which isn't necessary and is difficult to implement without some pretty unpleasant-looking code.

Description

We teach the metaslab code how to allocate a range of sizes, not just a single target size, and use that in the gang block code to try to get it to allocate fewer leaves when possible.

How Has This Been Tested?

Manual testing + the new gang block test.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


module/zfs/zio.c Outdated
Comment on lines 3253 to 3260
for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
	uint64_t asize = DVA_GET_ASIZE(&bp->blk_dva[d]);
	if (asize > allocated_size)
		allocated_size = asize;
}
boolean_t allocated = allocated_size != 0;

uint64_t lsize = allocated ? allocated_size : min_size;
Member (amotin):

I think this code is broken. Once you pass psizes to metaslab_alloc_range(), it converts them to asizes for each vdev and allocates something, but it does not translate the asizes back into a psize. I am not even sure who fills in the PSIZE in the block pointer in this case, or how, but it definitely cannot be the maximum of the DVA asizes you got, as you calculate here.

Contributor Author (pcd1193182):

I don't think that's the case. The metaslab code allocates a chunk of space. That size is stored in the ASIZE of the DVA it filled in (later, the ASIZE can change to reflect things like gang headers, but for now it's the raw allocation size). The only part of that logic that changed in this code is that the asize is now modified by the actual allocation path to return precisely how big of a range it allocated (see metaslab_alloc_dva).

This code gets the minimum (not the maximum) of the sizes in the DVAs that were allocated. Then, because the gang leaves are not compressed, we pass that size as the size to write for the child IO. This is fundamentally the same thing as the previous code, with the one added tweak that we need to get the actual allocated size out somehow. Rather than having an array of outparams of sizes for each of the DVAs that were allocated, I just retrieve it directly from the DVAs, where it is stored anyway.

I'm not sure where PSIZE comes into it, or why PSIZE should be different than ASIZE in this case.

Contributor Author (pcd1193182):

Ah, no, this does get the maximum, not the minimum. That part is definitely wrong.

Member (amotin):

No, and no. IIRC the ASIZE on RAIDZ includes the parity and padding required for the requested PSIZE. I think your code is indeed allocating reasonable ASIZEs, simply because, due to padding, RAIDZ spacemaps should never have empty holes of the wrong size. But if you got, say, 16KB of ASIZE for some vdev, then without knowing the vdev's topology you have no idea how much actual data you can write there to fill the PSIZE field in the BP and pass the data to zio_write().

And to my eye, the allocated_size variable is really a maximum, not a minimum.

Contributor Author (pcd1193182):

Ah, RAIDZ (and I guess dRAID as well). That's true, we do need to somehow account for the extra parity overhead of raidz devices after the asize gets converted in metaslab_alloc_dva(). Right now the BP's PSIZE is just set from the io_size, so it doesn't need to know about raidz/draid.

I can think of two ways to fix this. One is to not have real dynamic allocations, but instead "try allocating a bunch of different sizes and whichever one works, great". We would know the psize that worked, so we could just use it. That, unfortunately, is pretty awful from a performance perspective. Running the whole allocation path several times is a lot more expensive than just doing a slightly smarter check in the block picker. The other option is that we need to add an inverse asize-to-psize transformation, and use that to pass the effective allocated psize all the way out. raidz and draid are the only two vdev types that implement vdev_op_asize, so that shouldn't be too bad. I'll take a crack at that.

module/zfs/zio.c Outdated
Comment on lines 3262 to 3263
zio_t *cio = zio_write(zio, spa, txg, bp, has_data ?
	abd_get_offset(pio->io_abd, pio->io_size - resid) : NULL,
	lsize, lsize, &zp, zio_write_gang_member_ready, NULL,
	zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
	ZIO_GANG_CHILD_FLAGS(pio) |
	(allocated ? ZIO_FLAG_PREALLOCATED : 0), &pio->io_bookmark);
Member (amotin):

I don't like the ZIO_FLAG_PREALLOCATED hack to zio_write(). I think the proper way would be to use zio_rewrite() if allocation succeeded or zio_write() otherwise.

Contributor Author (pcd1193182):

I considered trying to use something besides zio_write, but rewrite IOs don't quite do the same things as write IOs, and I estimated that it would be more invasive to make all the necessary changes to the rewrite logic, rather than to use this approach.

Member (@amotin, Mar 3, 2025):

Looking at ZIO_WRITE_PIPELINE vs ZIO_REWRITE_PIPELINE, the latter is missing everything related to space allocation. I might be missing something in this area, but just above you can see that for the gang header, which we allocated explicitly, we use zio_rewrite(), which makes me think the same should apply here. If we fail to allocate, though, we'd need to go another full round via zio_write() to allow it to create another gang header, etc.

PS: Looking again, I see that zio_rewrite() does miss some arguments, like ready, but I wonder if that is just because they were never needed, not sure.

Contributor Author (pcd1193182):

Yeah, the relevant extra parameters in this case are the zio_prop_t and the ready function. The ready function isn't too bad; zio_write_gang_member_ready basically just updates the parent's asize to reflect the sizes of the gang child IOs, and that could be called as soon as the allocation is done. io_prop is the trickier one; it is used in a lot of places, and I was concerned about correctly tracking down all the ramifications of having rewrite IOs with some of those properties set.

The other relevant factor is space accounting; because the REWRITE_PIPELINE doesn't have the DVA_ALLOCATE stage, it's considered non-allocating by the zio pipeline. This means it doesn't quite do the same things with the write throttle and reservation code as these pre-allocated gang children. It could be made to work, of course.

Contributor Author (pcd1193182):

Actually, looking back, I did this experiment earlier, and just setting io_prop directly didn't seem to cause any serious issues. I'll try converting this back to that approach, though I will also want to do some more inspection to see if it might cause any other sneaky issues.

Contributor Author (pcd1193182):

Right, I'm remembering the problems this runs into. Rewrite IOs do technically go through the zio_write_compress stage, but the only thing they do there is wait for any children they have to hit it. Because they're not IO_IS_ALLOCATING, they don't do anything else in that step, including setting up the other fields of the BP (like checksum, type, etc.).

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Mar 3, 2025
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Mar 3, 2025
Paul Dagnelie added 3 commits March 12, 2025 12:19
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Paul Dagnelie added 3 commits March 19, 2025 14:02
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>