
Implement allocation size ranges and use for gang leaves #17111

Open · wants to merge 6 commits into base: master

Conversation

pcd1193182 (Contributor):

Motivation and Context

This PR pulls out the dynamic allocation size changes from #17004 into their own PR. It also removes the de-ganging logic, which isn't necessary and is difficult to implement without some pretty unpleasant-looking code.

Description

We teach the metaslab code how to allocate a range of sizes, not just a single target size, and use that in the gang block code to try to get it to allocate fewer leaves when possible.

How Has This Been Tested?

Manual testing + the new gang block test.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


module/zfs/zio.c Outdated
Comment on lines 3253 to 3260
for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
	uint64_t asize = DVA_GET_ASIZE(&bp->blk_dva[d]);
	if (asize > allocated_size)
		allocated_size = asize;
}
boolean_t allocated = allocated_size != 0;

uint64_t lsize = allocated ? allocated_size : min_size;
Member (amotin):

I think this code is broken. Once you pass psizes to metaslab_alloc_range(), it converts them to asizes for each vdev and allocates something, but it does not translate the asizes back into a psize. I am not even sure who fills in the PSIZE in the block pointer in this case, or how, but it definitely cannot be the maximum of the DVA asizes you got, as you calculate here.

Contributor Author (pcd1193182):

I don't think that's the case. The metaslab code allocates a chunk of space. That size is stored in the ASIZE of the DVA it filled in (later, the ASIZE can change to reflect things like gang headers, but for now it's the raw allocation size). The only part of that logic that changed in this code is that the asize is now modified by the actual allocation path to return precisely how big of a range it allocated (see metaslab_alloc_dva).

This code gets the minimum (not the maximum) of the sizes in the DVAs that were allocated. Then, because the gang leaves are not compressed, we pass that size as the size to write for the child IO. This is fundamentally the same thing as the previous code, with the one added tweak that we need to get the actual allocated size out somehow. Rather than having an array of outparams of sizes for each of the DVAs that were allocated, I just retrieve it directly from the DVAs, where it is stored anyway.

I'm not sure where PSIZE comes into it, or why PSIZE should be different than ASIZE in this case.

Contributor Author (pcd1193182):

Ah, no, this does get the maximum, not the minimum. That part is definitely wrong.

Member (amotin):

No, and no. IIRC the ASIZE on RAIDZ includes the parity and padding required for the requested PSIZE. I think your code is indeed allocating reasonable ASIZEs, simply because, due to padding, RAIDZ spacemaps should never have empty holes of the wrong size. But if you got, say, 16KB of ASIZE for some vdev, then without knowing the vdev's topology you have no idea how much actual data you can write there to fill the PSIZE field in the BP and pass the data to zio_write().

And to my eye, the allocated_size variable is really a maximum, not a minimum.

Contributor Author (pcd1193182):

Ah, RAIDZ (and I guess dRAID as well). That's true, we do need to somehow account for the extra parity overhead of raidz devices after the asize gets converted in metaslab_alloc_dva(). Right now the BP's PSIZE is just set from the io_size, so it doesn't need to know about raidz/draid.

I can think of two ways to fix this. One is to not have real dynamic allocations, but instead "try allocating a bunch of different sizes and whichever one works, great". We would know the psize that worked, so we could just use it. That, unfortunately, is pretty awful from a performance perspective. Running the whole allocation path several times is a lot more expensive than just doing a slightly smarter check in the block picker. The other option is that we need to add an inverse asize-to-psize transformation, and use that to pass the effective allocated psize all the way out. raidz and draid are the only two vdev types that implement vdev_op_asize, so that shouldn't be too bad. I'll take a crack at that.

module/zfs/zio.c Outdated
Comment on lines 3262 to 3263
zio_t *cio = zio_write(zio, spa, txg, bp, has_data ?
	abd_get_offset(pio->io_abd, pio->io_size - resid) : NULL,
	lsize, lsize, &zp, zio_write_gang_member_ready, NULL,
	zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
	ZIO_GANG_CHILD_FLAGS(pio) |
	(allocated ? ZIO_FLAG_PREALLOCATED : 0), &pio->io_bookmark);
Member (amotin):

I don't like the ZIO_FLAG_PREALLOCATED hack to zio_write(). I think the proper way would be to use zio_rewrite() if allocation succeeded or zio_write() otherwise.

Contributor Author (pcd1193182):

I considered trying to use something besides zio_write, but rewrite IOs don't quite do the same things as write IOs, and I estimated that it would be more invasive to make all the necessary changes to the rewrite logic, rather than to use this approach.

Member (@amotin, Mar 3, 2025):

Looking at ZIO_WRITE_PIPELINE vs ZIO_REWRITE_PIPELINE, the latter is missing everything related to space allocation. I might be missing something in this area, but just above you can see that for the gang header, which we allocated explicitly, we use zio_rewrite(), which makes me think the same should apply here. If we fail to allocate, though, we'd need to go another full round via zio_write() to allow it to create another gang header, etc.

PS: Looking again, I see that zio_rewrite() does miss some arguments, like ready, but I wonder if that is just because they were never needed, not sure.

Contributor Author (pcd1193182):

Yeah, the relevant extra parameters in this case are the zio_prop_t and the ready function. The ready function isn't too bad; zio_write_gang_member_ready basically just updates the parent's asize to reflect the sizes of the gang child IOs, and that could be called as soon as the allocation is done. io_prop is the trickier one; it is used in a lot of places, and I was concerned about correctly tracking down all the ramifications of having rewrite IOs with some of those properties set.

The other relevant factor is space accounting; because the REWRITE_PIPELINE doesn't have the DVA_ALLOCATE stage, it's considered non-allocating by the zio pipeline. This means it doesn't quite do the same things with the write throttle and reservation code as these pre-allocated gang children. It could be made to work, of course.

Contributor Author (pcd1193182):

Actually, looking back, I did this experiment earlier, and just setting io_prop directly didn't seem to cause any serious issues. I'll try converting this back to that approach, though I will also want to do some more inspection to see if it might cause any other sneaky issues.

Contributor Author (pcd1193182):

Right, I'm remembering the problems this runs into. Rewrite IOs do technically go through the zio_write_compress stage, but the only thing they do there is wait for any children they have to hit it. Because they're not IO_IS_ALLOCATING, they don't do anything else in that step, including setting up the other fields of the BP (like checksum, type, etc.).

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Mar 3, 2025
@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Mar 3, 2025
Paul Dagnelie added 3 commits March 12, 2025 12:19
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Paul Dagnelie added 3 commits March 19, 2025 14:02
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>