Implement allocation size ranges and use for gang leaves #17111
base: master
Conversation
module/zfs/zio.c
	for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
		uint64_t asize = DVA_GET_ASIZE(&bp->blk_dva[d]);
		if (asize > allocated_size)
			allocated_size = asize;
	}
	boolean_t allocated = allocated_size != 0;

	uint64_t lsize = allocated ? allocated_size : min_size;
I think this code is broken. Once you passed psizes to metaslab_alloc_range(), it converted them to asizes for each vdev and allocated something. But it does not translate the asizes back into a psize. I am not even sure who fills in the PSIZE in the block pointer in this case, or how, but it definitely cannot be the maximum of the DVA asizes you got, as you calculate.
I don't think that's the case. The metaslab code allocates a chunk of space. That size is stored in the ASIZE of the DVA it filled in (later, the ASIZE can change to reflect things like gang headers, but for now it's the raw allocation size). The only part of that logic that changed in this code is that the asize is now modified by the actual allocation path to return precisely how big of a range it allocated (see metaslab_alloc_dva).
This code gets the minimum (not the maximum) of the sizes in the DVAs that were allocated. Then, because the gang leaves are not compressed, we pass that size as the size to write for the child IO. This is fundamentally the same thing as the previous code, with the one added tweak that we need to get the actual allocated size out somehow. Rather than having an array of outparams of sizes for each of the DVAs that were allocated, I just retrieve it directly from the DVAs, where it is stored anyway.
I'm not sure where PSIZE comes into it, or why PSIZE should be different than ASIZE in this case.
Ah, no, this does get the maximum, not the minimum. That part is definitely wrong.
No, and no. IIRC the ASIZE of RAIDZ includes parity and padding required for the requested PSIZE. I think your code is indeed allocating reasonable ASIZEs, simply because, due to padding, RAIDZ spacemaps should never have empty holes of the wrong size. But if you got some 16KB of ASIZE for some vdev, without knowing the vdev's topology you have no idea how much actual data you can write there to fill the PSIZE field in the BP and pass the data to zio_write().
And to my eyes the allocated_size variable is really a maximum, not a minimum.
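To illustrate the min-vs-max point being made here, a small self-contained sketch (not the actual OpenZFS code; the helper name and plain-array representation are assumptions for illustration): taking the minimum of the per-DVA allocated sizes, rather than the maximum, yields a size that fits inside every allocated DVA.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: given the per-DVA
 * allocated sizes (0 for DVAs that were not filled in), return the
 * smallest nonzero size, i.e. a size that fits in every allocation.
 * Returns 0 if nothing was allocated.
 */
static uint64_t
min_allocated_size(const uint64_t *asizes, int ndvas)
{
	uint64_t min = 0;

	for (int d = 0; d < ndvas; d++) {
		if (asizes[d] == 0)
			continue;
		if (min == 0 || asizes[d] < min)
			min = asizes[d];
	}
	return (min);
}
```

With this shape, a caller that got 16K on one vdev and 12K on another would write a 12K gang leaf, which both allocations can hold.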
Ah, RAIDZ (and I guess dRAID as well). That's true, we do need to somehow account for the extra parity overhead of raidz devices after the asize gets converted in metaslab_alloc_dva(). Right now the BP's PSIZE is just set from the io_size, so it doesn't need to know about raidz/draid.
I can think of two ways to fix this. One is to not have real dynamic allocations, but instead "try allocating a bunch of different sizes, and whichever one works, great". We would know the psize that worked, so we could just use it. That, unfortunately, is pretty awful from a performance perspective: running the whole allocation path several times is a lot more expensive than doing a slightly smarter check in the block picker. The other option is to add an inverse asize-to-psize transformation and use it to pass the effective allocated psize all the way out. raidz and draid are the only two vdev types that implement vdev_op_asize, so that shouldn't be too bad. I'll take a crack at that.
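One cheap way to get such an inverse without duplicating each vdev's layout math is to exploit the fact that the psize-to-asize transform is monotone, and binary-search for the largest psize whose asize still fits in the allocated space. A self-contained sketch (the function names and the toy raidz-like transform below are assumptions for illustration, not OpenZFS code):

```c
#include <stdint.h>

/* Toy stand-in for a vdev's psize->asize transform (monotone). */
typedef uint64_t (*asize_fn_t)(uint64_t psize);

/*
 * Toy raidz-like transform for illustration: round psize up to 4K
 * sectors and add one parity sector per group of data_cols sectors.
 */
static uint64_t
toy_raidz_asize(uint64_t psize)
{
	const uint64_t sector = 4096;
	const uint64_t data_cols = 3;	/* assumed layout */
	uint64_t sectors = (psize + sector - 1) / sector;
	uint64_t parity = (sectors + data_cols - 1) / data_cols;

	return ((sectors + parity) * sector);
}

/*
 * Binary-search the largest psize (a multiple of `align`) whose
 * asize does not exceed the space we actually allocated.
 */
static uint64_t
psize_from_asize(asize_fn_t asize_fn, uint64_t allocated, uint64_t align)
{
	uint64_t lo = 0, hi = allocated / align;

	while (lo < hi) {
		uint64_t mid = (lo + hi + 1) / 2;
		if (asize_fn(mid * align) <= allocated)
			lo = mid;
		else
			hi = mid - 1;
	}
	return (lo * align);
}
```

A direct closed-form inverse per vdev type would be faster, but the search version only needs the existing forward transform, which is why it may be attractive as a first cut.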
module/zfs/zio.c
 	zio_t *cio = zio_write(zio, spa, txg, bp, has_data ?
 	    abd_get_offset(pio->io_abd, pio->io_size - resid) : NULL,
 	    lsize, lsize, &zp, zio_write_gang_member_ready, NULL,
 	    zio_write_gang_done, &gn->gn_child[g], pio->io_priority,
-	    ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
+	    ZIO_GANG_CHILD_FLAGS(pio) |
+	    (allocated ? ZIO_FLAG_PREALLOCATED : 0), &pio->io_bookmark);
I don't like the ZIO_FLAG_PREALLOCATED hack to zio_write(). I think the proper way would be to use zio_rewrite() if allocation succeeded, or zio_write() otherwise.
I considered trying to use something besides zio_write, but rewrite IOs don't quite do the same things as write IOs, and I estimated that it would be more invasive to make all the necessary changes to the rewrite logic, rather than to use this approach.
Looking at ZIO_WRITE_PIPELINE vs ZIO_REWRITE_PIPELINE, the latter is missing everything related to space allocation. I might be missing something in this area, but just above you can see that for the gang header, which we allocated explicitly, we use zio_rewrite(), which makes me think it should be the same here. If we fail to allocate, though, we'd need to go another full round via zio_write() to allow it to create another gang header, etc.
PS: Looking again, I see that zio_rewrite() does miss some arguments, like ready, but I wonder if that is just because they were never needed; not sure.
Yeah, the relevant extra parameters in this case are the zio_prop_t and the ready function. The ready function isn't too bad; zio_write_gang_member_ready basically just updates the parent's asize to reflect the sizes of the gang child IOs. That could be called as soon as we have the allocation done. io_prop is the trickier one; it is used in a lot of places, and I was concerned about correctly tracking down all the ramifications of having rewrite IOs with some of those properties set.
The other relevant factor is space accounting: because the REWRITE_PIPELINE doesn't have the DVA_ALLOCATE stage, it's considered non-allocating by the zio pipeline. This means it doesn't quite do the same things with the write throttle and reservation code as these pre-allocated gang children. It could be made to work, of course.
Actually, looking back, I did this experiment earlier, and just setting io_prop directly didn't seem to cause any serious issues. I'll try converting this back to that approach, though I will also want to do some more inspection to see if it might cause any other sneaky issues.
Right, I'm remembering the problems this runs into. Rewrite IOs do technically go through the zio_write_compress stage, but the only thing they do there is wait for any children they have to hit it. Because they're not IO_IS_ALLOCATING, they don't do anything else in that step, including setting up the other fields of the BP (like checksum, type, etc.).
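The distinction hinges on the IO's original pipeline containing the DVA-allocate stage. A simplified sketch of that check (the names echo the ZFS ones, but these toy definitions are assumptions for illustration, not the real zio_impl.h values):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for pipeline stage bits. */
#define TOY_STAGE_WRITE_COMPRESS	(1u << 0)
#define TOY_STAGE_DVA_ALLOCATE		(1u << 1)

/* Write pipeline allocates; rewrite pipeline does not. */
#define TOY_WRITE_PIPELINE \
	(TOY_STAGE_WRITE_COMPRESS | TOY_STAGE_DVA_ALLOCATE)
#define TOY_REWRITE_PIPELINE \
	(TOY_STAGE_WRITE_COMPRESS)

/*
 * Mirrors the spirit of IO_IS_ALLOCATING: an IO is "allocating" iff
 * its original pipeline includes the DVA-allocate stage. Rewrite IOs
 * fail this test, so they skip BP setup in the compress stage.
 */
static bool
toy_io_is_allocating(uint32_t orig_pipeline)
{
	return ((orig_pipeline & TOY_STAGE_DVA_ALLOCATE) != 0);
}
```

This is why simply swapping in zio_rewrite() is not enough: the pipeline mask itself would have to change, or the BP-setup work would have to move somewhere the rewrite path actually executes.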
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Motivation and Context
This PR pulls out the dynamic allocation size changes from #17004 into their own PR. It also removes the de-ganging logic, which isn't necessary and is difficult to implement without some pretty unpleasant-looking code.
Description
We teach the metaslab code how to allocate a range of sizes, not just a single target size, and use that in the gang block code to try to get it to allocate fewer leaves when possible.
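As a rough illustration of the idea (a toy free-segment picker, not the actual metaslab code; all names here are assumptions): given a [min, max] size range, the allocator can satisfy the request from any segment with at least min bytes free, and it reports back how much it actually carved out so a larger-than-minimum gang leaf can be written.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy free segment. */
typedef struct {
	uint64_t start;
	uint64_t size;
} toy_seg_t;

/*
 * Allocate between min_size and max_size bytes from the first segment
 * that can hold at least min_size. Writes the granted size to *granted
 * and returns the offset, or UINT64_MAX on failure.
 */
static uint64_t
toy_alloc_range(toy_seg_t *segs, size_t nsegs, uint64_t min_size,
    uint64_t max_size, uint64_t *granted)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].size < min_size)
			continue;
		uint64_t take = segs[i].size < max_size ?
		    segs[i].size : max_size;
		uint64_t off = segs[i].start;
		segs[i].start += take;
		segs[i].size -= take;
		*granted = take;
		return (off);
	}
	return (UINT64_MAX);
}
```

The payoff for ganging is that one large-ish grant can replace several fixed-size leaves: a single-size allocator asked for 16K either succeeds or fails, while a range allocator asked for [16K, 128K] can hand back whatever contiguous space it finds in that window.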
How Has This Been Tested?
Manual testing + the new gang block test.