zfs list used/free/refer up to 10% *smaller* than sendfile size for large recordsize pools/datasets (75% for draid!) #14420
Cause of issue

Thanks to assistance from @allanjude and @rincebrain, this occurs due to the following:
Brief summary of the above:

In a mailing list thread related to this similar issue, @ahrens had a few things to say with respect to the confusion stemming from 'odd' observations resulting from this issue:
...and so here is my attempted crack at it:

Proposed solution

edit - see revised solution below (old proposal here)

Create and expose a zpool property called `raidzshift` (the expected average block size for raidz vdevs, as a power of two), defaulting to 17 (128K) unless specified during pool creation. This value would then be used instead of the hardcoded 128K value in the existing [vdev_set_deflate_ratio](https://github.com/openzfs/zfs/blob/master/module/zfs/vdev.c#L1873) call, enabling more accurate `zfs list` used/free/refer, `du`, etc. reporting whenever the raidz vdev's average block size is expected to fall close to the specified `raidzshift` value. This should correct accuracy at both ends of the spectrum:
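To make the shape of the change concrete, here is a minimal sketch (not a tested patch) of what that call site could look like, assuming a hypothetical `spa_raidzshift` field carrying the new property; today the in-tree function hardcodes `1 << 17` (128K):

```c
/*
 * Sketch only: vdev_set_deflate_ratio() currently hardcodes 1 << 17 (128K).
 * spa_raidzshift is a hypothetical field holding the proposed `raidzshift`
 * pool property, defaulting to 17 so that existing behavior is unchanged.
 */
void
vdev_set_deflate_ratio(vdev_t *vd)
{
	if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
		uint64_t blksz = 1ULL << vd->vdev_spa->spa_raidzshift;

		/* deflate ratio: logical bytes per 512-byte unit of allocated space */
		vd->vdev_deflate_ratio = blksz /
		    (vdev_psize_to_asize(vd, blksz) >> SPA_MINBLOCKSHIFT);
	}
}
```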
Making this value adjustable on the fly will likely be more complex than having it specified at pool creation, as suggested by the warning note in the code.
I think there may be some issues with letting users modify this value on the fly; more investigation will be required.
A few additional points and a revised proposal:
For further pathfinding I repeated a portion of the experiment on a draidz3 of matching parameters (no spares for simplicity). The higher write amplification for smaller blocks on draid is as expected; however, it appears that draid too has a space calculation that assumes a 128K stripe width. The -50% and -75% extreme cases correlate to stripe widths of 256K (2x 128K) and 512K (4x 128K). Raising ashift lowers the width necessary to trigger the effect (e.g. 16d + ashift=14), and since this effect also impacts the free space reported, a 16d ashift=14 draid will have its reported free space cut in half regardless of the intended use case or recordsize employed. That example draid filled with 256K files would report half the expected used/refer values as well (e.g. a zfs send stream would be 2x the size indicated by the refer of its associated snapshot).
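A worked restatement of that 16d + ashift=14 case (my own arithmetic, not taken from the draid code):

$$
\text{minimum data stripe} = 16 \times 2^{14}\,\text{B} = 256\text{K}, \qquad \frac{128\text{K (assumed block)}}{256\text{K (stripe)}} = 0.5
$$

so an accounting model that assumes 128K blocks reports only half the space that a workload writing full 256K stripes will actually be able to use.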
Revised proposed solution

Given the following:

My revised recommendation is to do away with the `deflate_ratio`.
This change will place the raidz and draid behaviors in line with other zfs vdev types, as well as with other file systems.
This area has certainly been a source of confusion. I agree that from the end user perspective it's not at all obvious what factors come into play when estimating usable capacity, particularly for complicated layouts. This has gotten even more involved with dRAID, see #13727 (comment) for an explanation of why. And raidz expansion will complicate it further, as you pointed out. I think getting rid of the `deflate_ratio`
My initial 'light touch' and 'even lighter touch' takes based on the present observed behavior and some code review, and assuming
Option 2 probably makes more sense, as most pools today are not so significantly impacted, while newly created pools with larger blocks and ashifts (which may be more likely to see the effects) would naturally have this issue rectified by the change. This will slowly mitigate future confusion caused by 'negative' sizes, as they would no longer be possible, leaving only the existing behavior of smaller blocks and padded stripes having their expected amplified effect on consumption. It also makes estimations of usage far more predictable and consistent, given that dsize/used/free/refer would no longer have a progressively irrelevant ratio applied to them from pool to pool.
Sounds like people are confused by (uncompressed) files "using" less space than expected on raidz when the recordsize property has been increased from the default of 128k. But we are OK with files "using" more space than expected?

If that's the case, I think a simple solution would be to base the deflate ratio on recordsize=1M (the highest that it can be set without a tunable). Then, by default (recordsize=128k), files will use more space than expected. That might confuse a lot of people, since it's the default. We might consider changing the default recordsize to 1M at the same time that we change the deflate ratio to be based on 1MB blocks.

Removing the deflate ratio entirely (setting it to 1) seems like it would be even more confusing, since everything would "use" substantially more space than expected (not to mention the differing behavior on different pools / software would be very noticeable and confusing).
Some change will need to be made eventually, as the current method will only add more confusion and inconsistencies over time. The question is whether it will be fixed in a way that is future-proof and does not need constant adjustment (leading to even further confusion).

An additional point on further thought:
1MB may be too aggressive as a default recordsize given how poorly some random workloads can behave. Moving to 1MB there would cut IOPS for that work down to 1/8th of the 128K result, which was already noticeably slow. 128K is probably the safe middle ground as a default, even if it is not ideal for either end of the spectrum.
Ah, I forgot about that. Then we could base the deflate ratio on recordsize=16M.
That would indeed make things much better, but please consider the following:
...we're 99.85% of the way there!
Currently recordsize can't go any larger without changing the structure of a block pointer, so it is less of a concern. I think more thought is required here; I've not had a chance to try to digest all of the implications. This might be a good topic for tomorrow's ZFS leadership call.
Interesting, but would there not have already been inflation with 8K/16K block sizes with the 128K base? The max percent change moving from 128K to 16M should be 10% (worst case raidz3), and that only applies to geometries that look really bad for 128K specifically. The effect of this change should do nothing beyond what was already seen with a current raidz of 1/2/4/8 data disks (perfect ratio), and 8/16K blocks would have similar parity overhead either way. For layouts suboptimal for 128K blocks, 8/16K blocks would be less impacted across those different geometries than the larger 128K blocks, as they are already consuming some multiple of the reported free space. The 'aliasing' of the 128K-base-derived ratio means that those zvols would see greater variation in free/consumed space across various geometries than they would with the ratio's base being a larger value.

...and now with the above figures translated to reflect what is happening today with respect to free space reporting 'error': the blue regions above are where file sizes can appear to be smaller than they are, by the percentages indicated. Note that there is minimal impact on the values at 8K and 16K vs. what was happening already in the prior chart. Also note that the first chart here is the result of an exercise in what is expected without assuming the 128K base for the ratio, meaning the current behavior introduces some error both above and below 128K that is not normally anticipated, even by those who have previously attempted this math. Making the proposed change will make the behavior match those prior exercises, while setting the base to 16M will produce a result which still does not precisely fit those exercises, but it is close:
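As a spot-check of the 8K/16K point (my own numbers, assuming an 8-wide raidz3 at ashift=12, with per-stripe parity and padding to a multiple of parity+1, which gives asize(128K) = 224K and asize(16M) = 26224K):

$$
\begin{aligned}
\text{asize}(8\text{K}) &= 32\text{K} \quad (2\ \text{data} + 3\ \text{parity} + 3\ \text{padding sectors})\\
\text{reported with 128K base} &= 32\text{K} \times \tfrac{128}{224} \approx 18.3\text{K} \approx 2.29\times \text{logical}\\
\text{reported with 16M base} &= 32\text{K} \times \tfrac{16384}{26224} \approx 20.0\text{K} = 2.5\times \text{logical}
\end{aligned}
$$

The small-block inflation is already present today, and moving the base from 128K to 16M changes the reported figure by roughly 9%, consistent with the ~10% worst case mentioned above.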
I think there may be a confusion about what it means for the deflate ratio to be "1". To me this would mean that no deflation occurs, i.e. all space reporting is "bytes for data + parity". So if you have RAIDZ1 on 4x 1TB disks, you would have "4TB avail", but you can write at most 3TB of (uncompressed) user data. Writing a 1TB file (uncompressed) would use at least 1.33TB. This seems counterintuitive to me, which is why we implemented the deflate ratio to begin with, ~20 years ago. But this is not consistent with your assertion that "Basing deflate_ratio on 16M brings the value so close to 1".

Are you suggesting that we instead base the deflate ratio on an idealized parity ratio, e.g. 75% data given RAIDZ1 on 4 disks? Is your point that we should ignore the specifics of recordsize, ashift, etc. and assume that we'll get the best possible theoretical ratio? E.g. showing "3TB avail" in the above example.

The effect of this would be negligibly different from basing the deflate ratio on 16MB blocks, so if this is the user-visible behavior that we want, I think we can leave this as an implementation detail to be worked out by whoever implements it (i.e. do we keep the current deflate code but with a 16MB block assumption, do we keep the deflate code but calculate the ratio slightly differently based on the ideal ratio, or do we remove the deflate code and have some different way of achieving the same result).
Parity does not come into play for the issue I am seeking to have corrected here. It comes down to how the (current) 128K base value meshes with the number of data disks in the stripe vs. how larger blocks 'fit' into those stripes.

...now if the deflate_ratio is also accounting for parity, then I missed that piece, and please substitute my prior arguments with "whatever math results in a deflate_ratio that accounts for parity but ignores any arbitrary max_recordsize base value" :)

To be extra clear, yes, parity should absolutely be accounted for in the free space calculation!
Accounting for parity is the main point of the deflate_ratio.
Understood. Thank you for clearing that point up for me. Does this mean that a raidz which began at 75%, but was then expanded, would be stuck with the same ratio?
@allanjude, thinking further on your concern about zvols: while increasing the deflate_ratio base would only minimally impact zvols beyond what the 128K value already does, perhaps merging my two proposals would help increase the free space accuracy of a raidz containing primarily zvols:
16K blocks on a zpool configured with deflate_ratio based on 2^14 would yield 'perfect' accuracy provided the pool only contained 16K blocks - just as today's pools do with 128K blocks.
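For illustration (my own arithmetic, using a hypothetical 8-wide raidz3 at ashift=12, where a 16K block allocates 32K on disk):

$$
\text{dsize} = \text{asize}(16\text{K}) \times \frac{16\text{K}}{\text{asize}(16\text{K})} = 32\text{K} \times \frac{16\text{K}}{32\text{K}} = 16\text{K}
$$

i.e. with the ratio's base matching the block size actually in use, the reported size equals the logical size exactly.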
ZFS version: zfs-2.1.2

Hey all, I'm commenting here since #13727 is closed. I am unsure if this relates to the reported draid capacity calculation issues or end-user error. Still, I'm concerned that we only see about 79% available (21% loss) for a brand-new draid array that has just been built. In short, we have an 18TB x 84-disk draid2 array: I am expecting to see approximately 1.089 PiB available after parity, but we are seeing the following reported by
The more significant issue here is that while these numbers seem to shift slightly as the pool fills, I do not see them adjusting for the underlying Lustre filesystem that is mounting
So my takeaway from this is that these availability calculations, which I hope are just estimation errors, are used by applications as (absolute?) available space. It may take a complete system reboot to update these calculations for the base storage, which is unrealistic when scaling with larger capacity arrays.
The current calculations for draid assume 128K records, and with your ashift=14 and 14d (minus 2 for parity), that comes to 192K stripes across each redundancy group. The current behavior (which I'm seeking to correct with this issue) should have you reporting only 66% of the 'expected' initial free space. If you were only storing 128K records then that would be a correct assumption, but it appears that your intent is to store larger records (is this what you meant when you said "1M stripe")?
It's expected for them to move as data is stored more or less efficiently than the assumed 128K recordsize. I'm seeking to get this corrected such that the movement is only in the downward direction, for those records stored less efficiently (e.g. smaller than the maximum possible recordsize). Another thing you should be seeing with your configuration is that files with larger recordsizes are reported as consuming less space than they actually do (e.g. an uncompressed file would show as roughly 1/3rd smaller than its actual/apparent size). That is another factor that I am seeking to correct.
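Restating the arithmetic behind the 66% and 'roughly 1/3rd smaller' figures above (my reading of the numbers used in this comment, i.e. 12 data sectors of 16K per redundancy group):

$$
12 \times 2^{14}\,\text{B} = 192\text{K}, \qquad \frac{128\text{K}}{192\text{K}} \approx 66\%, \qquad 1 - \frac{128}{192} \approx 33\%
$$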
Understood. I can't speak to verifying less vs. actual space used, but the problem is that applications using this number for "available" space are negatively impacted the further this availability estimate shifts. It doesn't matter much for a non-high-availability array of a few terabytes, but it becomes a big issue when moving into the petabyte and greater range.
Ah, OK; in that case it would likely be reporting closer to half of the expected capacity, and your larger-recordsize files should appear to be ~57% of their expected size.

Those values shift over time regardless of configuration as files of varying recordsizes / compression are stored. What application is having a hard time with this?
As previously stated above:
An update regarding used space being reported as less than actual space: after moving 40+TB to the draid array, we see discrepancies between the actual space in use and what is being reported. It is hard to get exact numbers, but the reported usage is about 10TB less than expected, so it is confirmed on this end. One additional point of clarity: the data being moved already resides in other zfs pools using a 1M recordsize, so compression has already been factored into the equation for expected vs. actual.
edit: See next comment for identified cause and proposed solution.
System information

Describe the problem you're observing

For some raidz/z2/z3 configurations and widths using `recordsize` > 128k:

- `zfs list` used/free/refer report up to 10% smaller than the sendfile size.
- `du` reports the same discrepancy.
- `compression=off`.
- `zdb -dbdbdbdbdbdb` shows file `dsize` up to 10% smaller than the total of all block lsizes.

This affects the output of `zfs list` and potentially deceives users into believing queried files / folders are up to 10% smaller than reality, and leads to additional confusion when attempting to evaluate compression effectiveness on various files or folders impacted by this issue.

Describe how to reproduce the problem
To elaborate on the symptoms, I iterated through various raidz/z2/z3 configs of varying recordsizes and widths. Initial tests showed cases where dsize was 2-3x greater than asize, so for each raidz/z2/z3 config I selected widths that gave 1 data block per stripe, 2 data blocks per stripe (for small records, forcing more overhead), and then 8, 16, 32, 64, and 128 data blocks per stripe (to most efficiently store larger records). Other pool options used: `ashift=12`, `compression=off` (default).

Each entry below represents the percent difference in `dsize` vs. the single 16M file written to the pool.

Given the above data, it appears that perhaps some `dsize` math was attempting to compensate for padding / overhead for smaller records (only my guess), but that math does not appear to be accurate for those cases, and it fails in the opposite direction for datasets with larger recordsizes.

Choosing one of the more egregious entries above (8-wide raidz3 with 16MB recordsize, storing a single 16M file), here is the output of various commands:
- Test file creation: `dd if=/dev/urandom of=testfile bs=1M count=16` (16.0M)
- `ls -l testfile`: `-rw-r--r-- 1 root root 16777216 Jan 23 13:06 test/testfile` (16.0M)
- `du -b testfile`: `16777216 test/testfile` (16.0M)
- `du -B1 testfile`: `15315456 test/testfile` (14.6M)
- `zfs list test`: (refer=14.8M)
- `zpool list test`:
- `zfs snapshot test@1; zfs send -RLPcn test@1 | grep size`: `size 16821032` (16.1M)
- `zdb -dbdbdbdbdbdb test/ 2`: (dsize=14.6M)
- testfile as viewed by Windows via SMBD: (14.6M)
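For anyone wanting to sanity-check the 14.6M figures above, here is a small standalone model (my own approximation of the raidz allocation and deflate-ratio math, not ZFS source) that reproduces them for this 8-wide raidz3, ashift=12, 16M-record case:

```c
/*
 * Rough model (not ZFS source) of why a 16M file shows ~14.6M of dsize on an
 * 8-wide raidz3 pool with ashift=12: allocation is data sectors plus
 * per-stripe parity, rounded up to a multiple of (nparity + 1); reported
 * sizes are then scaled by a deflate ratio derived from a 128K block.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t width, uint64_t nparity)
{
	uint64_t ndata = width - nparity;
	uint64_t sectors = ((psize - 1) >> ashift) + 1;             /* data sectors */
	sectors += nparity * ((sectors + ndata - 1) / ndata);        /* parity sectors */
	sectors = (sectors + nparity) / (nparity + 1) * (nparity + 1);  /* round up */
	return (sectors << ashift);
}

int
main(void)
{
	const uint64_t ashift = 12, width = 8, nparity = 3;
	uint64_t asize_128k = raidz_asize(128ULL << 10, ashift, width, nparity);
	uint64_t asize_16m = raidz_asize(16ULL << 20, ashift, width, nparity);

	/* deflate ratio: how much of an allocated 128K block is user data (~0.571) */
	double deflate = (double)(128ULL << 10) / (double)asize_128k;

	/* reported dsize for the 16M file: allocated bytes scaled by that ratio */
	printf("asize(16M) = %.2f MiB, reported dsize = %.2f MiB\n",
	    asize_16m / 1048576.0, asize_16m * deflate / 1048576.0);
	return (0);
}
```

With these inputs it prints roughly 25.61 MiB allocated and 14.63 MiB reported, in line with the `du -B1` and `zdb` dsize values shown above.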