~30% performance degradation for ZFS vs. non-ZFS for large file transfer #14346
Comments
Does running |
I will re-test and let you know @ryao. Thanks for the suggestion. |
Hi @kyle0r, do you have any updates to share with us? Thanks |
Thank you for the
Unfortunately the suggestion didn't move the needle in the desired direction... The performance degradation pattern was still present (if a little different) and then dropped off a cliff. Before and after the change:
To continue trying to figure this out (process of elimination), I've made an investment in a new drive (due to arrive next week). A Seagate Exos Enterprise - 10E2400 introduced 2017-Q4? This is a 12Gb/s SAS drive, 2.4TB 10K RPM with a 16GB flash cache and CMR not SMR like in the 5TB Barracuda drives I tested in post 1. Naturally the Exos drive is in a different performance/quality class but nonetheless it will be interesting to see how the Exos drive compares in the re-run of test 1 and test 3 - to see if the new drive displays the ~30% degradation or not. With my hardware setup I don't expect to have issues mixing SATA and SAS, and in the worst case I can probably move drives around to ensure that 4 bays (single cable on the passive backplane) are used for SAS drives. Kudos to my friend Alasdair for suggesting I double check this aspect. I'll write an update once the drive arrives and I've performed some testing. Some reflections since post 1Q: Why do the ZFS tests sometimes experience a ~80% seq write performance drop and never recover? See the graph from my last comment, and (see the graph in test 4) I wonder if this similar to the issues detected by ServeTheHome with WD RED SMR drives when the Z-RAID resilvering took waaaaaay longer than expected (~9 days vs. ~14 hours). Article here. Vid here. A: I don't know yet but I'd love to see some OpenZFS devs chime in here. Could it be related to OpenZFS issue #9130? Could OpenZFS detect and do something to heal these issues as they occur? Q: What are the impacts of the 30% degradation on my use cases? Q: Could I get a performance boost by not using SMR drives for the data disks? Given they are already migrated to ZFS... A note on SMR drivesI've known for a while about SMR drives being generally a bad idea but back in 2017 options for 2.5" 5TB disks were limited and I wasn't using ZFS back then. The cost per TB and density per drive bay/slot of the Barracuda's has always been very attractive for long term storage. I certainly like the idea of following Jim Salters (@jimsalterjrs) advice on using striped mirrors and the benefits of how the performance scales so well in that configuration. Need to solve my open issues first! The future is probably flash drives anyway right?I'd welcome suggestions on 5TB+ SATA SSD's that work well with ZFS. I'm researching options for SATA SSD's to replace the 5TB Barracuda's. Right now I'm thinking its a good long term goal to cycle out the SMR spindles (HDD) with flash (SSD) to eliminate this 30% degradation issue (assuming right now its ZFS + SMR drive related), and also take advantage of the other benefits of flash media vs spindles. My use cases are typically file server and sequential workloads, and when I need lots of small IOPS I already have Optane on hand. To mitigate the performance degradation I've documented here, I guess I could consider smaller than 5TB SSD's and use more of them, with the downside that it would reduce overall storage capacity of the chassis. Not sure that would be more cost effective than investing in the larger SSD's? Its not clear when SATA based SSD's will become legacy if they aren't already and manufactures will stop launching new drives. It would be really nice (for my use cases at least) to see cost effective, fast and reliable ~5TB SSD's before SATA SSD become extinct. For my next chassis which I'm currently pricing, I'll likely be going with NVMe U.2 support and shipping the current SATA/SAS chassis for co-lo hosting to form my online off-site backup. 
It's unclear when high capacity U.2 drives will become affordable for my use cases, so for now I'll be looking at SATA drives and then upgrade to U.2 as the costs come down. High capacity SATA SSDs on my radar so far:
I also need to be careful and/or cognizant of SSDs that support Data Set Management TRIM and Deterministic Read Zero after TRIM (RZAT), per the LSI HBA KB article. I know that my rpool (boot and root ZFS mirror pool) Crucial MX500 SSDs cannot TRIM when attached to the LSI HBA because of this factor.
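For reference, a rough sketch of how this can be checked from the OS and pool side (device names below are placeholders, and the commented output lines are illustrative):

```sh
# SATA: check whether the drive advertises DSM TRIM and deterministic
# read-zero after TRIM (RZAT) in its identify data
sudo hdparm -I /dev/sdX | grep -i trim
#    *  Data Set Management TRIM supported (limit 8 blocks)
#    *  Deterministic read ZEROs after TRIM

# What the kernel block layer thinks: zero DISC-GRAN/DISC-MAX means
# no discards will be issued to that device
lsblk --discard /dev/sdX

# On the ZFS side, TRIM can be run manually or enabled automatically
sudo zpool trim rpool
sudo zpool set autotrim=on rpool
sudo zpool status -t rpool   # shows per-vdev TRIM state
```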
I cannot comment on the other things right now, but I strongly suspect that FreeNAS 11.3's ZFS driver was missing the sequential resilver code that had been merged into ZFSOnLinux a few years earlier. That should help drives to resilver faster and should reduce the impact that SMR has. |
SATA SMR vs. SAS CMR ... *FIGHT*

Disks on test:

For each test, the XFS file system was provisioned with 2000 GiB, which is roughly 10% less than the maximum capacity of the disk. Note that this is smaller than the tests in post 1 because of the smaller disk size, but everything else about the tests remains the same. It is expected that the src-to-dst rsync jobs will fail due to lack of space for the 2.61TiB snapraid parity file. Nevertheless, the test approach should still provide good comparisons, as the problems detected will have presented themselves before the tests run out of space. The same src disk and snapraid parity file were used in the test re-runs as in the post 1 tests.

Recap on what's being compared
re-run of
My suggestion is to set recordsize=1M, which should mitigate the negative effects of SMR by minimizing the amount of RMW needed to do writes. Also, if you do not need access time (atime) updates, which you likely do not, set atime=off. |
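In practical terms the suggestion amounts to something like the following, applied to the dataset that holds the raw disk image (the dataset name is a placeholder; note that recordsize only affects blocks written after the change):

```sh
zfs set recordsize=1M tank/images   # larger records -> fewer read-modify-write cycles on SMR
zfs set atime=off tank/images       # stop updating access times on every read
zfs get recordsize,atime tank/images
```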
@ryao wrote:
Acknowledged - I'll give both options a test on one of the SATA SMR drives and see if it helps. I did try testing with recordsize=256K (the same as the snapraid parity default block size), but that didn't seem to help. AFAIK for snapraid parity drives, which store a single large file, atime=off should be fine. FYI, I am familiar with RMW topics per here. I'm not an expert in understanding the behaviour 100% yet, though. Any comments on the weird read IO on the SAS drive? |
It is hard to say without data on the number of outstanding IOs to go with that chart. That said, I feel like it is somewhat related to this: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html There were also some fixes done recently in master to improve prefetch behavior that might help. |
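For anyone who wants to gather that kind of data, the per-vdev queue occupancy and the ZIO scheduler limits referenced in that doc can be inspected roughly like this (the pool name is a placeholder):

```sh
# Pending/active I/O counts per ZIO class, per vdev, refreshed every second
zpool iostat -q tank 1

# Current ZIO scheduler limits exposed as module parameters
grep . /sys/module/zfs/parameters/zfs_vdev_*_max_active
```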
For zfs, I would stay away from SMR drives. I still remember having trouble doing "zpool replace" on a single-SMR-disk pool. It never completed and degraded the pool! I ended up just dd'ing it to a CMR drive. |
I think ZFS is not suitable for SSDs at all. Keep in mind insane write amplification. |
You can think whatever you want. ZFS has a lot of functionality that comes at a certain cost. But it is not insane. On the other side, ZFS tries to write sequentially, aggregates small I/Os into bigger ones and supports TRIM/UNMAP -- it does what it can to help SSDs. And it can be fast too -- I've given presentations where ZFS was doing 20GB/s of read/write or 30GB/s of scrub. Just don't ask it to do the impossible, like multiple millions of random 4KB writes -- functionality does have a cost and overhead.
There is a reason why we have a separate hardware qualification team -- even with enterprise market devices there are plenty of issues. It goes 10x in the cheap consumer market. At the very least don't buy cheap, even though that is not a guarantee either. |
And users still plan to build ZFS on SSD without knowing that cost. There are no warnings in the docs about that.
By writing them into a ZIL that is placed on the same SSD? This feature is about performance, not about helping SSD endurance.
Yes. But sometimes the cost becomes unbearable. |
ZIL is used only when an application calls fsync(). Many applications do not require fsync(). If fsync() is needed for data safety in a specific application -- there is a cost. And even then, if writes are big enough and there is no SLOG, then data blocks are written only once and just referenced by the ZIL. If sync writes are a critical part of the workload and the main pool devices can't sustain them, then add a SLOG that can, like NVDIMM, a write-optimized SSD, etc. Conversely, if you don't care, you can always set sync=disabled for a specific dataset, disabling the ZIL completely. |
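For completeness, the two knobs mentioned above look roughly like this (pool, dataset and device names are placeholders):

```sh
# Add a dedicated SLOG device so synchronous writes land on fast media
zpool add tank log /dev/nvme0n1

# Or, where losing the last few seconds of writes after a crash is acceptable,
# disable the ZIL for a specific dataset
zfs set sync=disabled tank/scratch
```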
System information
Describe the problem you're observing
I was migrating XFS disk partitions from full disk (no zfs) to XFS raw disk images stored on zfs datasets... both virtual disks provisioned to a kvm via virtio_blk.
I spotted some strange performance issues and ended up doing a bunch of testing/research to try and figure it out.
In my testing I'm witnessing a fairly remarkable and repeatable ~30% performance degradation pattern visible in the netdata graphs, and transfer runtime. IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to tests without ZFS.
This ~30% degradation applies to ZFS single disk, mirror or striped pool. I tested all 3 configs.
Scenario: My testing is primarily measuring the transfer (seq write) and checksumming (seq read) of a 2.61TiB snapraid parity file between two disks. It's a single large file stored on the XFS file system. The tests are running inside a kvm. The physical disks are 5TB 2.5” 5600 rpm SMR Seagate Barracudas. For ZFS tests, the XFS file system is stored on a raw disk image on a zfs dataset and provisioned to the kvm via virtio_blk.
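For clarity, the ZFS-backed variant of that stack can be pictured roughly like this (names and sizes are placeholders rather than my exact commands; the full details are in the linked research):

```sh
# Dataset that holds the raw disk image
zfs create tank/images

# Sparse raw image file that backs the guest's virtual disk
truncate -s 2000G /tank/images/parity-dst.raw

# Attach the image to the KVM guest as a virtio_blk device, e.g.
virsh attach-disk guest1 /tank/images/parity-dst.raw vdb \
  --driver qemu --subdriver raw --persistent

# Inside the guest, format the virtual disk with XFS
mkfs.xfs /dev/vdb
```

The non-ZFS baseline is the same guest and XFS filesystem, but backed by a whole-disk partition instead of a raw image on a dataset.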
I’m suspicious the root cause of the degradation could be related to #9130, because the issue reproduced in test #4, but that might also be a red herring.
Here are some graphs for illustration of the degradation:
write to xfs partition - no zfs:

write to xfs partition stored on zfs dataset:

read from xfs partition - no zfs:

checksum read from partition stored on zfs dataset:

I've published my related research here. I've shared all the details, raw results and graphs there. It feels like too much information to re-host in this issue.
Overall conclusion(s)
- Test #1 was the best performing OpenZFS result and all attempts to improve the results were unsuccessful. 😢
- IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to test #3 without ZFS.
- Test #10's 2x striped zpool was still slower than the single-disk non-ZFS test #3.
- Test #3 demonstrates the kvm virtio_blk can handle at least 121 and 125 MiB/s seq writes and reads on these disks, i.e. kvm and virtio_blk overhead is unlikely to be causing the performance degradation or bottlenecks.
- Test #15 demonstrates that virtio_scsi is not faster than virtio_blk and likely has more overhead.
- Test #4: having met this issue in the past, I'm suspicious the root cause of "task txg_sync blocked for more than 120 seconds" (#9130) may be related to and/or causing the IO degradation observed in these tests.
- Test #10 demonstrated fairly consistent IO bandwidth on both disks during the read and write tests. The obvious degradation in the single vdev tests was not visible in the graphs - however, overall the performance was still ~30% under the physical maximums of the disks.
  Question: why does the IO degradation pattern seem to disappear in this test but not others?
  Question: why is there a ~30% overhead for ZFS?
- Test #1 (parity file 1 of 3) and Test #17 (parity file 2 of 3) suggest the issue is not specific to one file.

Test result summary
Describe how to reproduce the problem
Transfer (e.g. rsync) a >2TiB file to a virtual disk provisioned from a raw disk image on a zfs dataset via virtio_blk. Details of my hardware, config, test cases and commands can be found in my research here.
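A condensed sketch of the two phases (paths are placeholders and the checksum tool is illustrative; the source file is the 2.61TiB snapraid parity file described above):

```sh
# Write phase: sequential transfer onto the XFS filesystem that sits on the
# ZFS-backed virtio_blk disk
rsync --progress /mnt/src-disk/snapraid.parity /mnt/dst-disk/

# Read phase: sequential checksum (read-back) of the transferred file
md5sum /mnt/dst-disk/snapraid.parity
```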
Include any warning/errors/backtraces from the system logs
Test #4 reproduced OpenZFS issue #9130 “task txg_sync blocked for more than 120 seconds”. Here are the related logs:

For discussion/illustration - IO Flow without and with OpenZFS