Initial implementation for fancy devicememcpybatch #675
Conversation
Thanks a lot for your contribution, @mfbalin! This already looks great!
We believe that the generic iterator use case should get its own interface called DeviceCopy::Batched, living in cub/device/device_copy.cuh. The two can keep the common implementation, together with the specializations you have introduced; only a new interface with slightly different semantics is needed: DeviceCopy::Batched's buffer_sizes will be the number of items (instead of the number of bytes). This is just a matter of documentation. The idea is that DeviceCopy::Batched works with arbitrary iterators of iterators and will become the batched version of std::copy, whereas DeviceMemcpy::Batched is the batched version of std::memcpy.
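As a rough illustration of the distinction, here is a minimal sketch. The DeviceCopy::Batched call shown in the trailing comment is an assumption that simply mirrors the existing DeviceMemcpy::Batched API, and all variable names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#include <cub/device/device_memcpy.cuh>

// Sketch of the existing byte-oriented interface; names are placeholders.
template <typename SrcPtrIt, typename DstPtrIt, typename SizeIt>
cudaError_t memcpy_batched_example(void     *d_temp_storage,
                                   size_t   &temp_storage_bytes,
                                   SrcPtrIt  d_buffer_srcs,   // iterator of source buffer pointers
                                   DstPtrIt  d_buffer_dsts,   // iterator of destination buffer pointers
                                   SizeIt    d_buffer_sizes,  // sizes in BYTES
                                   std::uint32_t num_buffers)
{
  // Batched version of std::memcpy: copies opaque bytes between buffers.
  return cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes,
                                    d_buffer_srcs, d_buffer_dsts,
                                    d_buffer_sizes, num_buffers);
}

// Proposed DeviceCopy::Batched (as discussed above, signature assumed):
// same call shape, but sizes are counted in ITEMS and the ranges may be
// arbitrary iterators, i.e. the batched version of std::copy.
//
// cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes,
//                          d_range_srcs, d_range_dsts,
//                          d_range_sizes /* in items */, num_ranges);
```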
Could you please also add tests that make sure the new functionality works as expected? I think we should also have a test where each buffer is a fancy output iterator, like a TransformOutputIterator, if possible.
Again, thanks a lot for your contribution to this project!
That was the bug I was experiencing in my use case; thanks for pointing out that memcpy requires buffer sizes in bytes, not items. The code works now, and I will see how much performance improvement I get for my devicerunlength::decode use case. Is it not acceptable at all for DeviceMemcpy to accept fancy iterators? When the input iterators are trivial to evaluate, such as in my use case, multiple reads of the exact same location in the buffer will be free (assuming everything is inlined), and it would be perfectly reasonable to use memcpy over copy from a performance standpoint. I would imagine that for the same task in this scenario, the performance of copy can only be equal to or slower than memcpy, due to the variable size of the items being copied. If the items get too large, copy has to use a single thread for each item, unlike memcpy, which only sees bytes, reducing the amount of available parallelism. Or am I wrong, and would it always be better to use copy over memcpy for all fancy input iterator use cases?
What else would I need to modify in the assignment of work to threads when switching from byte logic to item logic to implement a DeviceCopy? Or could I just match each byte to an item, and wherever a byte is read and written in the code, read and write an item instead? P.S. That is the approach I am currently taking in the code.
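To make the performance question above concrete, here is a minimal sketch of a run-length-decode-style workload in which every source range is a thrust::constant_iterator, so repeated reads of the same position are free once inlined. The RunToSrc helper and all variable names are invented here, and the eventual batched-copy call is omitted since its API is exactly what this PR is defining.

```cpp
#include <cstdint>

#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical helper: the source range for run `run` is "the run's value, repeated".
struct RunToSrc
{
  const int *run_values;

  __host__ __device__ thrust::constant_iterator<int> operator()(std::uint32_t run) const
  {
    return thrust::constant_iterator<int>(run_values[run]);
  }
};

int main()
{
  const int h_values[]  = {7, 42, 13}; // value of each run
  const int h_lengths[] = {4, 1, 5};   // size of each range, in items

  thrust::device_vector<int> run_values(h_values, h_values + 3);
  thrust::device_vector<int> run_lengths(h_lengths, h_lengths + 3);
  thrust::device_vector<int> out(10);

  // Iterator of input iterators: src_ranges[i] is a constant_iterator over run i's value.
  auto src_ranges = thrust::make_transform_iterator(
      thrust::make_counting_iterator(std::uint32_t{0}),
      RunToSrc{thrust::raw_pointer_cast(run_values.data())});

  // Handing (src_ranges, per-run output iterators into `out`, run_lengths) to an
  // item-wise batched copy would materialize {7,7,7,7, 42, 13,13,13,13,13}.
  (void)src_ranges;
  (void)out;
  return 0;
}
```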
Force-pushed from 617853f to b5b20c8
The implementation will accept fancy iterators, it's just that the API for this use case will be different. You can take
The same is the case for the
I think there's no blanket answer here: It's reasonable to assume that it will be faster to copy whole elements of, e.g.,
Thanks, that looks exactly right to me. That's also what I had in mind. 👍
Thanks a lot for the great work! I'm really looking forward to having this land in CUB 🚀
This is the first part of the review, just minor comments. I'll try to conclude the rest of the review today.
Force-pushed from 25798e7 to a188aea
The code already seems to work. The tuning parameters will probably need to be modified to take the type size into account. I think it is ready for the next round of reviews. I will continue to work on the test.
Force-pushed from 3b3e1a1 to 4309942
Force-pushed from b181c27 to 997cc11
Great work, @mfbalin! Thanks a lot for the contribution and the excellent effort! I have left a few comments, mostly just minor ones.
Could you also please add a test where an output range is a fancy output iterator, e.g., a TransformOutputIterator that doubles the values being written. It can be a simple test, like the one we have for the 64-bit offsets. This is mostly just to ensure we didn't miss anything in the specialization and to make sure things compile fine.
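A minimal sketch of such a test setup, assuming Thrust's transform_output_iterator as the "TransformOutputIterator"; the Doubler functor and variable names are invented, and the batched-copy call itself is left out since its signature is what this PR introduces.

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_output_iterator.h>

// Hypothetical functor: every value written through the iterator is doubled.
struct Doubler
{
  __host__ __device__ int operator()(int x) const { return 2 * x; }
};

int main()
{
  thrust::device_vector<int> in(8, 21); // source items
  thrust::device_vector<int> out(8, 0); // destination storage

  // Writing v through this iterator stores 2 * v into `out`.
  auto dst = thrust::make_transform_output_iterator(out.begin(), Doubler{});

  // A batched copy of 8 items from in.begin() to `dst` should leave
  // out == {42, 42, ..., 42}; comparing against a reference copy (as the
  // 64-bit-offset test does) checks that the fancy-output-iterator path
  // compiles and produces correct results.
  (void)dst;
  return 0;
}
```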
test/test_device_batch_copy.cu
```cpp
//---------------------------------------------------------------------
// The most granular type being copied. Buffers will be aligned and their
// size will be an integer multiple of this type
using AtomicCopyT = int64_t;
```
Since this is even more generic than DeviceMemcpy::Batched, it would be great if we could test different item types, like char, int64_t, and one non-built-in type. For the non-built-in type you could use a thrust::zip_iterator to give you thrust::tuples, or a different custom type, whichever you prefer.
The test code currently includes different types such as int64_t, tuple&lt;int64_t, int32_t, int16_t, char, char&gt;, and uint8_t (at the end, for the big-buffer test). The current code uses fancy iterators for both input and output, by using a constant iterator for the input buffers and a transform output iterator (with an Identity operation) for the output.
Excellent! 🚀 Exactly, we can do the tuning policy work in a follow-up PR, if needed. I've run the existing benchmarks on the … I'll check if I can extend the existing benchmarks to cover …
When I use a fancy output iterator, the test doesn't do anything at all; do you have any idea why that may be? Problem solved: it was because the lines I copied from had different sets of arguments in the two calls to DeviceMemcpy::Batched, so the two calls calculated different amounts of temporary storage bytes, since the argument types were different.
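For anyone hitting the same issue, here is a hedged sketch of the usual two-phase temporary-storage pattern with the existing DeviceMemcpy::Batched, showing that the size-query call and the actual call must use identical argument types; the wrapper function and variable names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#include <cuda_runtime.h>
#include <cub/device/device_memcpy.cuh>

template <typename SrcIt, typename DstIt, typename SizeIt>
cudaError_t batched_memcpy(SrcIt d_srcs, DstIt d_dsts, SizeIt d_sizes, std::uint32_t num_buffers)
{
  void  *d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;

  // Phase 1: query temporary storage size (d_temp_storage == nullptr).
  cudaError_t error = cub::DeviceMemcpy::Batched(
      d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers);
  if (error != cudaSuccess) return error;

  error = cudaMalloc(&d_temp_storage, temp_storage_bytes);
  if (error != cudaSuccess) return error;

  // Phase 2: run the copy, with the *same* iterator types as in phase 1;
  // mixing types between the two calls can yield mismatched storage sizes.
  error = cub::DeviceMemcpy::Batched(
      d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers);

  cudaFree(d_temp_storage);
  return error;
}
```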
Thank you for the contribution! I've left a few minor comments below. Once those are addressed, I'll start testing.
```cpp
/**
 * @brief cub::DeviceCopy provides device-wide, parallel operations for copying data.
 * \ingroup SingleModule
```
No action required, but one can argue that this facility is currently closer to SegmentedModule. Since DeviceCopy will potentially contain non-segmented/batched methods, I think the strict separation into Single and Segmented modules in our docs should go away as part of the new documentation system.
I believe I have addressed all your reviews with the last commit. Thanks for the review; this is a learning opportunity for me to write production-quality code.
Thank you for addressing the review notes! We can start testing now. Please make sure to squash your commits before merging.
add DeviceCopy algorithm interface, using memcpy for now
switch from byte copy logic to item logic
adding a test
change template names and better static_assert msg
remove cast to pointer from output iterator
also change to using ::cuda::std
apply clang-format
address reviews
update example with runlengthdecode
update the test and fix compile error
refactor vectorized copy
fix a bug
refactor read_byte function
also fix minor comment in code snippet
fix the test, now works and passes
refactor the test a bit more
address another round of reviews
use fancy output iterator, test broken now though
replace buffers with ranges in the test
fix a typo
fix a bug in the test
add tuple copy test
also change memcpy to IsMemcpy
fix the void * issue with a level of indirection
address reviews about complex types
Force-pushed from 023b37c to 5523b9d
Thanks a lot for your contribution and addressing all the review comments, @mfbalin! We're now testing your PR.
This is my initial go at enabling fancy iterators for source buffers in the DeviceMemcpy::Batched algorithm, aiming to address #674. I didn't make any attempt at changing the byte-copy abstraction, so I expect the code to behave exactly the same way as if plain input memory buffers were passed. However, this may not be optimal, because if a user passes an expensive source iterator, it will be accessed more times than necessary.
I am wondering whether this way of enabling fancy iterators is acceptable at all, or whether I should try to completely specialize the implementation depending on the type of the items being copied. Also, the code is currently not working for my use case at all, so I would appreciate it if someone could point out anything obvious that I broke.
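To illustrate the concern about expensive source iterators, here is a small sketch, with an invented ExpensiveOp functor, of the kind of fancy input where a byte-oriented copy could re-evaluate each element once per byte read.

```cpp
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical costly per-element computation: if the copy works byte by byte,
// each of the sizeof(double) bytes of element i may re-invoke operator(), so the
// work below could be repeated up to 8 times per item.
struct ExpensiveOp
{
  __host__ __device__ double operator()(int i) const
  {
    double x = static_cast<double>(i);
    for (int k = 0; k < 64; ++k) // stands in for non-trivial work per element
      x = x * 0.999 + 1.0;
    return x;
  }
};

int main()
{
  auto src = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                             ExpensiveOp{});
  // Building source ranges from `src` keeps the byte-copy abstraction working,
  // but each item may be recomputed once per byte that the copy reads.
  (void)src;
  return 0;
}
```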