Initial implementation for fancy devicememcpybatch #675
Conversation
Thanks a lot for your contribution, @mfbalin! This already looks great!
We believe that the generic iterator use case should get its own interface called DeviceCopy::Batched, living in cub/device/device_copy.cuh. The two can keep the common implementation, together with the specializations you have introduced; only a new interface with slightly different semantics is needed: DeviceCopy::Batched's buffer_sizes will be the number of items (instead of the number of bytes). This is just a matter of documentation. The idea is that DeviceCopy::Batched works with arbitrary iterators of iterators and will become the batched version of std::copy, whereas DeviceMemcpy::Batched is the batched version of std::memcpy.
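As a rough illustration of the distinction, here is a minimal sketch. The DeviceCopy::Batched call shown in the trailing comment is an assumption that simply mirrors the existing DeviceMemcpy::Batched API, and all variable names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#include <cub/device/device_memcpy.cuh>

// Sketch of the existing byte-oriented interface; names are placeholders.
template <typename SrcPtrIt, typename DstPtrIt, typename SizeIt>
cudaError_t memcpy_batched_example(void     *d_temp_storage,
                                   size_t   &temp_storage_bytes,
                                   SrcPtrIt  d_buffer_srcs,   // iterator of source buffer pointers
                                   DstPtrIt  d_buffer_dsts,   // iterator of destination buffer pointers
                                   SizeIt    d_buffer_sizes,  // sizes in BYTES
                                   std::uint32_t num_buffers)
{
  // Batched version of std::memcpy: copies opaque bytes between buffers.
  return cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes,
                                    d_buffer_srcs, d_buffer_dsts,
                                    d_buffer_sizes, num_buffers);
}

// Proposed DeviceCopy::Batched (as discussed above, signature assumed):
// same call shape, but sizes are counted in ITEMS and the ranges may be
// arbitrary iterators, i.e. the batched version of std::copy.
//
// cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes,
//                          d_range_srcs, d_range_dsts,
//                          d_range_sizes /* in items */, num_ranges);
```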
Could you please also add tests that make sure the new functionality works as expected? I think we should also have a test where each buffer is a fancy output iterator, like a TransformOutputIterator, if possible.
Again, thanks a lot for your contribution to this project!
That was the bug I was experiencing in my use case; thanks for pointing out that memcpy requires buffer sizes in bytes, not items. The code works now, and I will see how much performance improvement I get for my devicerunlength::decode use case. Is it not acceptable at all for DeviceMemcpy to accept fancy iterators? When the input iterators are trivial to evaluate, such as in my use case, multiple reads of the exact same location in the buffer will be free (assuming everything is inlined), and it would be perfectly reasonable to use memcpy over copy from a performance standpoint. I would imagine that for the same task in this scenario, the performance of copy can only be equal to or slower than memcpy, due to the variable size of the items being copied. If the items get too large, copy has to use a single thread for each item, unlike memcpy, which only sees bytes, reducing the amount of available parallelism. Or am I wrong, and would it always be better to use copy over memcpy for all fancy input iterator use cases?
What else would I need to modify in the assignment of work to threads when switching from byte logic to item logic to implement a DeviceCopy? Or could I just match each byte to an item, and wherever a byte is read and written in the code, read and write an item instead? P.S. That is the approach I am currently taking in the code.
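To make the performance question above concrete, here is a minimal sketch of a run-length-decode-style workload in which every source range is a thrust::constant_iterator, so repeated reads of the same position are free once inlined. The RunToSrc helper and all variable names are invented here, and the eventual batched-copy call is omitted since its API is exactly what this PR is defining.

```cpp
#include <cstdint>

#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical helper: the source range for run `run` is "the run's value, repeated".
struct RunToSrc
{
  const int *run_values;

  __host__ __device__ thrust::constant_iterator<int> operator()(std::uint32_t run) const
  {
    return thrust::constant_iterator<int>(run_values[run]);
  }
};

int main()
{
  const int h_values[]  = {7, 42, 13}; // value of each run
  const int h_lengths[] = {4, 1, 5};   // size of each range, in items

  thrust::device_vector<int> run_values(h_values, h_values + 3);
  thrust::device_vector<int> run_lengths(h_lengths, h_lengths + 3);
  thrust::device_vector<int> out(10);

  // Iterator of input iterators: src_ranges[i] is a constant_iterator over run i's value.
  auto src_ranges = thrust::make_transform_iterator(
      thrust::make_counting_iterator(std::uint32_t{0}),
      RunToSrc{thrust::raw_pointer_cast(run_values.data())});

  // Handing (src_ranges, per-run output iterators into `out`, run_lengths) to an
  // item-wise batched copy would materialize {7,7,7,7, 42, 13,13,13,13,13}.
  (void)src_ranges;
  (void)out;
  return 0;
}
```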
Force-pushed from 617853f to b5b20c8
The implementation will accept fancy iterators, it's just that the API for this use case will be different. You can take
The same is the case for the
I think there's no blanket answer here: It's reasonable to assume that it will be faster to copy whole elements of, e.g.,
Thanks, that looks exactly right to me. That's also what I had in mind. 👍
Thanks a lot for the great work! I'm really looking forward to having this land in CUB 🚀
This is the first part of the review, just minor comments. I'll try to conclude the rest of the review today.
Force-pushed from 25798e7 to a188aea
The code already seems to work. The tuning parameters will probably need to be modified to take the type size into account. I think it is ready for the next round of reviews. I will continue to work on the test.
Force-pushed from 3b3e1a1 to 4309942
Force-pushed from b181c27 to 997cc11
Great work, @mfbalin! Thanks a lot for the contribution and the excellent effort! I have left a few comments, mostly just minor ones.
Could you also please add a test where an output range is a fancy output iterator, e.g., a TransformOutputIterator that doubles the values being written. It can be a simple test, like the one we have for the 64-bit offsets. This is mostly just to ensure we didn't miss anything in the specialization and to make sure things compile fine.
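A minimal sketch of such a test setup, assuming Thrust's transform_output_iterator as the "TransformOutputIterator"; the Doubler functor and variable names are invented, and the batched-copy call itself is left out since its signature is what this PR introduces.

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_output_iterator.h>

// Hypothetical functor: every value written through the iterator is doubled.
struct Doubler
{
  __host__ __device__ int operator()(int x) const { return 2 * x; }
};

int main()
{
  thrust::device_vector<int> in(8, 21); // source items
  thrust::device_vector<int> out(8, 0); // destination storage

  // Writing v through this iterator stores 2 * v into `out`.
  auto dst = thrust::make_transform_output_iterator(out.begin(), Doubler{});

  // A batched copy of 8 items from in.begin() to `dst` should leave
  // out == {42, 42, ..., 42}; comparing against a reference copy (as the
  // 64-bit-offset test does) checks that the fancy-output-iterator path
  // compiles and produces correct results.
  (void)dst;
  return 0;
}
```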
test/test_device_batch_copy.cu
```cpp
//---------------------------------------------------------------------
// The most granular type being copied. Buffers will be aligned and their
// size will be an integer multiple of this type
using AtomicCopyT = int64_t;
```
Since this is even more generic than DeviceMemcpy::Batched, it would be great if we could test different item types, like char, int64_t, and one non-built-in type. For the non-built-in type you could use a thrust::zip_iterator to give you thrust::tuples, or a different custom type, whichever you prefer.
The test code currently includes different types such as int64_t, tuple&lt;int64_t, int32_t, int16_t, char, char&gt;, and uint8_t (at the end, for the big-buffer test). The current code uses fancy iterators for both input and output, by using a constant iterator for the input buffers and a transform output iterator (with an Identity operation) for the output.
Excellent! 🚀 Exactly, we can do the tuning policy work in a follow-up PR, if needed. I've run the existing benchmarks on the … I'll check if I can extend the existing benchmarks to cover …
When I use a fancy output iterator, the test doesn't do anything at all; do you have any idea why that may be? Problem solved: it was because the lines I copied from had different sets of arguments in the two calls to DeviceMemcpy::Batched, so the two calls calculated different amounts of temporary storage bytes, since the argument types were different.
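For anyone hitting the same issue, here is a hedged sketch of the usual two-phase temporary-storage pattern with the existing DeviceMemcpy::Batched, showing that the size-query call and the actual call must use identical argument types; the wrapper function and variable names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#include <cuda_runtime.h>
#include <cub/device/device_memcpy.cuh>

template <typename SrcIt, typename DstIt, typename SizeIt>
cudaError_t batched_memcpy(SrcIt d_srcs, DstIt d_dsts, SizeIt d_sizes, std::uint32_t num_buffers)
{
  void  *d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;

  // Phase 1: query temporary storage size (d_temp_storage == nullptr).
  cudaError_t error = cub::DeviceMemcpy::Batched(
      d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers);
  if (error != cudaSuccess) return error;

  error = cudaMalloc(&d_temp_storage, temp_storage_bytes);
  if (error != cudaSuccess) return error;

  // Phase 2: run the copy, with the *same* iterator types as in phase 1;
  // mixing types between the two calls can yield mismatched storage sizes.
  error = cub::DeviceMemcpy::Batched(
      d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers);

  cudaFree(d_temp_storage);
  return error;
}
```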
Thank you for the contribution! I've left a few minor comments below. Once those are addressed, I'll start testing.
```cpp
/**
 * @brief cub::DeviceCopy provides device-wide, parallel operations for copying data.
 * \ingroup SingleModule
```
No action required, but one can argue that this facility is currently closer to SegmentedModule. Since DeviceCopy will potentially contain non-segmented/batched methods, I think the strict separation into Single and Segmented modules in our docs should go away as part of the new documentation system.
I believe I have addressed all your reviews with the last commit. Thanks for the review; this is a learning opportunity for me to write production-quality code.
Thank you for addressing the review notes! We can start testing now. Please make sure to squash your commits before merging.
add DeviceCopy algorithm interface, using memcpy for now
switch from byte copy logic to item logic
adding a test
change template names and better static_assert msg
remove cast to pointer from output iterator
also change to using ::cuda::std
apply clang-format
address reviews
update example with runlengthdecode
update the test and fix compile error
refactor vectorized copy
fix a bug
refactor read_byte function
also fix minor comment in code snippet
fix the test, now works and passes
refactor the test a bit more
address another round of reviews
use fancy output iterator, test broken now though
replace buffers with ranges in the test
fix a typo
fix a bug in the test
add tuple copy test
also change memcpy to IsMemcpy
fix the void * issue with a level of indirection
address reviews about complex types
Force-pushed from 023b37c to 5523b9d
Thanks a lot for your contribution and addressing all the review comments, @mfbalin! We're now testing your PR.
This is my initial go at enabling fancy iterators for source buffers in the DeviceMemcpy::Batched algorithm, aiming to address #674. I didn't make any attempt at changing the byte-copy abstraction, so I expect the code to behave exactly the same way as if plain input memory buffers were passed. However, this may not be optimal, because if a user passes an expensive source iterator, it will be accessed more times than necessary.
I am wondering whether this way of enabling fancy iterators is acceptable at all, or whether I should try to completely specialize the implementation depending on the type of the items being copied. Also, the code is currently not working for my use case at all, so I would appreciate it if someone could point out anything obvious that I broke.
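To illustrate the concern about expensive source iterators, here is a small sketch, with an invented ExpensiveOp functor, of the kind of fancy input where a byte-oriented copy could re-evaluate each element once per byte read.

```cpp
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical costly per-element computation: if the copy works byte by byte,
// each of the sizeof(double) bytes of element i may re-invoke operator(), so the
// work below could be repeated up to 8 times per item.
struct ExpensiveOp
{
  __host__ __device__ double operator()(int i) const
  {
    double x = static_cast<double>(i);
    for (int k = 0; k < 64; ++k) // stands in for non-trivial work per element
      x = x * 0.999 + 1.0;
    return x;
  }
};

int main()
{
  auto src = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                             ExpensiveOp{});
  // Building source ranges from `src` keeps the byte-copy abstraction working,
  // but each item may be recomputed once per byte that the copy reads.
  (void)src;
  return 0;
}
```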