-
Notifications
You must be signed in to change notification settings - Fork 766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce new properties for USM allocations #17539
base: sycl
Are you sure you want to change the base?
Conversation
This is an RFC for a set of new properties serving as hints to USM allocation functions. The current SYCL 2020 spec declares all allocation functions as accepting a property list argument, but so far the semantics of these values remain unspecified. We take advantage of it here by specifying a set of properties that can be optionally used by the implementation to inform the underlying allocator of specific use patterns. It's in many ways analogous to posix's `madvise` and the extant SYCL madvise extension, but since the flags are specified at allocation, it's potentially more convenient and efficient for the underlying backend to effectively utilize. In addition to the flags already supported by most madvise implementations we also introduce several flags useful on accelerators that for some workloads may enable significant performance benefits. e.g. `device_cache_noncoherent` has been seen to enable a doubling of performance in some workloads with `malloc_host`, where the accelerator and host don't need to agree on the contents of an allocation until after a `.wait()` - a very common idiom. This extension was developed for the Quicksilver nuclear simulation benchmark which significantly benefits from the `device_cache_non_coherent` flag mentioned above to a tune of ~160%/ Documentation is in doc/extensions/proposed/sycl_ext_usm_properties.asciidoc.
Following the specification for `sycl_ext_codeplay_usm_props` in DPC++ introduce flags to Unified Runtime to enable support for this feature in various backends.
Teach the CUDA adapter how to extract and pass on allocation flags for USM allocations.
Teach the HIP adapter to extract and pass on USM property flags for coherence settings on USM allocations.
The Intel OpenCL adapater supports setting a write-combining flag on SVM allocations. This can now be utilized by `sycl_ext_codeplay_usm_props`.
Use `madvise`/`posix_madvise` on Linux/POSIX systems to enable UR to support `sycl_ext_codeplay_usm_props` for USM allocations.
This seems to have hung on BMG so I cancelled CI |
#include <umf/memory_pool.h> | ||
|
||
template <typename AllocMixin, typename FlagTy = unsigned, size_t ReserveSz = 4> | ||
class FlagsMemProviderWithDefaultBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can consider extending https://github.com/intel/llvm/blob/sycl/unified-runtime/source/common/ur_pool_manager.hpp with flags and using it instead of the FlagsProvider. However, it would require some changes in how pools are created in CUDA (which might be needed anyway to fix the issue described in #17411 (comment)).
Also, having the support in pool manager, would make it easier to enable this for L0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up, this looks like it was a lot of work!
I can see that you've already implemented this, but it feels to me like the wrong solution to the problem. In addition to the minor things I've flagged as review comments, there are two pretty major things here that I don't really like.
The first is that this seems to be a very complicated solution, and the new capabilities are already provided by SYCL in simpler ways. For example, creating an allocation that is local to the device but which then transfers back to the host upon kernel completion is exactly what an accessor
does. Did your performance experiments test whether you could achieve the same/similar performance result in Quicksilver using accessors?
The second is that the changes to the memory model make me very uncomfortable. The code which consumes a pointer often has no idea how the pointer was created, and so the changes you propose here are very dangerous. Any code using atomics or fences could break in unexpected ways whenever these properties are added to an allocation, and I would expect such problems would be almost impossible to debug.
I think that you have added these memory model changes because your extension assumes that host allocations behave a certain way (i.e., you assume that anything written to a host allocation must be written directly to host memory, over PCIe or similar). I think that assumption is incorrect -- it may be true that many implementations behave that way, but SYCL has a weak coherency model, and writes to host allocations are not guaranteed to be visible until the host and device synchronize (e.g., because one atomic synchronizes with another, or because of a call to wait()
). I might be wrong, but it seems like a much simpler solution to your problem would be to flip the semantics: host allocations could be cached on the device by default, and only written directly back to host memory if a specific property is specified.
ordering constraints and improve throughput among work-item invocations in | ||
some configurations. | ||
- `non_coherent`: The host and accelerator are not required to have a coherent | ||
view of the shared data at any time. This enables the device to agressively |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
view of the shared data at any time. This enables the device to agressively | |
view of the shared data at any time. This enables the device to aggressively |
1. have a corresponding `sycl::is_property<>` class template specialization | ||
inheriting from `std::true_type` | ||
2. define an `is_property_v<> constexpr inline` global, | ||
a 'la https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a 'la https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface | |
consistent with the property interface defined in [Section 4.5.4.1](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface) |
}; | ||
---- | ||
| | ||
This allocation is very popular and is regularly accessed by the host. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "very popular" is too colloquial, and I'm not even sure I know what it means in this context.
The fact that there's an "and" here makes it seem like being popular is different to describing frequency of access. Would it be just as accurate to say "This allocation is accessed frequently by the host"?
Same comment applies to other properties below.
}; | ||
---- | ||
| | ||
Antonymous with `host_hot`: The host rarely accesses this memory and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Antonymous with `host_hot`: The host rarely accesses this memory and | |
The host rarely accesses this memory and |
I don't think this adds anything. Same comment for other properties.
it is not likely beneficial to ensure good locality to the host. is | ||
seldom used. For example it may be used to store the result of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "is seldom used" here looks like a holdover from a previous draft. I'm not sure which wording you'd prefer to use.
Same comment for other properties.
may force a weak order on the data so-written. Thus, fences may be | ||
required to ensure correct behaviour across the host/device divide. | ||
Atomic read-modify-write operations to such a buffer are undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this.
My mental model is that fences are always required to reason about the ordering of data access across host and device, so I'm not sure what is actually changing here. And if fences guarantee correct behavior, why are atomic read-modify-write operations undefined?
to generate prefetch instructions during translation such as x86's | ||
`PREFETCHNTA` which hints to speculatively bring data into the | ||
processor's cache that it is then already available when it is needed. | ||
This property does not affect correctness in all cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be read as "There are some cases where this property affects correctness". Did you mean: "This property never affects correctness."?
Same comment for other properties.
The host only reads from this allocation. This may be useful in cases | ||
where results are communicated one-way from the executing kernel | ||
context the kernel builds a result in an allocation piecemeal, which | ||
is then only on read by the host. Note that this is already present | ||
in other extensions but is added here for completeness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if somebody specifies host_read_only
and then writes to the pointer?
The host will only write to this allocation. n.b this is not a | ||
permissions bit, but a hint to the optimizer or the runtime. For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think describing this in terms of a "permissions bit" is the right terminology, because it doesn't exist anywhere else in SYCL. I think simply stating that this is a hint is enough.
avoid local caching of stores through the given pointer by writing | ||
them directly to main memory. Combined with a non-coherent property it | ||
may enable bypass of low-level caches and write around into a higher | ||
level etc. Hint only. No semantic changes for conforming programs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm calling this out here, but it really applies to the whole specification. There's a lot of inconsistency in the way that these properties are specified. Different properties say:
- "This is a hint"
- "This doesn't affect correctness"
- "Hint only"
- "No semantic changes for conforming programs"
I don't have a strong preference which wording you pick, but I think you should pick one way of saying "This is a hint that doesn't impact correctness" and use it everywhere.
Thanks for the interesting perspective John. I think there's quite a lot of Thoughts inline, below. On Thu Mar 20, 2025 at 10:18 AM GMT, John Pennycook wrote:
I agree with you in principle, but in practice both SYCL and core C++ provide Many algorithms (tree-based in particular) can be easier to implement with raw I think the current proposal is the most straightfoward Since we already have the property list interfaces available for all the SYCL All that said, I'm looking at comparing buffer/accessor performance in
Absolutely. Pointer provenance is a tricky subject, and users must know what On balance however, the performance benefits can be wild, so the tradeoff The parts of this specification that change coherence and memory model semantics
The decisions I made to get to this point are, for sure influenced by PCIe This extension provides controls for both sides of the host device split, so |
I don't disagree. What I was trying to say is that buffers and accessors provide a better abstraction of one of the behaviors you described: if a user asked me how to create an allocation on the host, move it to the device, use it exclusively on the device, then copy it back to the host, I'd tell them to use a buffer. If they also needed to build complicated pointer-based structures on the host and transfer them to the device, that's when I'd steer them to USM. If we have a real use-case that requires both, I'm supportive of adding performance hints to USM pointers. But I feel strongly that they should be hints, and be as safe to use as possible.
I think To build on
Sorry, I'll try to be more clear. I've done a lot of work on SYCL's memory model, and I may be omitting some details that aren't obvious. Please let me know if there's anything in the below that I need to elaborate on. What I meant by "flip the semantics" is:
...assuming that the backends we care about can enable their non-coherent behavior in a way that doesn't break As I said above, the SYCL memory model does not say anything about whether or not devices cache allocations. So it seems reasonable to me that our default behavior should be to allow the device to cache the memory in write-back mode, and only write the results back to the host if either:
The purpose of the property would then be to handle the case where you know you want write-through behavior (e.g., because the allocation is write-only on the device). That seems like a much simpler mental model, to me. Does that make sense? |
This is an RFC for a set of new properties serving as hints to USM
allocation functions.
The current SYCL 2020 spec declares all allocation functions as
accepting a property list argument, but so far the semantics of these
values remain unspecified. We take advantage of it here by specifying a
set of properties that can be optionally used by the implementation to
inform the underlying allocator of specific use patterns. It's in many
ways analogous to posix's
madvise
and the extant SYCL madviseextension, but since the flags are specified at allocation, it's
potentially more convenient and efficient for the underlying backend to
effectively utilize.
In addition to the flags already supported by most madvise
implementations we also introduce several flags useful on accelerators
that for some workloads may enable significant performance benefits.
e.g.
device_cache_noncoherent
has been seen to enable a doubling ofperformance in some workloads with
malloc_host
, where the acceleratorand host don't need to agree on the contents of an allocation until
after a
.wait()
- a very common idiom. This extension was developedfor the Quicksilver nuclear simulation benchmark which significantly
benefits from the
device_cache_non_coherent
flag mentioned above to atune of ~160%/
Documentation is in
doc/extensions/proposed/sycl_ext_usm_properties.asciidoc.
Each implementation in Unified runtime is its own commit, which should hopefully make reviewers' lives easier.
With this initial patchset, some support for one or two interesting properties is introduced to CUDA, HIP, Intel OpenCL, and Native CPU on POSIX.