
Introduce new properties for USM allocations #17539

Open
wants to merge 6 commits into base: sycl

Conversation

ldrumm
Contributor

@ldrumm ldrumm commented Mar 19, 2025

This is an RFC for a set of new properties serving as hints to USM
allocation functions.

The current SYCL 2020 spec declares all allocation functions as
accepting a property-list argument, but so far the semantics of these
values remain unspecified. We take advantage of this here by specifying a
set of properties that the implementation can optionally use to
inform the underlying allocator of specific use patterns. This is in many
ways analogous to POSIX's `madvise` and the extant SYCL madvise
extension, but since the flags are specified at allocation time, it is
potentially more convenient and efficient for the underlying backend to
utilize them effectively.

In addition to the flags already supported by most `madvise`
implementations, we also introduce several flags useful on accelerators
that for some workloads may enable significant performance benefits.
For example, `device_cache_noncoherent` has been seen to double
performance in some workloads with `malloc_host`, where the accelerator
and host don't need to agree on the contents of an allocation until
after a `.wait()` - a very common idiom. This extension was developed
for the Quicksilver nuclear simulation benchmark, which benefits
significantly from the flag mentioned above, to the tune of ~160%.

Documentation is in
doc/extensions/proposed/sycl_ext_usm_properties.asciidoc.

Each implementation in Unified Runtime is its own commit, which should hopefully make reviewers' lives easier.

With this initial patchset, some support for one or two interesting properties is introduced to CUDA, HIP, Intel OpenCL, and Native CPU on POSIX.

ldrumm added 6 commits March 19, 2025 17:12
Following the specification for `sycl_ext_codeplay_usm_props` in DPC++
introduce flags to Unified Runtime to enable support for this feature in
various backends.
Teach the CUDA adapter how to extract and pass on allocation flags for
USM allocations.
Teach the HIP adapter to extract and pass on USM property flags for
coherence settings on USM allocations.
The Intel OpenCL adapter supports setting a write-combining flag on SVM
allocations. This can now be utilized by `sycl_ext_codeplay_usm_props`.
Use `madvise`/`posix_madvise` on Linux/POSIX systems to enable UR to
support `sycl_ext_codeplay_usm_props` for USM allocations.
@ldrumm ldrumm requested review from a team as code owners March 19, 2025 17:22
@sarnex
Contributor

sarnex commented Mar 19, 2025

This seems to have hung on BMG, so I cancelled CI.

#include <umf/memory_pool.h>

template <typename AllocMixin, typename FlagTy = unsigned, size_t ReserveSz = 4>
class FlagsMemProviderWithDefaultBase {
Member

@igchor igchor Mar 20, 2025

You can consider extending https://github.com/intel/llvm/blob/sycl/unified-runtime/source/common/ur_pool_manager.hpp with flags and using it instead of the FlagsProvider. However, it would require some changes in how pools are created in CUDA (which might be needed anyway to fix the issue described in #17411 (comment)).

Also, having this support in the pool manager would make it easier to enable this for L0.

Contributor

@Pennycook Pennycook left a comment

Thanks for writing this up, this looks like it was a lot of work!

I can see that you've already implemented this, but it feels to me like the wrong solution to the problem. In addition to the minor things I've flagged as review comments, there are two pretty major things here that I don't really like.

The first is that this seems to be a very complicated solution, and the new capabilities are already provided by SYCL in simpler ways. For example, creating an allocation that is local to the device but which then transfers back to the host upon kernel completion is exactly what an accessor does. Did your performance experiments test whether you could achieve the same/similar performance result in Quicksilver using accessors?

The second is that the changes to the memory model make me very uncomfortable. The code which consumes a pointer often has no idea how the pointer was created, and so the changes you propose here are very dangerous. Any code using atomics or fences could break in unexpected ways whenever these properties are added to an allocation, and I would expect such problems would be almost impossible to debug.

I think that you have added these memory model changes because your extension assumes that host allocations behave a certain way (i.e., you assume that anything written to a host allocation must be written directly to host memory, over PCIe or similar). I think that assumption is incorrect -- it may be true that many implementations behave that way, but SYCL has a weak coherency model, and writes to host allocations are not guaranteed to be visible until the host and device synchronize (e.g., because one atomic synchronizes with another, or because of a call to wait()). I might be wrong, but it seems like a much simpler solution to your problem would be to flip the semantics: host allocations could be cached on the device by default, and only written directly back to host memory if a specific property is specified.

ordering constraints and improve throughput among work-item invocations in
some configurations.
- `non_coherent`: The host and accelerator are not required to have a coherent
view of the shared data at any time. This enables the device to agressively

Suggested change
view of the shared data at any time. This enables the device to agressively
view of the shared data at any time. This enables the device to aggressively

1. have a corresponding `sycl::is_property<>` class template specialization
inheriting from `std::true_type`
2. define an `is_property_v<> constexpr inline` global,
a 'la https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface

Suggested change
a 'la https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface
consistent with the property interface defined in [Section 4.5.4.1](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.propertyinterface)

};
----
|
This allocation is very popular and is regularly accessed by the host.

I think "very popular" is too colloquial, and I'm not even sure I know what it means in this context.

The fact that there's an "and" here makes it seem like being popular is different to describing frequency of access. Would it be just as accurate to say "This allocation is accessed frequently by the host"?

Same comment applies to other properties below.

};
----
|
Antonymous with `host_hot`: The host rarely accesses this memory and

Suggested change
Antonymous with `host_hot`: The host rarely accesses this memory and
The host rarely accesses this memory and

I don't think this adds anything. Same comment for other properties.

Comment on lines +368 to +369
it is not likely beneficial to ensure good locality to the host. is
seldom used. For example it may be used to store the result of a

The "is seldom used" here looks like a holdover from a previous draft. I'm not sure which wording you'd prefer to use.

Same comment for other properties.

Comment on lines +464 to +466
may force a weak order on the data so-written. Thus, fences may be
required to ensure correct behaviour across the host/device divide.
Atomic read-modify-write operations to such a buffer are undefined.

I don't understand this.

My mental model is that fences are always required to reason about the ordering of data access across host and device, so I'm not sure what is actually changing here. And if fences guarantee correct behavior, why are atomic read-modify-write operations undefined?

to generate prefetch instructions during translation such as x86's
`PREFETCHNTA` which hints to speculatively bring data into the
processor's cache that it is then already available when it is needed.
This property does not affect correctness in all cases.

This could be read as "There are some cases where this property affects correctness". Did you mean: "This property never affects correctness."?

Same comment for other properties.

Comment on lines +587 to +591
The host only reads from this allocation. This may be useful in cases
where results are communicated one-way from the executing kernel
context the kernel builds a result in an allocation piecemeal, which
is then only on read by the host. Note that this is already present
in other extensions but is added here for completeness

What happens if somebody specifies host_read_only and then writes to the pointer?

Comment on lines +621 to +622
The host will only write to this allocation. n.b this is not a
permissions bit, but a hint to the optimizer or the runtime. For

I don't think describing this in terms of a "permissions bit" is the right terminology, because it doesn't exist anywhere else in SYCL. I think simply stating that this is a hint is enough.

avoid local caching of stores through the given pointer by writing
them directly to main memory. Combined with a non-coherent property it
may enable bypass of low-level caches and write around into a higher
level etc. Hint only. No semantic changes for conforming programs

I'm calling this out here, but it really applies to the whole specification. There's a lot of inconsistency in the way that these properties are specified. Different properties say:

  • "This is a hint"
  • "This doesn't affect correctness"
  • "Hint only"
  • "No semantic changes for conforming programs"

I don't have a strong preference which wording you pick, but I think you should pick one way of saying "This is a hint that doesn't impact correctness" and use it everywhere.

@ldrumm
Contributor Author

ldrumm commented Mar 24, 2025

Thanks for the interesting perspective, John. I think there are quite a lot of
useful ideas for me to unpack, and a few iterations needed to get this right, but I
should first clarify that Quicksilver is perhaps a red herring. The interfaces
here are designed to enable hints for all USM users - of which there are many.

Thoughts inline, below.

On Thu Mar 20, 2025 at 10:18 AM GMT, John Pennycook wrote:

@Pennycook requested changes on this pull request.

Thanks for writing this up, this looks like it was a lot of work!

I can see that you've already implemented this, but it feels to me like
the wrong solution to the problem. In addition to the minor things I've
flagged as review comments, there are two pretty major things here that
I don't really like.

The first is that this seems to be a very complicated solution, and the
new capabilities are already provided by SYCL in simpler ways. For
example, creating an allocation that is local to the device but which
then transfers back to the host upon kernel completion is exactly what
an accessor does. Did your performance experiments test whether you
could achieve the same/similar performance result in Quicksilver using
accessors?

I agree with you in principle, but in practice both SYCL and core C++ provide
multiple incompatible ways to handle memory allocation and access, and users are
going to use them all. I personally feel that buffers and accessors are a better
model for a host/device split in many cases, but - since SYCL already provides
pointer-based USM - it's my responsibility as implementer to make that feature as
performant as possible. "Just change your code" is a reasonable proposition for
specific cases and projects, but unless you suggest we deprecate USM from SYCL
entirely, I think it's still of net benefit to the user to expose knobs that may
improve performance where users know more about their access patterns than is
expressible in the language proper.

Many algorithms (tree-based in particular) can be easier to implement with raw
pointers. Users migrating from CUDA, HIP, OpenCL with SVM extensions,
Apple Metal, and HSA may miss the fine control of memory subsystems that these
APIs provide.

I think the current proposal is the most straightforward way to provide such
access using existing interfaces and types, in a way that doesn't modify the core
SYCL specification.

Since we already have the property-list interfaces available for all the SYCL
USM allocation functions, I don't believe this is actually at all complicated
for the user. For the maintainer, support is opt-in, since these are
all optional hints (you rightly point out some inconsistencies in the language I
use to describe that). The complexity of the implementation here is mostly an
artefact of the multiple pieces of plumbing required to wire this up to the new
memory-pool systems in Unified Runtime's multiple adapters.

All that said, I'm looking at comparing buffer/accessor performance in
Quicksilver and I'll report back.

The second is that the changes to the memory model make me very
uncomfortable. The code which consumes a pointer often has no idea how
the pointer was created, and so the changes you propose here are very
dangerous. Any code using atomics or fences could break in unexpected
ways whenever these properties are added to an allocation, and I would
expect such problems would be almost impossible to debug.

Absolutely. Pointer provenance is a tricky subject, and users must know what
they're doing. C99's `restrict` keyword has this problem, since there's nothing
stopping you from passing the same pointer twice to interfaces whose parameters
are `restrict`-qualified - or vice versa. See, for example, memcpy-vs-memmove
bugs.

On balance however, the performance benefits can be wild, so the tradeoff
becomes worth it for the users who need it.

The parts of this specification that change coherence and memory model semantics
could do with some refinement, for sure. I'm with you that changing the memory
model of SYCL is potentially troublesome, and will take all the help I can get
to ensure this is worded in such a way that we guard against footguns, and
enable the user.

I think that you have added these memory model changes because your
extension assumes that host allocations behave a certain way (i.e., you
assume that anything written to a host allocation must be written
directly to host memory, over PCIe or similar). I think that assumption
is incorrect -- it may be true that many implementations behave that
way, but SYCL has a weak coherency model, and writes to host allocations
are not guaranteed to be visible until the host and device synchronize
(e.g., because one atomic synchronizes with another, or because of a
call to wait()). I might be wrong, but it seems like a much simpler
solution to your problem would be to flip the semantics: host
allocations could be cached on the device by default, and only written
directly back to host memory if a specific property is specified.

The decisions I made to get to this point are, for sure, influenced by PCIe
transaction costs, but I didn't assume PCIe. I made a conscious effort to
consider what effect or benefit these flags might have on architectures or
memory-subsystem designs other than a grey x86 box with an NVIDIA/AMD GPU (e.g. see my
comment about Harvard DSPs). As you mention, however, the reality is that a very
large number of current SYCL accelerators in the wild take the form of a PCIe
accelerator in a standard PC/server.

This extension provides controls for both sides of the host/device split, so
I'm not sure what you mean about flipping semantics.

@Pennycook
Contributor

"Just change your code" is a reasonable proposition for specific cases and projects, but unless you suggest we deprecate USM from SYCL entirely, I think it's still of net benefit to the user to expose knobs that may improve the performance where users know more about their access patterns than is expressible in the language proper.
...
Many algorithms (tree-based in particular) can be easier to implement with raw pointers. Users migrating from CUDA, HIP, OpenCL with SVM extensions, Apple Metal, and HSA may miss the fine control of memory subsystems that these APIs provide.

I don't disagree.

What I was trying to say is that buffers and accessors provide a better abstraction of one of the behaviors you described: if a user asked me how to create an allocation on the host, move it to the device, use it exclusively on the device, then copy it back to the host, I'd tell them to use a buffer. If they also needed to build complicated pointer-based structures on the host and transfer them to the device, that's when I'd steer them to USM.

If we have a real use-case that requires both, I'm supportive of adding performance hints to USM pointers. But I feel strongly that they should be hints, and be as safe to use as possible.

Absolutely. Pointer provenance is a tricky subject, and users must know what they're doing. C99's restrict keyword has this problem since there's nothing stopping you passing restrict-qualified pointers in duplicate to interfaces that require otherwise - or vice-versa. See for example memcpy vs memmove bugs.

I think restrict is a good example, but it's much safer than what you're proposing here. If a developer declares that a function accepts restrict pointers, then there's a clear contract, and it's the caller's responsibility to make sure they satisfy it.

To build on restrict as an analogy, what you're proposing would be like a malloc variant that guarantees it will only ever be accessed via the pointer it returns, and that no other pointer will ever point to an address in that allocation. There'd be no way for a user to know if any pointer in the code was safe, and all sorts of libraries would break.

This extension provides controls for both sides of the host device split, so I'm not sure what you mean about flipping semantics.

Sorry, I'll try to be more clear. I've done a lot of work on SYCL's memory model, and I may be omitting some details that aren't obvious. Please let me know if there's anything in the below that I need to elaborate on.

What I meant by "flip the semantics" is:

  • Clarify the default behavior of host allocations in DPC++, and allow implementations to cache them;
  • Remove the non-coherent property (because it would be the default behavior); and
  • Focus on providing a write-through property to hint that the device should not cache writes.

...assuming that the backends we care about can enable their non-coherent behavior in a way that doesn't break sycl::atomic_ref. (I'm not aware of any backends that enable you to declare that individual memory doesn't support atomics, but I might be missing something.)

As I said above, the SYCL memory model does not say anything about whether or not devices cache allocations. So it seems reasonable to me that our default behavior should be to allow the device to cache the memory in write-back mode, and only write the results back to the host if either:

  • The user explicitly requested a flush (e.g., via sycl::work_item_fence() or sycl::atomic_ref); or
  • The kernel finished and the user called wait().

The purpose of the property would then be to handle the case where you know you want write-through behavior (e.g., because the allocation is write-only on the device). That seems like a much simpler mental model, to me.

Does that make sense?
