Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved GPU Copy #1976

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

philip-paul-mueller
Copy link
Collaborator

Before some 2D copies (especially if they had FORTRAN order) were turned into Maps, see issue#1953.
This PR modifies the code generator in such a way that such copies are now handled.

There is some legacy stuff that should also be looked at.

philip-paul-mueller and others added 16 commits March 14, 2025 10:18
But I have to run the unit tests.
…D copy to a 1d copy, because it happens to be continiously allocated.
…date_state()` function.

The reason is that in some cases this is valid, for example if an edge connects an AccessNode and a MapEntry, because, in that case the map might not be executed.
Since the Memlet does not have access to its source and destination node it can not check that, so the test was moved to a location that can do this check.
However, it only does the check for AN to AN connections, which is a bit restrictive, but this is something for later.
philip-paul-mueller added a commit to philip-paul-mueller/gt4py that referenced this pull request Mar 17, 2025
Once [PR#1976](spcl/dace#1976) is merged in DaCe the code generator is able to handle more Memlets directly as Cuda `memcpy()` calls.
This PR modifies the GPU transformation of GT4Py in such a way that these Memlets are no longer transformed into Maps.

However, it can only be merged if the DaCe dependency was bumped to a version that includes PR#1976!
@phschaad
Copy link
Collaborator

This looks good to me, but would be great if @alexnick83 could quickly sanity check - GPU codegen is not my forté

Copy link
Contributor

@alexnick83 alexnick83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions/comments, namely the following:

  • I share the concern that, for ndim > 2, the conditions tested are insufficient. I would suggest making an issue to revisit in the future.
  • I am seeing an issue in _emit_copy, for ndim == 2, around lines 1014-1015. Maybe have a look and confirm that this is intended.

I see that we were already referring to these copies as "continuous," but the correct term is "contiguous." Could you do a find and replace-all?

# If we do not have them, then we have to turn this into a Map.
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1
is_c_order = src_strides[-1] == 1 and dst_strides[-1] == 1
if is_c_order or is_fortran_order:
continue
elif dims > 2:
if not (src_strides[-1] != 1 or dst_strides[-1] != 1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this now, is this condition sufficient? What is there (multiple) discontinuities in other higher dimensions? I would leave a note to check later if any strange multidimensional copies are covered by _emit_copy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct, however, in your had you have to apply de Morgans law twice.
I have reformulated the condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that just having the last strides be 1 in both arrays when ndim > 2 is not enough to assume that we do not need a Map. But this is how the code was already, so I would leave the exploration of such cases for another PR.

raise NotImplementedError('2D copy only supported with one stride')
src_strides = [src_strides[1 if is_c_order else 0]]
dst_strides = [dst_strides[1 if is_c_order else 0]]
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this already computed in line 983?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See next point.

src_strides = [src_strides[1 if is_c_order else 0]]
dst_strides = [dst_strides[1 if is_c_order else 0]]
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1
is_c_order = is_fortran_order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expression missing a not?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here is very strange.
This code seems to check (I am not sure myself because there is no explanation and I am very bad at reading the minds of people from several years ago) if a 2d copy can be reduced to a 1d copy, for example there is no padding in neither the source and destination arrays.
This means a 1d copy is performed, although a 2d object is copied.
Thus the difference between FORTRAN and C order is meaningless, thus they have the same value and indicate if the stride is 1.
If you look at the initial definition of is_{fortran, c}_order (on line ~984), you will see that this is the same behaviour you had if the copy was 1d in the first place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now. The array will initially be marked as C/Fortran ordered based on the last/first stride. However, in the special case ndim=2, the other stride is checked to verify whether the array is indeed contiguous.

if src_strides[-1] != 1 or dst_strides[-1] != 1:
# Currently we only support ND copies when they can be represented
# as a 1D copy or as a 2D strided copy
# NOTE: Not sure if this test is enough, it should also be tested that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To what are you agreeing on?

  • That you are not sure what this code is doing.
  • That you agree that it is meaningless and can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the test is not enough. it doesn't consider, e.g., Views where you can have multiple discontinuities, and I guess it is like that because the original code predates Views. I also think it is fine to leave it as-is for now, but it should be noted down as a possible source of errors.

state_id,
[src_node, dst_node],
)
elif dims == 2 and is_fortran_order:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering lines 1014-1015, will this condition ever trigger, or will everything be caught by the previous one (line 1113)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my explanation above.
Line 1014-1015 are only executed if a 2d copy can be expressed (known at code generation time) as a 1d copy.
See also the discussion above about if we can remove these 2d -> 1d copy changer because the memlet_copy_to_absolute_strides() function is doing it anyway.

if src_strides[-1] != 1 or dst_strides[-1] != 1:
# Currently we only support ND copies when they can be represented
# as a 1D copy or as a 2D strided copy
# NOTE: Not sure if this test is enough, it should also be tested that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the test is not enough. it doesn't consider, e.g., Views where you can have multiple discontinuities, and I guess it is like that because the original code predates Views. I also think it is fine to leave it as-is for now, but it should be noted down as a possible source of errors.

src_strides = [src_strides[1 if is_c_order else 0]]
dst_strides = [dst_strides[1 if is_c_order else 0]]
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1
is_c_order = is_fortran_order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now. The array will initially be marked as C/Fortran ordered based on the last/first stride. However, in the special case ndim=2, the other stride is checked to verify whether the array is indeed contiguous.

Copy link
Collaborator

@tbennun tbennun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I wrote the original code, I want to make sure I review it properly before it’s in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants