-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved GPU Copy #1976
base: main
Are you sure you want to change the base?
Improved GPU Copy #1976
Conversation
But I have to run the unit tests.
…D copy to a 1d copy, because it happens to be continiously allocated.
…not where I though I found it.
…date_state()` function. The reason is that in some cases this is valid, for example if an edge connects an AccessNode and a MapEntry, because, in that case the map might not be executed. Since the Memlet does not have access to its source and destination node it can not check that, so the test was moved to a location that can do this check. However, it only does the check for AN to AN connections, which is a bit restrictive, but this is something for later.
Once [PR#1976](spcl/dace#1976) is merged in DaCe the code generator is able to handle more Memlets directly as Cuda `memcpy()` calls. This PR modifies the GPU transformation of GT4Py in such a way that these Memlets are no longer transformed into Maps. However, it can only be merged if the DaCe dependency was bumped to a version that includes PR#1976!
This looks good to me, but would be great if @alexnick83 could quickly sanity check - GPU codegen is not my forté |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions/comments, namely the following:
- I share the concern that, for ndim > 2, the conditions tested are insufficient. I would suggest making an issue to revisit in the future.
- I am seeing an issue in
_emit_copy
, for ndim == 2, around lines 1014-1015. Maybe have a look and confirm that this is intended.
I see that we were already referring to these copies as "continuous," but the correct term is "contiguous." Could you do a find and replace-all?
dace/codegen/targets/cuda.py
Outdated
# If we do not have them, then we have to turn this into a Map. | ||
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1 | ||
is_c_order = src_strides[-1] == 1 and dst_strides[-1] == 1 | ||
if is_c_order or is_fortran_order: | ||
continue | ||
elif dims > 2: | ||
if not (src_strides[-1] != 1 or dst_strides[-1] != 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this now, is this condition sufficient? What is there (multiple) discontinuities in other higher dimensions? I would leave a note to check later if any strange multidimensional copies are covered by _emit_copy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is correct, however, in your had you have to apply de Morgans law twice.
I have reformulated the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that just having the last strides be 1 in both arrays when ndim > 2 is not enough to assume that we do not need a Map. But this is how the code was already, so I would leave the exploration of such cases for another PR.
raise NotImplementedError('2D copy only supported with one stride') | ||
src_strides = [src_strides[1 if is_c_order else 0]] | ||
dst_strides = [dst_strides[1 if is_c_order else 0]] | ||
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this already computed in line 983?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See next point.
src_strides = [src_strides[1 if is_c_order else 0]] | ||
dst_strides = [dst_strides[1 if is_c_order else 0]] | ||
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1 | ||
is_c_order = is_fortran_order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expression missing a not
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This here is very strange.
This code seems to check (I am not sure myself because there is no explanation and I am very bad at reading the minds of people from several years ago) if a 2d copy can be reduced to a 1d copy, for example there is no padding in neither the source and destination arrays.
This means a 1d copy is performed, although a 2d object is copied.
Thus the difference between FORTRAN and C order is meaningless, thus they have the same value and indicate if the stride is 1.
If you look at the initial definition of is_{fortran, c}_order
(on line ~984), you will see that this is the same behaviour you had if the copy was 1d in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. The array will initially be marked as C/Fortran ordered based on the last/first stride. However, in the special case ndim=2, the other stride is checked to verify whether the array is indeed contiguous.
if src_strides[-1] != 1 or dst_strides[-1] != 1: | ||
# Currently we only support ND copies when they can be represented | ||
# as a 1D copy or as a 2D strided copy | ||
# NOTE: Not sure if this test is enough, it should also be tested that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To what are you agreeing on?
- That you are not sure what this code is doing.
- That you agree that it is meaningless and can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the test is not enough. it doesn't consider, e.g., Views where you can have multiple discontinuities, and I guess it is like that because the original code predates Views. I also think it is fine to leave it as-is for now, but it should be noted down as a possible source of errors.
state_id, | ||
[src_node, dst_node], | ||
) | ||
elif dims == 2 and is_fortran_order: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering lines 1014-1015, will this condition ever trigger, or will everything be caught by the previous one (line 1113)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my explanation above.
Line 1014-1015 are only executed if a 2d copy can be expressed (known at code generation time) as a 1d copy.
See also the discussion above about if we can remove these 2d -> 1d copy changer because the memlet_copy_to_absolute_strides()
function is doing it anyway.
if src_strides[-1] != 1 or dst_strides[-1] != 1: | ||
# Currently we only support ND copies when they can be represented | ||
# as a 1D copy or as a 2D strided copy | ||
# NOTE: Not sure if this test is enough, it should also be tested that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the test is not enough. it doesn't consider, e.g., Views where you can have multiple discontinuities, and I guess it is like that because the original code predates Views. I also think it is fine to leave it as-is for now, but it should be noted down as a possible source of errors.
src_strides = [src_strides[1 if is_c_order else 0]] | ||
dst_strides = [dst_strides[1 if is_c_order else 0]] | ||
is_fortran_order = src_strides[0] == 1 and dst_strides[0] == 1 | ||
is_c_order = is_fortran_order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. The array will initially be marked as C/Fortran ordered based on the last/first stride. However, in the special case ndim=2, the other stride is checked to verify whether the array is indeed contiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I wrote the original code, I want to make sure I review it properly before it’s in
Before some 2D copies (especially if they had FORTRAN order) were turned into Maps, see issue#1953.
This PR modifies the code generator in such a way that such copies are now handled.
There is some legacy stuff that should also be looked at.