Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New and Improved MapFusion #1629

Merged
merged 230 commits into from
Feb 27, 2025
Merged

Conversation

philip-paul-mueller
Copy link
Collaborator

@philip-paul-mueller philip-paul-mueller commented Aug 22, 2024

This PR introduces a new and improved version of MapFusion.
A summary of the changes can also be found here, it compares the resulting SDFGs generated by the old and new transformation of some unit tests.

Fixed Bugs and removed Limitations

  • The subsets (not the .subset member of the Memlet; I mean the concept) of the new intermediate data descriptor were not computed correctly in some cases, especially in presence of offsets. See the test_offset_correction_range_read(), test_offset_correction_scalar_read() and the test_offset_correction_empty() tests.
  • Upon the propagation of the subsets, due to the changed intermediate, was not handled properly. Essentially, the transformation only updated .subset and ignored .other_subset. Which is correct in most cases but not always. See the test_fusion_intrinsic_memlet_direction() for more.
  • During the check if two maps could be fused the .dynamic property of the Memelts were fully ignored leading to wrong code.
  • The read-write conflict checks were refined, before all arrays needed to be accessed the wrong way, i.e. before a fusion was rejected when one map accessed A[i, j] and the other map was accessing B[i + 1, j]. Now this is possible as long as every access is point wise. See the test_fusion_different_global_accesses() test for an example.
  • The shape of the reduced intermediate is cleaned, i.e. unnecessary dimensions of size 1, are removed, except they were present in the original shape. To make an example, the intermediate array, T, had shape (10, 1, 20) and inside the map was accessed T[__i, 0, __j], then the old transformation would have created an reduced intermediate of shape (1, 1, 1), new its shape is (1). Note if the intermediate has shape (10, 20) instead and would be accessed as T[__i, __j] then a Scalar would have been created. See also the struct_dataflow flag below.

New Flags

  • only_toplevel_maps: If True the transformation will only fuse maps that are located at the top level, i.e. maps inside maps will not be merged.
  • only_inner_maps: If True then the transformation will only fuse maps that are inside other maps.
  • assume_always_shared: If True` then the transformation will assume that every intermediate is shared, i.e. the referenced data is used somewhere else in the SDFG and has to become an output of the fused maps. This will create dead data flow, but avoids a scan of the full SDFG.
  • strict_dataflow: This flag is enabled by default. It has two effects, first it will disable the cleaning of reduced intermediate storage. The second effect is more important as it will preserve a much stricter data flow. Most importantly, if the intermediate array is used downstream (this is not limited to the case that the array is the output of the second map) then the maps will not be fused together. This is mostly to work around some other bugs in DaCe, where other transformations failed to pink up the dependency. Note that the fused map would be correct, the problem are other transformations.

FullMapFusion

This PR also introduced the FullMapFusion pass, which makes use of the FindSingleUseData pass that was introduced in PR#1906.
The FullMapFusion applies MapFusion as long as possible, i.e. fuses all maps that can be fused.
But instead of scanning the SDFG every time an intermediate node has to be classified, i.e. can it be deleted or not, it is done once and then reused which will speed up fusion process as it will remove the need to traverse the full SDFG many times.
This new pass also replaced the direct application of MapFusion in auto_optimizer.

References

Collection of known issues in other transformation:

@philip-paul-mueller philip-paul-mueller changed the title Started with a first version of the map fusion stuff. New and Improved MapFusion Aug 22, 2024
@philip-paul-mueller philip-paul-mueller marked this pull request as draft August 22, 2024 13:55
Now using the 3.9 type hints.
When the function was fixing the innteriour of the second map, it did not remove the readiong.
It almost passes all fuction.
However, the one that needs renaming are not yet done.
…t in the input and output set.

However, it is very simple.
Before it was going to look for the memlet of the consumer or producer.
However, one should actually only look at the memlets that are adjacent to the scope node.
At least this is how the original worked.

I noticed this because of the `buffer_tiling_test.py::test_basic()` test.
I was not yet focused on maps that were nested and not multidimensional.
It seems that the transformation has some problems there.
Whet it now cheks for covering (i.e. if the information to exchange is enough) it will now no longer decend into the maps, but only inspect the first outgoing/incomming edges of the map entrie and exit.
I noticed that the other way was to restrictive, especially for map tiling.
Otherwise we can end up in recursion.
Before it was replacing the elimated variables by zero.
Which actually worked pretty good, but I have now changed that such that `offset()` is used.
I am not sure why I used `replace` in the first place, but I think that there was an issue.
However, I am not sure.
philip-paul-mueller and others added 26 commits February 14, 2025 09:11
In [`ed58523d9b52c21e2fb6f6c4012e8d5f5a096021`](GridTools/gt4py@937e894) we reenabled `autooptimize` and the test passed.
Now we are adding the `blas` tests back, then only the FORTRAN stuff is left.
This is the last tests that we can add.
Remember the [one](spcl@47babf6) without it passed.
And the [one](spcl@abe7051) that hadd all failed.
Now we will only have the numpy test.
I want a clear tripping point, so let's run it again.
After removing everything, except the numpy test, it passes, see spcl@2ba2bff.
So we will now add the fortran test again and it should fail.
It is not inside the fortran tests.
So let's add the many other single files.
This is quite strange.
Let's add the `blas` and `autooptimize` they were also triggering it last time.
Why was it working before, i.e. when all are present.
So let's add some stuff to see what happens.
So now we add some new of the stray python tests:
    add_edge_pair_test.py
    add_state_api_test.py
    argmax_test.py
    array_interface_test.py
    blockreduce_cudatest.py
    buffer_tiling_test.py
    callback_test.py
    call_sdfg_test.py
    chained_nested_tasklet_test.py
    chained_tasklet_test.py
    compile_sdfg_test.py
    config_test.py
    confres_test.py
    conftest.py
    consolidate_edges_test.py
    const_access_test.py
    constant_array_test.py
    consume_chunk_cond_test.py
    consume_test.py
    control_flow_test.py
    copynd_test.py
    cpp_tasklet_test.py
    cppunparse_test.py
    cr_complex_test.py
    cuda_block_test.py
    cuda_grid2d_test.py
    cuda_grid_test.py
    cuda_highdim_kernel_test.py
    cuda_smem2d_test.py
    cuda_smem_test.py
    custom_build_folder_test.py
    custom_reduce_test.py
    datadesc_test.py
    default_storage_test.py
    different_stride_test.py
    duplicate_arg_test.py
    duplicate_naming_test.py
    dynamic_sdfg_functions_test.py
    dynamic_tb_map_cudatest.py
    enumerator_test.py
    external_module.py
    external_module_test.py
    global_resolver_test.py
    graph_test.py
    half_cudatest.py
    halfvec_cudatest.py
    host_map_host_data_test.py
    ifchain_test.py
    implicit_sdfg_test.py
    indirection_test.py

The tests that are **not yet** added back again are:
    inline_chain_test.py
    inline_external_edges_test.py
    inline_noinput_test.py
    inline_noncontig_dim_test.py
    inline_nonsink_access_test.py
    inline_symbol_test.py
    inlining_test.py
    instrumentation_test.py
    intarg_test.py
    interstate_assignment_test.py
    kernel_fusion_cudatest.py
    lib_reuse_test.py
    local_inline_test.py
    map_dim_shuffle_test.py
    map_indirect_array_test.py
    mapreduce_test.py
    memlet_lifetime_validation_test.py
    memlet_propagation_decreasing_test.py
    memlet_propagation_squeezing_test.py
    memlet_propagation_test.py
    memlet_propagation_volume_test.py
    mlir_tasklet_test.py
    multi_inline_test.py
    multi_output_scope_test.py
    multiple_cr_test.py
    multiple_tasklet_test.py
    multiprogram_cudatest.py
    multistate_init_test.py
    multistream_copy_cudatest.py
    multistream_custom_cudatest.py
    multistream_kernel_cudatest.py
    ndloop_test.py
    nested_control_flow_test.py
    nested_cr_test.py
    nested_loop_test.py
    nested_reduce_test.py
    nested_sdfg_python_test.py
    nested_sdfg_scalar_test.py
    nested_sdfg_test.py
    nested_stream_test.py
    nested_strides_test.py
    nested_symbol_partial_test.py
    nested_symbol_test.py
    nested_vector_type_test.py
    nest_subgraph_test.py
    numpy_bool_input_test.py
    offset_stride_test.py
    openmp_test.py
    parallel_sections_test.py
I have now swapped the top level tests.
I.e. all  that were actived before are now disabled and all that were disabled before are now activated.
Let's hope that it fails.
When I run with all single test files then it fails, when I run with one part it (the one not listed in `SMALLER_FRACTION`) then it works.
And if I only add them that are listed in `SMALLER_FRACTION` it still fails.
So lets see if it is still fails if all are run.
It seems that it is a caching bug, because as the last commit is showing it just changes the signature and ow it works.
Now we have to make sure that everything again is activated.
I have now reverted my fix for the `test_sum()` issue.
So the CI will now fail.
However, if I then change the definition of `N` and `M` so lets say
```
N = 22
M = 33
```
and run the CI then again and it passes, then this is something going on.
I now made the change to `N := 11` and `M := 20` and I do not expect that it fails.
So we have now disabled it and implemented one fix that was suggested.
So we now do the call on our own.
Copy link
Contributor

@acalotoiu acalotoiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks Good to me now.

@phschaad phschaad added this pull request to the merge queue Feb 27, 2025
Merged via the queue into spcl:main with commit b74e62c Feb 27, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants