TL/UCP: Add all-reduce ring algorithm #1082
Open: armratner wants to merge 3 commits into openucx:master from armratner:all_reduce_ring_new
Can one of the admins verify this patch?
Working on Gtest.
samnordmann reviewed Mar 5, 2025
Added the gtest.
04e5531 to d8396c6
Signed-off-by: Armen Ratner <armeng@nvidia.com>
Signed-off-by: Armen Ratner <armeng@nvidia.com>
- Add tests for various data types and reduction operations
- Test edge cases with non-power-of-two team sizes and odd message sizes
- Test persistent operations for stability
- Test with different memory types (HOST, CUDA, CUDA_MANAGED where available)
Signed-off-by: Armen Ratner <armeng@nvidia.com>
75c3b77 to 97ddcc2
What
This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:
- A new file, allreduce_ring.c, implementing the ring-based method.
- A build update (Makefile.am) to include the new file.
- Changes to the Allreduce sources (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
- Logic in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.
Why?
A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements the existing Allreduce algorithms (e.g., knomial, sliding window, DBT).
How?
The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:
File Additions/Modifications:
- allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
- Makefile.am: Includes the new file in the build.
- allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
Implementation Details:
- The data is divided into num_chunks chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
- A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
Code Flow:
- Reduction is performed by calling ucc_dt_reduce on each incoming portion.
By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.