Broadcast RPC Infrastructure Changes #758


Merged
2 commits merged into LLNL:dev on Mar 13, 2023

Conversation

rgmiller
Contributor

This PR makes some basic changes to the coll_request struct and its associated functions to allow us to return data from broadcast RPCs.

Description

Specific changes to the struct:

  1. Adds a variable to optionally disable the automatic cleanup of the coll_request struct once the RPC has completed, so that the calling thread has a chance to actually read the returned data.
  2. Adds a condition variable and associated mutex that the progress thread can signal to indicate that all the child responses have been received.

Motivation and Context

In order to properly implement an "ls" utility, the server that is local to the "ls" client needs to fetch file metadata from all the other servers. This is best handled by a broadcast RPC that all the other servers can respond to. However, this is the first broadcast RPC that requires recipients to reply with data, and the current implementation of coll_request and its associated functions doesn't support this use case.

How Has This Been Tested?

Existing unit tests pass. (This doesn't test the new functionality, but it does at least offer some evidence that the changes haven't broken anything.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

Basic changes to the coll_request struct to allow us to return data from
broadcast RPCs.  Specifically:
1) Add a variable to optionally disable the automatic cleanup of the
   coll_request struct once the RPC has completed so that the
   calling thread has a chance to actually read the returned data
2) Add a condition variable and associated mutex that the progress
   thread can signal to indicate that all the child responses
   have been received.
@rgmiller
Contributor Author

Note: This PR is NOT ready to be merged. In particular, I think I'm going to need to add another hg_bulk_t to the struct for the reply data. I'm creating the PR now so that people can start looking at the changes.

@adammoody
Collaborator

Thanks, @rgmiller . Looks good to me.

@rgmiller
Contributor Author

rgmiller commented Mar 3, 2023

There's one TODO comment I'd like someone to take a look at: unifyfs_group_rpc.c:307.

It's not related to the code changes I've made; it's just something I noticed when I was mucking about nearby. Specifically, if we fail to acquire a child handle, we don't abort the function. We just set the handle to NULL and keep going. I'm not sure that's correct behavior.

@MichaelBrim
Collaborator

Regarding the TODO: the reason we don't return immediately in that error case is that doing so would leak the allocated coll_request structure and its member array allocations. However, it looks like we're missing the logic to notice that we failed to get one of the child handles and to clean up the structure before returning NULL. Unfortunately, it's not as easy as just calling collective_cleanup(), since that would prematurely free the original group rpc request handle, which we use in the callers to collective_create() to send back an error. This is complicated enough that we should probably open an issue and address it in a separate PR.

Removing a "TODO" comment about error handling in collective_create().
The issue it raised isn't directly related to the broadcast RPC
infrastructure changes and fixing it is a complex enough problem to
warrant its own GitHub issue.
@adammoody adammoody merged commit cad1b62 into LLNL:dev Mar 13, 2023
@rgmiller rgmiller deleted the bcast_rpc_infrastructure branch August 1, 2023 14:15