Broadcast RPC Infrastructure Changes #758


Merged
2 commits merged into LLNL:dev on Mar 13, 2023

Conversation

rgmiller
Contributor

This PR makes some basic changes to the coll_request struct and its associated functions to allow us to return data from broadcast RPCs.

Description

Specific changes to the struct:

  1. Adds a variable to optionally disable the automatic cleanup of the coll_request struct once the RPC has completed, so that the calling thread has a chance to actually read the returned data.
  2. Adds a condition variable and associated mutex that the progress thread can signal to indicate that all the child responses have been received.

Motivation and Context

In order to properly implement an "ls" utility, the server that is local to the "ls" client needs to fetch file metadata from all the other servers. This is best handled by a broadcast RPC that all the other servers can respond to. However, this is the first broadcast RPC that requires recipients to reply with data, and the current implementation of coll_request and its associated functions doesn't support this use case.

How Has This Been Tested?

Existing unit tests pass. (This doesn't test the new functionality, but it does at least offer some evidence that the changes haven't broken anything.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Testing (addition of new tests or update to current tests)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

Basic changes to the coll_request struct to allow us to return data from
broadcast RPCs.  Specifically:
1) Add a variable to optionally disable the automatic cleanup of the
   coll_request struct once the RPC has completed so that the
   calling thread has a chance to actually read the returned data
2) Add a condition variable and associated mutex that the progress
   thread can signal to indicate that all the child responses
   have been received.
@rgmiller
Contributor Author

Note: This PR is NOT ready to be merged. In particular, I think I'm going to need to add another hg_bulk_t to the struct for the reply data. I'm creating the PR now so that people can start looking at the changes.

@adammoody
Collaborator

Thanks, @rgmiller . Looks good to me.

@rgmiller
Contributor Author

rgmiller commented Mar 3, 2023

There's one TODO comment I'd like someone to take a look at: unifyfs_group_rpc.c:307.

It's not related to the code changes I've made; it's just something I noticed when I was mucking about nearby. Specifically, if we fail to acquire a child handle, we don't abort the function. We just set the handle to NULL and keep going. I'm not sure that's correct behavior.

@MichaelBrim
Collaborator

Regarding the TODO: the reason we don't return immediately in that error case is that doing so would leak the allocated coll_request structure and its member array allocations. However, it looks like we're missing the logic to notice that we failed to get one of the child handles and to clean up the structure before returning NULL. Unfortunately, it's not as easy as just calling collective_cleanup(), since that would prematurely free the original group rpc request handle, which we use in the callers to collective_create() to send back an error. This is complicated enough that we should probably open an issue and address it in a separate PR.

Removing a "TODO" comment about error handling in collective_create().
The issue it raised isn't directly related to the broadcast RPC
infrastructure changes and fixing it is a complex enough problem to
warrant its own GitHub issue.
@adammoody adammoody merged commit cad1b62 into LLNL:dev Mar 13, 2023
@rgmiller rgmiller deleted the bcast_rpc_infrastructure branch August 1, 2023 14:15