Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16908 object: add client-side target pinging on update retry #16069

Closed

Conversation

karthjyojay
Copy link
Contributor

@karthjyojay karthjyojay commented Mar 10, 2025

In the case of update bulk transfers, the server targets may try to communicate with clients who have not yet established a connection with them. In this case, the connection operation hangs forever. In Google Parallelstore, this is an issue because we do not have control over the client firewalls. To deal with this issue, we will have the server tell the client that they cannot connect to the client due to this scenario and the client will ping the affected targets before retrying their update RPC.

The purpose of this PR is to introduce client-side logic to ping targets involved in an update RPC.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Copy link

github-actions bot commented Mar 10, 2025

Ticket title is 'Modify DAOS to use new mercury changes to implement improved firewall handling'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16908

@karthjyojay karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 389fd52 to e334a8c Compare March 10, 2025 18:21
@karthjyojay karthjyojay changed the title DAOS-16908 object: add client-side target pinging on update retry due to firewall client unreachable error DAOS-16908 object: add client-side target pinging on update retry Mar 10, 2025
@karthjyojay karthjyojay marked this pull request as ready for review March 10, 2025 18:56
@karthjyojay karthjyojay requested review from a team as code owners March 10, 2025 18:56
@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/

@karthjyojay karthjyojay requested a review from a team as a code owner March 10, 2025 20:22
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH im not sure what is the purpose of this PR. there is no description in the PR itself and the ticket looks like an unrelated or maybe just a parent ticket.
This is probably the first step to be able to give more meaningful feedback

Comment on lines 1595 to 1600
if (DAOS_FAIL_CHECK(DAOS_CLIENT_UNREACHABLE) && obj_rpc_is_update(rpc)) {
/** Fault injection - client unreachable. */
D_INFO("enabled fault injection client unreachable");
rc = -DER_CLIENT_UNREACH;
goto out;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this PR just for fault injection?
I have missed where we can return that error code anywhere else from the server..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are other PRs in flight but the point was to get the client code working in parallel with those other changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a good candidate for a feature branch then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move to a feature branch, are you ok with moving to a feature branch after this PR is landed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving this, please reopen as needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can continue to use this branch as a baseline. For more features built on this, we can make new child PRs that are based on this branch as a baseline and then slowly merge onto this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main advantage of a feature branch is that you can land PRs to it incrementally (have to be reviewed and passing tests same way as it it would be for master). then once all PRs in, the branch can be just landed to master without needing to review all the changes again. provided it passes CI testing of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We made a feature branch and changed this PR to merge my dev branch into this feature branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you get a chance, @mchaarawi, can you please take a look again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchaarawi @jolivier23 , sorry, it looks like for some reason Jenkins is not able to build my PR. Jeff told me the way around this is making a new PR. Please hold off on re-reviewing this while I get a new PR ready.

@jolivier23
Copy link
Contributor

TBH im not sure what is the purpose of this PR. there is no description in the PR itself and the ticket looks like an unrelated or maybe just a parent ticket. This is probably the first step to be able to give more meaningful feedback

@karthjyojay probably meant to make this a draft but yeah, you need to have a shorter title and a description in your PR comments.

For context, @mchaarawi the purpose of this change will be to support clients that are behind a firewall. See https://daosio.atlassian.net/browse/DAOS-16906

Long story short, when mercury and libfabric changes are in and enabled, libfabric will return an error when an endpoint marked as behind a firewall can't be contacted. This is to avoid user having to poke massive holes in their firewall for parallelstore to work. Instead, the client will always ping to reestablish/create a connection if one does not already exist.

@karthjyojay
Copy link
Contributor Author

@mchaarawi , my apologies. I meant to just give this to @jolivier23 and @wangdi1. I've put it back into draft.

@karthjyojay karthjyojay marked this pull request as draft March 10, 2025 20:47
@karthjyojay karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 0fe8de6 to 1187c70 Compare March 10, 2025 23:13
@karthjyojay karthjyojay marked this pull request as ready for review March 10, 2025 23:27
jolivier23
jolivier23 previously approved these changes Mar 11, 2025
This change adds logic which pings all targets that are involved in the object retry.
When the retry function gets an error signifying that the server could not reach clients,
the update will ping the relevant targets to establish a connection so the update can
retry.

Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
@karthjyojay karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 72493ba to b0fca66 Compare March 12, 2025 01:36
@karthjyojay karthjyojay changed the base branch from master to feature/firewall March 12, 2025 01:37
@karthjyojay karthjyojay dismissed jolivier23’s stale review March 12, 2025 01:37

The base branch was changed.

@karthjyojay karthjyojay requested a review from jolivier23 March 12, 2025 17:08
Previously, I was getting an error in the unit test saying that HG_Finalize
could not work since the bulk handle was not being freed. This is because we
were incorrectly returning early.

Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants