DAOS-16908 object: add client-side target pinging on update retry #16069

karthjyojay · 2025-03-10T18:15:19Z

During update bulk transfers, server targets may try to communicate with clients that have not yet established a connection with them. When that happens, the connection operation hangs forever. This is a problem in Google Parallelstore because we do not have control over the client firewalls. To deal with it, the server will tell the client that it cannot connect to the client in this scenario, and the client will ping the affected targets before retrying its update RPC.

The purpose of this PR is to introduce client-side logic to ping targets involved in an update RPC.
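As a rough illustration of the flow described above, here is a minimal, self-contained C sketch of a client that pings the involved targets before retrying an update. It is not the DAOS implementation: every name (`sketch_cluster`, `sketch_ping`, `sketch_update`, `sketch_update_with_retry`, `SKETCH_ERR_CLIENT_UNREACH`) is hypothetical, and the "server" is simulated by a plain array of per-target connection flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "client unreachable" error; not a DAOS value. */
#define SKETCH_ERR_CLIENT_UNREACH (-1)
#define SKETCH_MAX_TARGETS        8

/* Simulated server view: connected[i] is true once target i has a
 * client-initiated connection it can use for the bulk transfer. */
struct sketch_cluster {
	bool connected[SKETCH_MAX_TARGETS];
	int  ping_count; /* number of pings issued, kept for inspection */
};

/* Simulated ping: a client-initiated message that establishes the connection. */
void
sketch_ping(struct sketch_cluster *c, size_t tgt)
{
	assert(tgt < SKETCH_MAX_TARGETS);
	c->connected[tgt] = true;
	c->ping_count++;
}

/* Simulated update RPC: fails if any involved target cannot reach the client. */
int
sketch_update(const struct sketch_cluster *c, const size_t *tgts, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		assert(tgts[i] < SKETCH_MAX_TARGETS);
		if (!c->connected[tgts[i]])
			return SKETCH_ERR_CLIENT_UNREACH;
	}
	return 0;
}

/* Retry loop: on the unreachable error, ping every involved target and
 * then resend the update, up to max_retries times. */
int
sketch_update_with_retry(struct sketch_cluster *c, const size_t *tgts,
			 size_t nr, int max_retries)
{
	int rc = sketch_update(c, tgts, nr);

	for (int attempt = 0;
	     rc == SKETCH_ERR_CLIENT_UNREACH && attempt < max_retries;
	     attempt++) {
		for (size_t i = 0; i < nr; i++)
			sketch_ping(c, tgts[i]);
		rc = sketch_update(c, tgts, nr);
	}
	return rc;
}
```

In this sketch, calling `sketch_update_with_retry()` on a zero-initialized `sketch_cluster` with three targets fails the first update, pings those three targets, and succeeds on the retry. Any other error is returned to the caller unchanged, so the ping path only runs for the unreachable case.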

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2025-03-10T18:15:36Z

Ticket title is 'Modify DAOS to use new mercury changes to implement improved firewall handling'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16908

daosbuild1 · 2025-03-10T19:00:19Z

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/

daosbuild1 · 2025-03-10T19:36:38Z

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/

mchaarawi

TBH I'm not sure what the purpose of this PR is. There is no description in the PR itself, and the ticket looks unrelated or maybe just a parent ticket.
Adding one is probably the first step to being able to give more meaningful feedback.

src/include/daos/pool.h

src/include/daos_errno.h

src/object/obj_internal.h

mchaarawi · 2025-03-10T20:33:51Z

src/object/srv_obj.c

+	if (DAOS_FAIL_CHECK(DAOS_CLIENT_UNREACHABLE) && obj_rpc_is_update(rpc)) {
+		/** Fault injection - client unreachable. */
+		D_INFO("enabled fault injection client unreachable");
+		rc = -DER_CLIENT_UNREACH;
+		goto out;
+	}


Is this PR just for fault injection?
I could not find anywhere else where the server returns that error code.

Yes, there are other PRs in flight but the point was to get the client code working in parallel with those other changes.

maybe a good candidate for a feature branch then?

If we go with a feature branch, are you OK with us moving to it after this PR has landed?

Resolving this, please reopen as needed.

We can continue to use this branch as a baseline. For more features built on this, we can make new child PRs based on this branch and then slowly merge them into this PR.

The main advantage of a feature branch is that you can land PRs to it incrementally (they have to be reviewed and pass tests the same way as they would for master). Then, once all PRs are in, the branch can just be landed to master without needing to review all the changes again, provided it passes CI testing of course.

We made a feature branch and changed this PR to merge my dev branch into this feature branch.

When you get a chance, @mchaarawi, can you please take a look again?

@mchaarawi @jolivier23 , sorry, it looks like for some reason Jenkins is not able to build my PR. Jeff told me the way around this is making a new PR. Please hold off on re-reviewing this while I get a new PR ready.

src/object/cli_obj.c

src/pool/cli.c

jolivier23 · 2025-03-10T20:44:10Z

TBH I'm not sure what the purpose of this PR is. There is no description in the PR itself, and the ticket looks unrelated or maybe just a parent ticket. Adding one is probably the first step to being able to give more meaningful feedback.

@karthjyojay probably meant to make this a draft but yeah, you need to have a shorter title and a description in your PR comments.

For context, @mchaarawi the purpose of this change will be to support clients that are behind a firewall. See https://daosio.atlassian.net/browse/DAOS-16906

Long story short, when the mercury and libfabric changes are in and enabled, libfabric will return an error when an endpoint marked as behind a firewall can't be contacted. This is to avoid users having to poke massive holes in their firewall for Parallelstore to work. Instead, the client will always ping to reestablish or create a connection if one does not already exist.
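To make the "ping only if a connection does not already exist" idea concrete, here is a small C sketch of a client-side record of connected targets. It is purely illustrative: the `sketch_conn_map` type and its functions are hypothetical, the 64-target limit is an arbitrary assumption, and the actual ping RPC is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_TARGETS 64 /* arbitrary limit chosen for the sketch */

/* Hypothetical per-client record of targets that already have a connection. */
struct sketch_conn_map {
	uint64_t bits;       /* bit i set => target i is connected */
	unsigned pings_sent; /* pings issued so far */
};

bool
sketch_is_connected(const struct sketch_conn_map *m, unsigned tgt)
{
	assert(tgt < SKETCH_NR_TARGETS);
	return ((m->bits >> tgt) & 1u) != 0;
}

/* Ping only when no connection is recorded; returns true if a ping was sent. */
bool
sketch_ensure_connected(struct sketch_conn_map *m, unsigned tgt)
{
	if (sketch_is_connected(m, tgt))
		return false;
	m->pings_sent++; /* stands in for the real ping RPC */
	m->bits |= (uint64_t)1 << tgt;
	return true;
}

/* Forget a connection, e.g. after the transport reports it was dropped. */
void
sketch_mark_disconnected(struct sketch_conn_map *m, unsigned tgt)
{
	assert(tgt < SKETCH_NR_TARGETS);
	m->bits &= ~((uint64_t)1 << tgt);
}
```

With this shape, repeated calls to `sketch_ensure_connected()` for the same target send at most one ping until `sketch_mark_disconnected()` clears the record, which keeps the extra traffic proportional to the number of new or lost connections.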

karthjyojay · 2025-03-10T20:46:41Z

@mchaarawi , my apologies. I meant to just give this to @jolivier23 and @wangdi1. I've put it back into draft.

src/object/cli_obj.c

This change adds logic that pings all targets involved in the object retry. When the retry function gets an error signifying that the server could not reach the client, the update pings the relevant targets to establish a connection so the update can retry.

Signed-off-by: Yokesh Jayakumar <karthj@google.com>

The base branch was changed.


karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 389fd52 to e334a8c Compare March 10, 2025 18:21

karthjyojay changed the title ~~DAOS-16908 object: add client-side target pinging on update retry due to firewall client unreachable error~~ DAOS-16908 object: add client-side target pinging on update retry Mar 10, 2025

karthjyojay requested review from jolivier23 and wangdi1 March 10, 2025 18:54

karthjyojay marked this pull request as ready for review March 10, 2025 18:56

karthjyojay requested review from a team as code owners March 10, 2025 18:56

karthjyojay requested a review from a team as a code owner March 10, 2025 20:22

mchaarawi requested changes Mar 10, 2025


karthjyojay marked this pull request as draft March 10, 2025 20:47

jolivier23 reviewed Mar 10, 2025


karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 0fe8de6 to 1187c70 Compare March 10, 2025 23:13

karthjyojay marked this pull request as ready for review March 10, 2025 23:27

karthjyojay requested review from mchaarawi and jolivier23 March 10, 2025 23:27

jolivier23 previously approved these changes Mar 11, 2025


karthjyojay added 5 commits March 12, 2025 01:35

Add missing gurt unit test for new error code

e03086f

Resolve partial commit feedback

4510fe4

Signed-off-by: Yokesh Jayakumar <karthj@google.com>

Change error name to DER_NO_CONNECTION

e82406b

Signed-off-by: Yokesh Jayakumar <karthj@google.com>

Change error name to DER_RECONNECT

b0fca66

Signed-off-by: Yokesh Jayakumar <karthj@google.com>

karthjyojay force-pushed the dev/karthj/firewall-simplification branch from 72493ba to b0fca66 Compare March 12, 2025 01:36

karthjyojay changed the base branch from master to feature/firewall March 12, 2025 01:37

karthjyojay requested a review from jolivier23 March 12, 2025 17:08

Move return rc to end of function.

6095dc6

Previously, I was getting an error in the unit test saying that HG_Finalize could not work since the bulk handle was not being freed. This is because we were incorrectly returning early.

Signed-off-by: Yokesh Jayakumar <karthj@google.com>

karthjyojay closed this Mar 12, 2025
