-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16908 object: add client-side target pinging on update retry #16069
Conversation
Ticket title is 'Modify DAOS to use new mercury changes to implement improved firewall handling' |
389fd52
to
e334a8c
Compare
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16069/2/testReport/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH im not sure what is the purpose of this PR. there is no description in the PR itself and the ticket looks like an unrelated or maybe just a parent ticket.
This is probably the first step to be able to give more meaningful feedback
src/object/srv_obj.c
Outdated
if (DAOS_FAIL_CHECK(DAOS_CLIENT_UNREACHABLE) && obj_rpc_is_update(rpc)) { | ||
/** Fault injection - client unreachable. */ | ||
D_INFO("enabled fault injection client unreachable"); | ||
rc = -DER_CLIENT_UNREACH; | ||
goto out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this PR just for fault injection?
I have missed where we can return that error code anywhere else from the server..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are other PRs in flight but the point was to get the client code working in parallel with those other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a good candidate for a feature branch then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move to a feature branch, are you ok with moving to a feature branch after this PR is landed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolving this, please reopen as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can continue to use this branch as a baseline. For more features built on this, we can make new child PRs that are based on this branch as a baseline and then slowly merge onto this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main advantage of a feature branch is that you can land PRs to it incrementally (have to be reviewed and passing tests same way as it it would be for master). then once all PRs in, the branch can be just landed to master without needing to review all the changes again. provided it passes CI testing of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We made a feature branch and changed this PR to merge my dev branch into this feature branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you get a chance, @mchaarawi, can you please take a look again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mchaarawi @jolivier23 , sorry, it looks like for some reason Jenkins is not able to build my PR. Jeff told me the way around this is making a new PR. Please hold off on re-reviewing this while I get a new PR ready.
@karthjyojay probably meant to make this a draft but yeah, you need to have a shorter title and a description in your PR comments. For context, @mchaarawi the purpose of this change will be to support clients that are behind a firewall. See https://daosio.atlassian.net/browse/DAOS-16906 Long story short, when mercury and libfabric changes are in and enabled, libfabric will return an error when an endpoint marked as behind a firewall can't be contacted. This is to avoid user having to poke massive holes in their firewall for parallelstore to work. Instead, the client will always ping to reestablish/create a connection if one does not already exist. |
@mchaarawi , my apologies. I meant to just give this to @jolivier23 and @wangdi1. I've put it back into draft. |
0fe8de6
to
1187c70
Compare
This change adds logic which pings all targets that are involved in the object retry. When the retry function gets an error signifying that the server could not reach clients, the update will ping the relevant targets to establish a connection so the update can retry. Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
Signed-off-by: Yokesh Jayakumar <karthj@google.com>
72493ba
to
b0fca66
Compare
Previously, I was getting an error in the unit test saying that HG_Finalize could not work since the bulk handle was not being freed. This is because we were incorrectly returning early. Signed-off-by: Yokesh Jayakumar <karthj@google.com>
In the case of update bulk transfers, the server targets may try to communicate with clients who have not yet established a connection with them. In this case, the connection operation hangs forever. In Google Parallelstore, this is an issue because we do not have control over the client firewalls. To deal with this issue, we will have the server tell the client that they cannot connect to the client due to this scenario and the client will ping the affected targets before retrying their update RPC.
The purpose of this PR is to introduce client-side logic to ping targets involved in an update RPC.
Steps for the author:
After all prior steps are complete: