-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DistDGL] fix distributed partition issue #6847
Conversation
To trigger regression tests:
|
@thvasilo could you try with this patch(you could just directly change the code in your installed path)? it passes the verify script I previously created. |
Yeah seems to solve the particular error we were seeing. I'd like to test with some large dataset with distributed execution as well, I'll have the time to do that tomorrow. |
Don't see any regression in our large scale test as well, so this should be good to merge. Thanks @Rhett-Ying ! |
@Rhett-Ying Is it possible that this PR is the cause of the CI failures in #6865? |
It's another known issue. let me fix it |
Description
The bug in previous implementation of
srcids update
is the line that recover the sortedsrcids
back to original order:srcids[sort_ids] = srcids
. Such in-place is incorrect as it sequentially overwrites elements in the sorted array, leading to incorrect results and loss of data. This error will results in unexpected value ofNID/EID
. The correct way is to create a new array to store like:But in this PR, I[Rui] just replace previous implementation via a single call:
srcids = np.searchsorted(uniques, srcids, side="left")
. From my understanding, it should be more efficient in both performance and memory compared to iterating over manually. But for the record, I haven't profiled it.Previous Implementation
Checklist
Please feel free to remove inapplicable items for your PR.
Changes