[DistDGL] fix distributed partition issue #6847

Merged
Rhett-Ying merged 3 commits into dmlc:master from gb_dist_part_issue
Dec 29, 2023

Rhett-Ying
Collaborator

@Rhett-Ying Rhett-Ying commented Dec 27, 2023

Description

The bug in the previous implementation of the srcids update is the line that restores the sorted srcids to their original order: srcids[sort_ids] = srcids. This in-place assignment is incorrect because it sequentially overwrites elements of the sorted array while they are still being read, which produces incorrect results and loses data. The error results in unexpected NID/EID values. The correct way is to write into a new array, like this:

new_srcids = np.zeros_like(srcids)
new_srcids[sort_ids] = srcids
srcids = new_srcids
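
The scatter-into-a-new-array pattern above can be checked on a small input. This is a minimal sketch, not code from the PR: `restore_order` is a hypothetical helper name and the sample values are made up for illustration.

```python
import numpy as np


def restore_order(sorted_vals, sort_ids):
    """Undo `sorted_vals = original[sort_ids]` by scattering into a fresh array.

    `restore_order` is a hypothetical helper, not part of DGL.
    """
    restored = np.zeros_like(sorted_vals)
    # Element i of the sorted array came from position sort_ids[i].
    restored[sort_ids] = sorted_vals
    return restored


# Made-up sample IDs, including a duplicate.
srcids = np.array([30, 10, 20, 10])
sort_ids = np.argsort(srcids, kind="stable")
sorted_srcids = srcids[sort_ids]

# The round trip through sort and restore should give back the input.
roundtrip = restore_order(sorted_srcids, sort_ids)
```

Because the destination is a separate array, the source values are never read after being overwritten.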

In this PR, however, I [Rui] simply replace the previous implementation with a single call: srcids = np.searchsorted(uniques, srcids, side="left"). As I understand it, this should be more efficient in both time and memory than iterating manually. For the record, though, I haven't profiled it.
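
To illustrate what the single `np.searchsorted` call computes, here is a small sketch. It assumes `uniques` is sorted and contains every value in `srcids`. The helper name `map_to_unique_index`, the membership check, and the sample values are illustrative additions and not part of the PR.

```python
import numpy as np


def map_to_unique_index(ids, uniques):
    """Map each ID to its position in the sorted array `uniques`.

    Hypothetical helper. Assumes `uniques` is sorted in ascending order.
    """
    ids = np.asarray(ids)
    idx = np.searchsorted(uniques, ids, side="left")
    # searchsorted returns an insertion point even for missing values,
    # so verify that every ID was actually found (an added safety check).
    in_range = idx < len(uniques)
    matches = np.zeros(len(ids), dtype=bool)
    matches[in_range] = uniques[idx[in_range]] == ids[in_range]
    if not matches.all():
        raise ValueError("some IDs are not present in `uniques`")
    return idx


# Made-up sample IDs.
srcids = np.array([40, 10, 30, 10, 40])
uniques = np.unique(srcids)  # sorted unique values: [10, 30, 40]
local_ids = map_to_unique_index(srcids, uniques)  # [2, 0, 1, 0, 2]
```

The result keeps the original order of `srcids`, so no sort and restore step is needed.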

Previous Implementation

    # build inverse idxes for srcids, dstids and nids
    srcids = np.searchsorted(uniques, srcids, side="left")
    # over-write the srcids and dstids arrays.
    sort_ids = np.argsort(srcids)
    srcids = srcids[sort_ids]

    # TODO: check if wrapping this while loop in a c++ wrapper
    # helps in speeding up the code
    idx1 = 0
    idx2 = 0
    while (idx1 < len(srcids)) and (idx2 < len(uniques)):
        if srcids[idx1] == uniques[idx2]:
            srcids[idx1] = idx2
            idx1 += 1
        elif srcids[idx1] < uniques[idx2]:
            idx1 += 1
        else:
            idx2 += 1

    assert idx1 >= len(srcids), (
        f"Failed to locate all srcids in the uniques array "
        f" len(srcids) = {len(srcids)}, idx1 = {idx1} "
        f" len(uniques) = {len(uniques)}, idx2 = {idx2}"
    )
    srcids[sort_ids] = srcids
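
For comparison, the sort-then-merge approach that the loop above appears to implement can be sketched on raw IDs, with the final scatter written to a fresh array. This is one reading of the intent, not the DGL code: `map_ids_by_merge` is a hypothetical name, and unlike the loop above it raises on an ID missing from `uniques` instead of skipping it.

```python
import numpy as np


def map_ids_by_merge(ids, uniques):
    """Map IDs to positions in sorted `uniques` with a two-pointer merge.

    Hypothetical helper for illustration only.
    """
    ids = np.asarray(ids)
    sort_ids = np.argsort(ids, kind="stable")
    sorted_ids = ids[sort_ids]
    mapped = np.empty(len(ids), dtype=np.int64)

    i = j = 0
    while i < len(sorted_ids) and j < len(uniques):
        if sorted_ids[i] == uniques[j]:
            mapped[i] = j
            i += 1
        elif sorted_ids[i] < uniques[j]:
            # Deviation from the quoted loop: fail instead of skipping.
            raise ValueError(f"ID {sorted_ids[i]} is not in `uniques`")
        else:
            j += 1
    if i < len(sorted_ids):
        raise ValueError("failed to locate all IDs in `uniques`")

    # Scatter into a separate array to restore the original order.
    result = np.empty_like(mapped)
    result[sort_ids] = mapped
    return result


# Made-up sample IDs.
srcids = np.array([40, 10, 30, 10, 40])
uniques = np.unique(srcids)
merged = map_ids_by_merge(srcids, uniques)
reference = np.searchsorted(uniques, srcids, side="left")
```

On this input both approaches give the same mapping, which is the equivalence the PR relies on when it replaces the loop with `np.searchsorted`.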

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Dec 27, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Dec 27, 2023

Commit ID: 84c89360021b22909a8f78172537cda0e47af9bd

Build ID: 1

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Dec 27, 2023

Commit ID: 368d7c657496f97ab628a850092621ee4bd7c39c

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@Rhett-Ying
Collaborator Author

@thvasilo could you try this patch (you could just change the code directly in your installed path)? It passes the verification script I created previously.

@thvasilo
Contributor

Yeah, this seems to solve the particular error we were seeing. I'd like to test with a large dataset under distributed execution as well; I'll have time to do that tomorrow.

@dgl-bot
Collaborator

dgl-bot commented Dec 28, 2023

Commit ID: 197f5a989cf7ee8d58f9c2f9a67f503af7553b8e

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Dec 29, 2023

Commit ID: 0955467cb9345487cf95425b41b277566d9b6da8

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@thvasilo
Contributor

I don't see any regression in our large-scale test either, so this should be good to merge. Thanks @Rhett-Ying!

@Rhett-Ying Rhett-Ying merged commit f758c7c into dmlc:master Dec 29, 2023
@Rhett-Ying Rhett-Ying deleted the gb_dist_part_issue branch December 29, 2023 23:49
@mfbalin
Collaborator

mfbalin commented Jan 1, 2024

@Rhett-Ying Is it possible that this PR is the cause of the CI failures in #6865?

@Rhett-Ying
Collaborator Author

It's another known issue. Let me fix it.

DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024

4 participants