
[Performance]Optimize ToBlock in CPU #5192

Closed
wants to merge 2 commits into from

Conversation

peizhou001
Collaborator

Description

Optimize C++ CPU ToBlock() implementation with multi-threading.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Jan 18, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@peizhou001 peizhou001 changed the title [DGL Core]Optimize ToBlock in CPU [Performance]Optimize ToBlock in CPU Jan 18, 2023
@dgl-bot
Collaborator

dgl-bot commented Jan 18, 2023

Commit ID: 046b603e0769857b1b4ed6769a977723799a0e40

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@peizhou001 peizhou001 marked this pull request as ready for review January 18, 2023 01:25
@dgl-bot
Collaborator

dgl-bot commented Jan 18, 2023

Commit ID: 040fd99646ddc5e86ecbbfd9360784f4adffff9d

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

Collaborator

@BarclayII BarclayII left a comment

Is this implementation developed on your own, or is it based on an existing algorithm? If the latter, it's better to give a reference. If the former, we may need a docstring (as a code comment) on how the algorithm works.

IdType value;
};

ConcurrentIdHashMap() : hmap(NULL) {}
Collaborator

Use nullptr.

Collaborator Author

fixed

mask = static_cast<IdType>(new_size - 1);

DGLContext ctx;
ctx.device_type = kDGLCPU;
Collaborator

You could probably just do

DGLContext ctx{kDGLCPU, 0};

Collaborator Author

changed

device->AllocWorkspace(ctx, sizeof(Mapping) * new_size));

// parallel memset can accelerate it
memset((void*)hmap, -1, sizeof(Mapping)*new_size);
Collaborator

Is std::atomic compatible with memset? I feel it's dangerous if std::atomic has internal states other than the value itself.

Collaborator Author

Removed for safety. It is dangerous since C++20, as std::atomic's default constructor is no longer trivial.
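A sketch of initializing the slot array without memset, writing the sentinel through the atomic interface instead of assuming the object representation is memset-compatible; `Slot` and `FillEmpty` are illustrative names, not the PR's code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

// Slot layout modeled on the discussion: an atomic key plus a plain value.
// Since C++20 std::atomic's default constructor value-initializes, so its
// representation should not be assumed safe to overwrite with memset.
struct Slot {
  std::atomic<int64_t> key;
  int64_t value;
};

// Fill every slot with the "empty" sentinel (-1) via atomic stores.
// Each iteration touches a distinct slot, so this loop is trivially
// parallelizable (e.g. with OpenMP) if initialization becomes a bottleneck.
inline void FillEmpty(Slot* slots, size_t n, int64_t empty_key = -1) {
  for (size_t i = 0; i < n; ++i) {
    slots[i].key.store(empty_key, std::memory_order_relaxed);
    slots[i].value = empty_key;
  }
}
```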

IdType* unique_ids_data = unique_ids.Ptr<IdType>();

// Insert all elements in this loop.
#pragma omp parallel
Collaborator

Could we do runtime::parallel_for? I think they are doing the same thing.

Collaborator Author

Yes, they are equivalent. Changed accordingly.
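For readers unfamiliar with the pattern: `runtime::parallel_for` and `#pragma omp parallel` both split an index range across worker threads. A generic sketch of that idea using std::thread follows; DGL's actual `runtime::parallel_for` has its own signature, so this is only an analog.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Split [begin, end) into contiguous chunks and run `body(lo, hi)` on a
// worker thread per chunk, then join. This mirrors what an OpenMP
// parallel-for does for a simple loop.
inline void ParallelFor(
    size_t begin, size_t end,
    const std::function<void(size_t, size_t)>& body,
    size_t num_threads = std::thread::hardware_concurrency()) {
  size_t n = end - begin;
  if (n == 0) return;
  num_threads = std::max<size_t>(1, std::min<size_t>(num_threads, n));
  size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / threads)
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    size_t lo = begin + t * chunk;
    size_t hi = std::min(lo + chunk, end);
    if (lo >= hi) break;
    workers.emplace_back(body, lo, hi);
  }
  for (auto& w : workers) w.join();
}
```

Each worker owns a disjoint index range, so no synchronization is needed inside `body` as long as iterations only write to their own indices.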

protected:
static constexpr IdType kEmptyKey = static_cast<IdType>(-1);
Mapping* hmap;
IdType mask;
Collaborator

The protected/private members (hmap and mask) are better suffixed with _ to distinguish from local variables.

Collaborator Author

fixed


void insert_cas(IdType id, std::vector<bool>& valid, size_t index) {
IdType pos = (id & mask);
IdType delta = 10;
Collaborator

Why is delta set at 10?

Collaborator Author

It was chosen empirically, but I have dropped it: the gain is small and it could lead to misunderstanding.


void Init(size_t n) {
size_t new_size;
for (new_size = 1; new_size < 3 * n; new_size *= 2) {}
Collaborator

Why 3 * n?

Collaborator Author

It was chosen empirically.
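For context, the loop under discussion computes the smallest power of two that is at least `3 * n`. A standalone sketch of that sizing rule, and of why a power of two matters for the mask-based probing:

```cpp
#include <cassert>
#include <cstddef>

// Pick the smallest power of two >= 3 * n. With n keys in a table of
// this size the load factor stays below ~1/3 (few probe collisions),
// and a power-of-two capacity lets `hash & (size - 1)` replace the
// slower `hash % size`.
inline size_t TableSizeFor(size_t n) {
  size_t new_size = 1;
  while (new_size < 3 * n) new_size *= 2;
  return new_size;
}
```

The exact factor 3 is a tuning choice (as the author notes, empirical); any constant > 1 trades memory for shorter probe sequences.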

IdType a = kEmptyKey;
bool success = hmap[pos].key.compare_exchange_weak(a, key);
if (success) {
valid[index] = true;
Collaborator

Would this cause a race condition? Each element in std::vector<bool> may not be a whole processor word, so I don't know if setting valid[a] and valid[a+1] simultaneously will still yield the correct result.

Collaborator Author

According to my understanding, it will not, as cache coherence will take care of this.
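For context on this thread: the packing concern is inherent to `std::vector<bool>`, which stores roughly eight flags per byte, so two threads writing neighboring indices perform read-modify-write on the same byte. A sketch of the issue and of a byte-per-flag alternative (names are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Evidence of the bit-packing: std::vector<bool> cannot hand out real
// bool& references, only proxy objects, because elements share bytes.
static_assert(
    !std::is_same<std::vector<bool>::reference, bool&>::value,
    "vector<bool> hands out proxy references because it packs bits");

// std::vector<char> gives each flag its own byte. Distinct bytes are
// distinct memory locations in the C++ memory model, so different
// threads may write different indices concurrently without a race.
inline void MarkValid(std::vector<char>* valid, size_t index) {
  (*valid)[index] = 1;  // touches one whole byte owned by this index
}
```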

memset((void*)hmap, -1, sizeof(Mapping)*new_size);
}

size_t Update(IdArray ids, IdArray& unique_ids) {
Collaborator

Is Update only allowed to be called once? If so, we could probably merge this into Init (or even better into the ctor)?

Collaborator Author

Make sense, will move it into Init.

induced_edges.push_back(edge_arrays[etype].id);
} else {
induced_edges.push_back(
aten::NullArray(DGLDataType{kDGLInt, sizeof(IdType) * 8, 1}, ctx));
Collaborator

You could use DGLDataTypeTraits<IdType>::dtype to find the correct DGLDataType. Same for other occurrences.
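The traits suggestion maps a C++ id type to its dtype descriptor at compile time instead of spelling out the fields by hand. A standalone analog of the pattern (DGL's real `DGLDataTypeTraits` lives in its headers; the names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A dtype descriptor in the {type code, bits, lanes} style used above.
struct DataType { int code; int bits; int lanes; };
constexpr int kInt = 0;  // illustrative stand-in for kDGLInt

// Specializations map each id type to its descriptor, so call sites can
// write DataTypeTraits<IdType>::dtype instead of hand-building the struct.
template <typename T> struct DataTypeTraits;
template <> struct DataTypeTraits<int32_t> {
  static constexpr DataType dtype{kInt, 32, 1};
};
template <> struct DataTypeTraits<int64_t> {
  static constexpr DataType dtype{kInt, 64, 1};
};
```

The win is that `sizeof(IdType) * 8` style arithmetic lives in one place, and a mismatch between the type and its descriptor becomes impossible at the call site.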

@chang-l
Collaborator

chang-l commented Jan 31, 2023

Hi @peizhou001, I am curious about the perf gain from this PR. Could you please provide more details here? Thanks!

@peizhou001
Collaborator Author

As discussed with @frozenbugs and @Rhett-Ying, this PR will be split into two PRs: one adding a CPUIdHashMap and one with the code change in ToBlock. All comments from @BarclayII will be addressed, and the corresponding code change is in the new PR #5241. So this PR is closed now.

@peizhou001 peizhou001 closed this Jan 31, 2023

@peizhou001 peizhou001 deleted the peizhou/optimize_ToBlock branch August 1, 2023 03:48