-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Performance]Optimize ToBlock in CPU #5192
Conversation
To trigger regression tests:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this implementation developed by your own, or is it based on some existing algorithm? If the latter, it's better to give a reference. If the former, we may need a docstring (as code comment) on how the algorithm works.
IdType value; | ||
}; | ||
|
||
ConcurrentIdHashMap() : hmap(NULL) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use nullptr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
mask = static_cast<IdType>(new_size - 1); | ||
|
||
DGLContext ctx; | ||
ctx.device_type = kDGLCPU; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably just do
DGLContext ctx{kDGLCPU, 0};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
device->AllocWorkspace(ctx, sizeof(Mapping) * new_size)); | ||
|
||
// parallel memset can accelerate it | ||
memset((void*)hmap, -1, sizeof(Mapping)*new_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is std::atomic
compatible with memset
? I feel it's dangerous if std::atomic
has internal states other than the value itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove for security. It is dangerous after C++ 20 as the constructor is non-trivial.
IdType* unique_ids_data = unique_ids.Ptr<IdType>(); | ||
|
||
// Insert all elements in this loop. | ||
#pragma omp parallel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do runtime::parallel_for
? I think they are doing the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are equal. Changed for fix.
protected: | ||
static constexpr IdType kEmptyKey = static_cast<IdType>(-1); | ||
Mapping* hmap; | ||
IdType mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protected/private members (hmap
and mask
) are better suffixed with _
to distinguish from local variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
void insert_cas(IdType id, std::vector<bool>& valid, size_t index) { | ||
IdType pos = (id & mask); | ||
IdType delta = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is delta
set at 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a experience, but drop it as the gain is small but lead misunderstanding.
|
||
void Init(size_t n) { | ||
size_t new_size; | ||
for (new_size = 1; new_size < 3 * n; new_size *= 2) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 3 * n?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is gained from experience
IdType a = kEmptyKey; | ||
bool success = hmap[pos].key.compare_exchange_weak(a, key); | ||
if (success) { | ||
valid[index] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this cause a race condition? Each element in std::vector<bool>
may not be a whole processor word, so I don't know if setting valid[a]
and valid[a+1]
simultaneously will still yield the correct result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to my understanding, it will not. As cache coherence will take care of this.
memset((void*)hmap, -1, sizeof(Mapping)*new_size); | ||
} | ||
|
||
size_t Update(IdArray ids, IdArray& unique_ids) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is Update
only allowed to be called once? If so, we could probably merge this into Init
(or even better into the ctor)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, will move it into Init.
induced_edges.push_back(edge_arrays[etype].id); | ||
} else { | ||
induced_edges.push_back( | ||
aten::NullArray(DGLDataType{kDGLInt, sizeof(IdType) * 8, 1}, ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use DGLDataTypeTraits<IdType>::dtype
to find the correct DGLDataType
. Same for other occurrences.
Hi @peizhou001, I am curious about the perf gain by this PR. Could you pls provide more details here? Thanks! |
As discussed with @frozenbugs and @Rhett-Ying , this PR will be split-ed to 2 PRs, which are adding a CPUIdHashMap and code change in ToBlock. All comments from @BarclayII will be adopted and the corresponding code change is in new PR #5241. So this PR is closed now. |
As discussed with @frozenbugs and @Rhett-Ying , this PR will be split-ed to 2 PRs, which are adding a CPUIdHashMap and code change in ToBlock. All comments from @BarclayII will be adopted and the corresponding code change is in new PR #5241. |
Description
Optimize C++ CPU ToBlock() implementation with multi-threading.
Checklist
Please feel free to remove inapplicable items for your PR.
Changes