-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Performance] Improve COOToCSR implementation #5508
Conversation
To trigger regression tests:
|
cf8f18b
to
53f3b18
Compare
@dgl-bot run |
8e21e6a
to
ae93c02
Compare
0dce646
to
4b342f5
Compare
Hi @anko-intel , please let us know when the PR is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't want to have this ifdef branch querying omp_get_num_threads from inside parallel region would work for both cases. LGTM
389924a
to
6a8ae32
Compare
@frozenbugs, the PR is ready for review |
+ remove COOToCSR checks which are intended to ensure that each version of algorythm is tested. But it doesn't check it
The change consists of: - Add separate 1 thread algorithm for small matrices. - Improve choosing algorithm according to measured performance - Use index type when possible - In UnSortedSparseCOOToCSR : - Lower number of used thread for relatively small matrix - Allocate partial sum array before parallel section - Rearrange code to avoid additional Bp patching at the end
@frozenbugs, @BarclayII, could you review and merge if there are no issues? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?
No. The best way to measure the performance of this change is add c++-level benchmark but we don't have any such framework. As a workaround, we could add python-level benchmarks as here. we could measure within a python-level call which will call |
@dgl-bot run r6i.16xlarge bench_format dmlc/master |
Start the Regression test. View at https://dgl-jenkins-eksvpc-2136217999.us-west-2.elb.amazonaws.com/job/dgl/job/PR-5508/14/display/redirect |
We already have a benchmark testing COOToCSR, but it only covers 1 case of the 4 cases (sorted, unsortedSmall, unsortedSparse, unsortedDense). It is better to refactor and add the benchmark to our new framework directly. @Rhett-Ying how to add a new benchmark to our new framework? |
just refer to existing benchmark and add it into https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks. |
As discussed w/ @Rhett-Ying offline, it is not easy to add the benchmark to our new framework directly, can you refactor the existing benchmark to cover all 4 cases, and we will migrate them to the new framework once it is ready. |
Finished the Regression test. Result table is at https://dgl-asv-data.s3-us-west-2.amazonaws.com/4e7094a953ce8c056858e5907de098cae9ac79a9_r6i16xlarge/results/result.csv. Jenkins job link is https://dgl-jenkins-eksvpc-2136217999.us-west-2.elb.amazonaws.com/job/dgl/job/PR-5508/14/display/redirect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comments.
According to test:
@frozenbugs, I don't have access to https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks
Maybe a good starting point for c++ test could be the POC of benchmark made on top of gtest ( see https://gist.github.com/anko-intel/b9c4ab9d0eefd2fd496a8ea38a9a11db ) which I use for measurements.
I could try also to modify python benchmark to check performance of all algorithm versions.
Yes, please modify python benchmark to cover all cases. |
d9082ec
to
b047751
Compare
@frozenbugs, I modified the benchmark (b047751) to have results from all COOToCSR algorithm on CPU.
On r6i.16xlarge (ami-0343fe6cfc8a09c18 - Ubuntu 20.04) the modified benchmark gives the following results:
Is this benchmark ok for you? |
Thanks! the result looks great! |
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Description
The change consists of:
- Reduce the number of used threads for relatively small matrix
- Allocate partial sum array before parallel section
- Rearrange code to avoid additional Bp patching at the end
It brings COOTOCSR performance improvements. See results from r6i.16xlarge (Intel(R) Xeon(R) Platinum 8375C) below:
from the benchmark created from the patch on gtest patch with benchmark and run with ./build/runUnitTests --gtest_filter=*Bench*
"sorted by" columns with value "raw" means that input COO was sorted by row but not marked as sorted
Checklist
Please feel free to remove inapplicable items for your PR.
Changes