Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance] Improve COOToCSR implementation #5508

Merged
merged 10 commits into from
May 10, 2023

Conversation

anko-intel
Copy link
Collaborator

@anko-intel anko-intel commented Mar 29, 2023

Description

The change consists of:

  • Add tests if all Unsorted*COOtoCSR algorithms give the same result
  • Add separate 1-thread algorithm for small matrices.
  • Improve choosing algorithm according to measured performance
  • Use index type when possible
  • In UnSortedSparseCOOToCSR :
    - Reduce the number of used threads for relatively small matrix
    - Allocate partial sum array before parallel section
    - Rearrange code to avoid additional Bp patching at the end

It brings COOTOCSR performance improvements. See results from r6i.16xlarge (Intel(R) Xeon(R) Platinum 8375C) below:

        before after before / after
index type N NNZ 1-sparsity sort by [us] [us] x
int32 256 3 4.58E-05 - 71 1 71.00
int32 256 3 4.58E-05 row 71 1 71.00
int64 256 3 4.58E-05 - 71 1 71.00
int64 256 3 4.58E-05 row 70 1 70.00
int32 256 123 0.00188 - 72 1 72.00
int32 256 123 0.00188 row 72 1 72.00
int64 256 123 0.00188 - 71 1 71.00
int64 256 123 0.00188 row 72 1 72.00
int32 1256 1099 0.000697 - 76 4 19.00
int32 1256 1099 0.000697 row 72 4 18.00
int64 1256 1099 0.000697 - 74 5 14.80
int64 1256 1099 0.000697 row 71 5 14.20
int32 5000 9000 0.00036 - 85 38 2.24
int32 5000 9000 0.00036 row 72 30 2.40
int64 5000 9000 0.00036 - 194 50 3.88
int64 5000 9000 0.00036 row 164 34 4.82
int32 2708 10556 0.00144 - 90 41 2.20
int32 2708 10556 0.00144 row 74 38 1.95
int64 2708 10556 0.00144 - 213 57 3.74
int64 2708 10556 0.00144 row 180 45 4.00
int32 6000 12000 0.000333 - 90 52 1.73
int32 6000 12000 0.000333 row 74 38 1.95
int64 6000 12000 0.000333 - 231 73 3.16
int64 6000 12000 0.000333 row 193 50 3.86
int32 5500 13000 0.00043 - 91 31 2.94
int32 5500 13000 0.00043 row 74 18 4.11
int64 5500 13000 0.00043 - 236 95 2.48
int64 5500 13000 0.00043 row 201 64 3.14
int32 1000 14702 0.0147 - 98 39 2.51
int32 1000 14702 0.0147 row 76 19 4.00
int64 1000 14702 0.0147 - 256 98 2.61
int64 1000 14702 0.0147 row 207 87 2.38
int32 15444 14781 6.20E-05 - 92 45 2.04
int32 15444 14781 6.20E-05 row 75 23 3.26
int64 15444 14781 6.20E-05 - 318 118 2.69
int64 15444 14781 6.20E-05 row 230 83 2.77
int32 1000 14814 0.0148 - 100 39 2.56
int32 1000 14814 0.0148 row 79 19 4.16
int64 1000 14814 0.0148 - 260 90 2.89
int64 1000 14814 0.0148 row 216 88 2.45
int32 15360 30222 0.000128 - 98 60 1.63
int32 15360 30222 0.000128 row 78 31 2.52
int64 15360 30222 0.000128 - 399 258 1.55
int64 15360 30222 0.000128 row 303 193 1.57
int32 15360 90222 0.000382 - 118 83 1.42
int32 15360 90222 0.000382 row 95 53 1.79
int64 15360 90222 0.000382 - 865 717 1.21
int64 15360 90222 0.000382 row 633 520 1.22
int32 15360 152305 0.000646 - 154 119 1.29
int32 15360 152305 0.000646 row 109 65 1.68
int64 15360 152305 0.000646 - 1067 921 1.16
int64 15360 152305 0.000646 row 814 705 1.15
int32 5000 500000 0.02 - 402 401 1.00
int32 5000 500000 0.02 row 214 209 1.02
int64 5000 500000 0.02 - 918 924 0.99
int64 5000 500000 0.02 row 544 550 0.99
int32 128532 640578 3.88E-05 - 558 532 1.05
int32 128532 640578 3.88E-05 row 245 164 1.49
int64 128532 640578 3.88E-05 - 1400 1325 1.06
int64 128532 640578 3.88E-05 row 766 650 1.18
int32 128413 639946 3.88E-05 - 580 540 1.07
int32 128413 639946 3.88E-05 row 237 165 1.44
int64 128413 639946 3.88E-05 - 649 679 0.96
int64 128413 639946 3.88E-05 row 296 242 1.22
int32 129408 644918 3.85E-05 - 570 523 1.09
int32 129408 644918 3.85E-05 row 242 155 1.56
int64 129408 644918 3.85E-05 - 2413 2487 0.97
int64 129408 644918 3.85E-05 row 1810 1798 1.01
int32 1000111 9001999 9.00E-06 - 21962 22156 0.99
int32 1000111 9001999 9.00E-06 row 11782 11856 0.99
int64 1000111 9001999 9.00E-06 - 37553 37188 1.01
int64 1000111 9001999 9.00E-06 row 19216 22790 0.84
int32 4847571 68993773 2.94E-06 - 223471 225930 0.99
int32 4847571 68993773 2.94E-06 row 66450 62281 1.07
int64 4847571 68993773 2.94E-06 - 330233 329032 1.00
int64 4847571 68993773 2.94E-06 row 121706 119325 1.02
int32 2449029 123718280 2.06E-05 - 518267 524534 0.99
int32 2449029 123718280 2.06E-05 row 106771 106131 1.01
int64 2449029 123718280 2.06E-05 - 585970 587434 1.00
int64 2449029 123718280 2.06E-05 row 150090 147763 1.02
int32 5111000 235100880 9.00E-06 - 1037472 1041652 1.00
int32 5111000 235100880 9.00E-06 row 219230 216778 1.01
int64 5111000 235100880 9.00E-06 - 1309228 1255315 1.04
int64 5111000 235100880 9.00E-06 row 421058 364921 1.15
int32 9111000 835100880 1.01E-05 - 3727657 3632701 1.03
int32 9111000 835100880 1.01E-05 row 618800 522001 1.19
int64 9111000 835100880 1.01E-05 - 4508602 4399552 1.02
int64 9111000 835100880 1.01E-05 row 814650 804458 1.01
int32 111000 835100880 0.0678 - 1873541 1871407 1.00
int32 111000 835100880 0.0678 row 250951 246375 1.02
int64 111000 835100880 0.0678 - 2607837 2592142 1.01
int64 111000 835100880 0.0678 row 476085 469125 1.01
int32 65608366 1806067135 4.20E-07 - 10932822 9200200 1.19
int32 65608366 1806067135 4.20E-07 row 1836820 1611311 1.14
int64 65608366 1806067135 4.20E-07 - 14934669 14869272 1.00
int64 65608366 1806067135 4.20E-07 row 3501914 3463551 1.01

from the benchmark created from the patch on gtest patch with benchmark and run with ./build/runUnitTests --gtest_filter=*Bench*
"sorted by" columns with value "raw" means that input COO was sorted by row but not marked as sorted

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • I've leverage the tools to beautify the python and c++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 29, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 29, 2023

Commit ID: d3db00c

Build ID: 1

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 29, 2023

Commit ID: cf8f18b

Build ID: 2

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@anko-intel
Copy link
Collaborator Author

@dgl-bot

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 29, 2023

Commit ID: 53f3b18

Build ID: 3

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 30, 2023

Commit ID: 8e21e6a

Build ID: 4

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@anko-intel
Copy link
Collaborator Author

@dgl-bot run

@anko-intel
Copy link
Collaborator Author

@dgl-bot

@dgl-bot
Copy link
Collaborator

dgl-bot commented Mar 30, 2023

Commit ID: ae93c02

Build ID: 5

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 3, 2023

Commit ID: ddb5a9d3b5bf76bf9c9e0b747ada7f8ea9b02d3e

Build ID: 6

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 4, 2023

Commit ID: ae3efd6570f5f2d8c1329474d9ed1e1d77a1d6fe

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@anko-intel anko-intel force-pushed the dev/anko_coo2csr_6 branch from 0dce646 to 4b342f5 Compare April 4, 2023 10:41
@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 4, 2023

Commit ID: 4b342f5

Build ID: 8

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

@frozenbugs
Copy link
Collaborator

Hi @anko-intel , please let us know when the PR is ready for review.

@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 26, 2023

Commit ID: 0193a7e54dab01faf65c25f0febbb45a758c5890

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

Copy link

@bgawrych bgawrych left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want to have this ifdef branch querying omp_get_num_threads from inside parallel region would work for both cases. LGTM

@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 28, 2023

Commit ID: 389924a

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Copy link
Collaborator

dgl-bot commented Apr 28, 2023

Commit ID: 6a8ae32

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@anko-intel
Copy link
Collaborator Author

@frozenbugs, the PR is ready for review

anko-intel added 5 commits May 4, 2023 13:28
+ remove COOToCSR checks which are  intended to ensure that each version of algorythm is
tested. But it doesn't check it
The change consists of:
- Add separate 1 thread algorithm for small matrices.
- Improve choosing algorithm according to measured performance
- Use index type when possible
- In UnSortedSparseCOOToCSR :
  - Lower number of used thread for relatively small matrix
  - Allocate partial sum  array before parallel section
  - Rearrange code to avoid additional Bp patching at the end
@dgl-bot
Copy link
Collaborator

dgl-bot commented May 4, 2023

Commit ID: 4e7094a

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@anko-intel
Copy link
Collaborator Author

@frozenbugs, @BarclayII, could you review and merge if there are no issues?

Copy link
Collaborator

@frozenbugs frozenbugs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

@Rhett-Ying
Copy link
Collaborator

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

No. The best way to measure the performance of this change is add c++-level benchmark but we don't have any such framework. As a workaround, we could add python-level benchmarks as here. we could measure within a python-level call which will call COOToCSR in the back. graph.create_formats() or graph.formats() could be an option.

@Rhett-Ying
Copy link
Collaborator

@dgl-bot run r6i.16xlarge bench_format dmlc/master

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 6, 2023

@frozenbugs
Copy link
Collaborator

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

No. The best way to measure the performance of this change is add c++-level benchmark but we don't have any such framework. As a workaround, we could add python-level benchmarks as here. we could measure within a python-level call which will call COOToCSR in the back. graph.create_formats() or graph.formats() could be an option.

We already have a benchmark testing COOToCSR, but it only covers 1 case of the 4 cases (sorted, unsortedSmall, unsortedSparse, unsortedDense). It is better to refactor and add the benchmark to our new framework directly. @Rhett-Ying how to add a new benchmark to our new framework?

@Rhett-Ying
Copy link
Collaborator

Rhett-Ying commented May 6, 2023

how to add to new benchmark framwork

just refer to existing benchmark and add it into https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks.

@frozenbugs
Copy link
Collaborator

frozenbugs commented May 6, 2023

As discussed w/ @Rhett-Ying offline, it is not easy to add the benchmark to our new framework directly, can you refactor the existing benchmark to cover all 4 cases, and we will migrate them to the new framework once it is ready.

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 6, 2023

Commit ID: 4e7094a

Build ID: 14

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

Copy link
Collaborator Author

@anko-intel anko-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comments.
According to test:
@frozenbugs, I don't have access to https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks
Maybe a good starting point for c++ test could be the POC of benchmark made on top of gtest ( see https://gist.github.com/anko-intel/b9c4ab9d0eefd2fd496a8ea38a9a11db ) which I use for measurements.
I could try also to modify python benchmark to check performance of all algorithm versions.

@frozenbugs frozenbugs closed this May 8, 2023
@frozenbugs frozenbugs reopened this May 8, 2023
@frozenbugs
Copy link
Collaborator

Thank you for your comments. According to test: @frozenbugs, I don't have access to https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks Maybe a good starting point for c++ test could be the POC of benchmark made on top of gtest ( see https://gist.github.com/anko-intel/b9c4ab9d0eefd2fd496a8ea38a9a11db ) which I use for measurements. I could try also to modify python benchmark to check performance of all algorithm versions.

Yes, please modify python benchmark to cover all cases.

@anko-intel anko-intel force-pushed the dev/anko_coo2csr_6 branch from d9082ec to b047751 Compare May 10, 2023 11:21
@dgl-bot
Copy link
Collaborator

dgl-bot commented May 10, 2023

Commit ID: d4ed31cdc893390ee94ee702a9fd57a7137255f9

Build ID: 15

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@anko-intel
Copy link
Collaborator Author

anko-intel commented May 10, 2023

@frozenbugs, I modified the benchmark (b047751) to have results from all COOToCSR algorithm on CPU.

dataset num nodes num edges alg on cpu when number of available threads is >=
cora 2708 10556 small 14
pubmed 19717 88651 sparse 18
ogbn-arxiv 169343 1166243 sparse 28
livejournal 4847571 68993773 dense 2
friendster 65608366 1806067135 dense 2

On r6i.16xlarge (ami-0343fe6cfc8a09c18 - Ubuntu 20.04) the modified benchmark gives the following results:
(filtered to COO-> SCR|CSC only as these conversions use COOToCSR during time measurement)

dataset sha: ea706ca ccd463d before/after algorithm
cora ('coo','csc') 0.000464917 0.000107054 4.34 small
cora ('coo','csr') 0.000449277 0.000108357 4.15 small
pubmed ('coo','csc') 0.000572143 0.000518983 1.10 sparse
pubmed ('coo','csr') 0.000416732 0.000460249 0.91 sparse
ogbn-arxiv ('coo','csc') 0.002656307 0.002500843 1.06 sparse
ogbn-arxiv ('coo','csr') 0.002499287 0.002125469 1.18 sparse
livejournal ('coo','csc') 0.368533011 0.370530003 0.99 dense
livejournal ('coo','csr') 0.222436191 0.21996336 1.01 dense
friendster ('coo','csc') 13.80159092 13.75633977 1.00 dense
friendster ('coo','csr') 1.096111846 1.089540845 1.01 sorted

Is this benchmark ok for you?

@dgl-bot
Copy link
Collaborator

dgl-bot commented May 10, 2023

Commit ID: d89cc24767e75ed1b5871d1a737d8db9adbb1fd1

Build ID: 16

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs
Copy link
Collaborator

@frozenbugs, I modified the benchmark (b047751) to have results from all COOToCSR algorithm on CPU.

dataset num nodes num edges alg on cpu when number of available threads is >=
cora 2708 10556 small 14
pubmed 19717 88651 sparse 18
ogbn-arxiv 169343 1166243 sparse 28
livejournal 4847571 68993773 dense 2
friendster 65608366 1806067135 dense 2
On r6i.16xlarge (ami-0343fe6cfc8a09c18 - Ubuntu 20.04) the modified benchmark gives the following results: (filtered to COO-> SCR|CSC only as these conversions use COOToCSR during time measurement)

dataset sha: ea706ca ccd463d before/after algorithm
cora ('coo','csc') 0.000464917 0.000107054 4.34 small
cora ('coo','csr') 0.000449277 0.000108357 4.15 small
pubmed ('coo','csc') 0.000572143 0.000518983 1.10 sparse
pubmed ('coo','csr') 0.000416732 0.000460249 0.91 sparse
ogbn-arxiv ('coo','csc') 0.002656307 0.002500843 1.06 sparse
ogbn-arxiv ('coo','csr') 0.002499287 0.002125469 1.18 sparse
livejournal ('coo','csc') 0.368533011 0.370530003 0.99 dense
livejournal ('coo','csr') 0.222436191 0.21996336 1.01 dense
friendster ('coo','csc') 13.80159092 13.75633977 1.00 dense
friendster ('coo','csr') 1.096111846 1.089540845 1.01 sorted
Is this benchmark ok for you?

Thanks! the result looks great!

@frozenbugs frozenbugs merged commit e0d2250 into dmlc:master May 10, 2023
@anko-intel anko-intel deleted the dev/anko_coo2csr_6 branch May 11, 2023 06:32
Rhett-Ying pushed a commit that referenced this pull request Jun 20, 2023
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants