[Performance] Improve COOToCSR implementation #5508

anko-intel · 2023-03-29T08:29:58Z

Description

The change consists of:

Add tests if all Unsorted*COOtoCSR algorithms give the same result
Add separate 1-thread algorithm for small matrices.
Improve choosing algorithm according to measured performance
Use index type when possible
In UnSortedSparseCOOToCSR :
- Reduce the number of used threads for relatively small matrix
- Allocate partial sum array before parallel section
- Rearrange code to avoid additional Bp patching at the end

It brings COOTOCSR performance improvements. See results from r6i.16xlarge (Intel(R) Xeon(R) Platinum 8375C) below:

					before	after	before / after
index type	N	NNZ	1-sparsity	sort by	[us]	[us]	x
int32	256	3	4.58E-05	-	71	1	71.00
int32	256	3	4.58E-05	row	71	1	71.00
int64	256	3	4.58E-05	-	71	1	71.00
int64	256	3	4.58E-05	row	70	1	70.00
int32	256	123	0.00188	-	72	1	72.00
int32	256	123	0.00188	row	72	1	72.00
int64	256	123	0.00188	-	71	1	71.00
int64	256	123	0.00188	row	72	1	72.00
int32	1256	1099	0.000697	-	76	4	19.00
int32	1256	1099	0.000697	row	72	4	18.00
int64	1256	1099	0.000697	-	74	5	14.80
int64	1256	1099	0.000697	row	71	5	14.20
int32	5000	9000	0.00036	-	85	38	2.24
int32	5000	9000	0.00036	row	72	30	2.40
int64	5000	9000	0.00036	-	194	50	3.88
int64	5000	9000	0.00036	row	164	34	4.82
int32	2708	10556	0.00144	-	90	41	2.20
int32	2708	10556	0.00144	row	74	38	1.95
int64	2708	10556	0.00144	-	213	57	3.74
int64	2708	10556	0.00144	row	180	45	4.00
int32	6000	12000	0.000333	-	90	52	1.73
int32	6000	12000	0.000333	row	74	38	1.95
int64	6000	12000	0.000333	-	231	73	3.16
int64	6000	12000	0.000333	row	193	50	3.86
int32	5500	13000	0.00043	-	91	31	2.94
int32	5500	13000	0.00043	row	74	18	4.11
int64	5500	13000	0.00043	-	236	95	2.48
int64	5500	13000	0.00043	row	201	64	3.14
int32	1000	14702	0.0147	-	98	39	2.51
int32	1000	14702	0.0147	row	76	19	4.00
int64	1000	14702	0.0147	-	256	98	2.61
int64	1000	14702	0.0147	row	207	87	2.38
int32	15444	14781	6.20E-05	-	92	45	2.04
int32	15444	14781	6.20E-05	row	75	23	3.26
int64	15444	14781	6.20E-05	-	318	118	2.69
int64	15444	14781	6.20E-05	row	230	83	2.77
int32	1000	14814	0.0148	-	100	39	2.56
int32	1000	14814	0.0148	row	79	19	4.16
int64	1000	14814	0.0148	-	260	90	2.89
int64	1000	14814	0.0148	row	216	88	2.45
int32	15360	30222	0.000128	-	98	60	1.63
int32	15360	30222	0.000128	row	78	31	2.52
int64	15360	30222	0.000128	-	399	258	1.55
int64	15360	30222	0.000128	row	303	193	1.57
int32	15360	90222	0.000382	-	118	83	1.42
int32	15360	90222	0.000382	row	95	53	1.79
int64	15360	90222	0.000382	-	865	717	1.21
int64	15360	90222	0.000382	row	633	520	1.22
int32	15360	152305	0.000646	-	154	119	1.29
int32	15360	152305	0.000646	row	109	65	1.68
int64	15360	152305	0.000646	-	1067	921	1.16
int64	15360	152305	0.000646	row	814	705	1.15
int32	5000	500000	0.02	-	402	401	1.00
int32	5000	500000	0.02	row	214	209	1.02
int64	5000	500000	0.02	-	918	924	0.99
int64	5000	500000	0.02	row	544	550	0.99
int32	128532	640578	3.88E-05	-	558	532	1.05
int32	128532	640578	3.88E-05	row	245	164	1.49
int64	128532	640578	3.88E-05	-	1400	1325	1.06
int64	128532	640578	3.88E-05	row	766	650	1.18
int32	128413	639946	3.88E-05	-	580	540	1.07
int32	128413	639946	3.88E-05	row	237	165	1.44
int64	128413	639946	3.88E-05	-	649	679	0.96
int64	128413	639946	3.88E-05	row	296	242	1.22
int32	129408	644918	3.85E-05	-	570	523	1.09
int32	129408	644918	3.85E-05	row	242	155	1.56
int64	129408	644918	3.85E-05	-	2413	2487	0.97
int64	129408	644918	3.85E-05	row	1810	1798	1.01
int32	1000111	9001999	9.00E-06	-	21962	22156	0.99
int32	1000111	9001999	9.00E-06	row	11782	11856	0.99
int64	1000111	9001999	9.00E-06	-	37553	37188	1.01
int64	1000111	9001999	9.00E-06	row	19216	22790	0.84
int32	4847571	68993773	2.94E-06	-	223471	225930	0.99
int32	4847571	68993773	2.94E-06	row	66450	62281	1.07
int64	4847571	68993773	2.94E-06	-	330233	329032	1.00
int64	4847571	68993773	2.94E-06	row	121706	119325	1.02
int32	2449029	123718280	2.06E-05	-	518267	524534	0.99
int32	2449029	123718280	2.06E-05	row	106771	106131	1.01
int64	2449029	123718280	2.06E-05	-	585970	587434	1.00
int64	2449029	123718280	2.06E-05	row	150090	147763	1.02
int32	5111000	235100880	9.00E-06	-	1037472	1041652	1.00
int32	5111000	235100880	9.00E-06	row	219230	216778	1.01
int64	5111000	235100880	9.00E-06	-	1309228	1255315	1.04
int64	5111000	235100880	9.00E-06	row	421058	364921	1.15
int32	9111000	835100880	1.01E-05	-	3727657	3632701	1.03
int32	9111000	835100880	1.01E-05	row	618800	522001	1.19
int64	9111000	835100880	1.01E-05	-	4508602	4399552	1.02
int64	9111000	835100880	1.01E-05	row	814650	804458	1.01
int32	111000	835100880	0.0678	-	1873541	1871407	1.00
int32	111000	835100880	0.0678	row	250951	246375	1.02
int64	111000	835100880	0.0678	-	2607837	2592142	1.01
int64	111000	835100880	0.0678	row	476085	469125	1.01
int32	65608366	1806067135	4.20E-07	-	10932822	9200200	1.19
int32	65608366	1806067135	4.20E-07	row	1836820	1611311	1.14
int64	65608366	1806067135	4.20E-07	-	14934669	14869272	1.00
int64	65608366	1806067135	4.20E-07	row	3501914	3463551	1.01

from the benchmark created from the patch on gtest patch with benchmark and run with ./build/runUnitTests --gtest_filter=*Bench*
"sorted by" columns with value "raw" means that input COO was sorted by row but not marked as sorted

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
I've leverage the tools to beautify the python and c++ code.
The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referred in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot · 2023-03-29T08:30:25Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2023-03-29T08:45:52Z

Commit ID: d3db00c

Build ID: 1

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

dgl-bot · 2023-03-29T09:16:17Z

Commit ID: cf8f18b

Build ID: 2

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

anko-intel · 2023-03-29T09:44:44Z

@dgl-bot

dgl-bot · 2023-03-29T12:16:46Z

Commit ID: 53f3b18

Build ID: 3

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

dgl-bot · 2023-03-30T07:59:17Z

Commit ID: 8e21e6a

Build ID: 4

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

anko-intel · 2023-03-30T09:32:37Z

@dgl-bot run

anko-intel · 2023-03-30T09:56:05Z

@dgl-bot

dgl-bot · 2023-03-30T11:49:08Z

Commit ID: ae93c02

Build ID: 5

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

dgl-bot · 2023-04-03T19:31:54Z

Commit ID: ddb5a9d3b5bf76bf9c9e0b747ada7f8ea9b02d3e

Build ID: 6

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

dgl-bot · 2023-04-04T08:23:01Z

Commit ID: ae3efd6570f5f2d8c1329474d9ed1e1d77a1d6fe

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2023-04-04T11:20:49Z

Commit ID: 4b342f5

Build ID: 8

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

frozenbugs · 2023-04-06T02:27:04Z

Hi @anko-intel , please let us know when the PR is ready for review.

src/array/cpu/spmat_op_impl_coo.cc

dgl-bot · 2023-04-26T12:23:16Z

Commit ID: 0193a7e54dab01faf65c25f0febbb45a758c5890

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

bgawrych

If we don't want to have this ifdef branch querying omp_get_num_threads from inside parallel region would work for both cases. LGTM

src/array/cpu/spmat_op_impl_coo.cc

dgl-bot · 2023-04-28T14:35:52Z

Commit ID: 389924a

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2023-04-28T17:44:25Z

Commit ID: 6a8ae32

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

anko-intel · 2023-04-28T21:11:39Z

@frozenbugs, the PR is ready for review

+ remove COOToCSR checks which are intended to ensure that each version of algorythm is tested. But it doesn't check it

The change consists of: - Add separate 1 thread algorithm for small matrices. - Improve choosing algorithm according to measured performance - Use index type when possible - In UnSortedSparseCOOToCSR : - Lower number of used thread for relatively small matrix - Allocate partial sum array before parallel section - Rearrange code to avoid additional Bp patching at the end

dgl-bot · 2023-05-04T13:15:36Z

Commit ID: 4e7094a

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

anko-intel · 2023-05-05T06:10:55Z

@frozenbugs, @BarclayII, could you review and merge if there are no issues?

frozenbugs

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

src/array/cpu/spmat_op_impl_coo.cc

tests/cpp/test_spmat_coo.cc

src/array/cpu/spmat_op_impl_coo.cc

Rhett-Ying · 2023-05-06T03:05:15Z

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

No. The best way to measure the performance of this change is add c++-level benchmark but we don't have any such framework. As a workaround, we could add python-level benchmarks as here. we could measure within a python-level call which will call COOToCSR in the back. graph.create_formats() or graph.formats() could be an option.

Rhett-Ying · 2023-05-06T03:22:42Z

@dgl-bot run r6i.16xlarge bench_format dmlc/master

dgl-bot · 2023-05-06T03:23:15Z

Start the Regression test. View at https://dgl-jenkins-eksvpc-2136217999.us-west-2.elb.amazonaws.com/job/dgl/job/PR-5508/14/display/redirect

frozenbugs · 2023-05-06T03:31:42Z

@Rhett-Ying Do we have benchmark for COOToCSR? If not, can you guide @anko-intel to add a benchmark test?

No. The best way to measure the performance of this change is add c++-level benchmark but we don't have any such framework. As a workaround, we could add python-level benchmarks as here. we could measure within a python-level call which will call COOToCSR in the back. graph.create_formats() or graph.formats() could be an option.

We already have a benchmark testing COOToCSR, but it only covers 1 case of the 4 cases (sorted, unsortedSmall, unsortedSparse, unsortedDense). It is better to refactor and add the benchmark to our new framework directly. @Rhett-Ying how to add a new benchmark to our new framework?

Rhett-Ying · 2023-05-06T03:42:17Z

how to add to new benchmark framwork

just refer to existing benchmark and add it into https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks.

frozenbugs · 2023-05-06T03:55:42Z

As discussed w/ @Rhett-Ying offline, it is not easy to add the benchmark to our new framework directly, can you refactor the existing benchmark to cover all 4 cases, and we will migrate them to the new framework once it is ready.

dgl-bot · 2023-05-06T04:47:40Z

Finished the Regression test. Result table is at https://dgl-asv-data.s3-us-west-2.amazonaws.com/4e7094a953ce8c056858e5907de098cae9ac79a9_r6i16xlarge/results/result.csv. Jenkins job link is https://dgl-jenkins-eksvpc-2136217999.us-west-2.elb.amazonaws.com/job/dgl/job/PR-5508/14/display/redirect.

dgl-bot · 2023-05-06T04:47:52Z

Commit ID: 4e7094a

Build ID: 14

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

anko-intel

Thank you for your comments.
According to test:
@frozenbugs, I don't have access to https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks
Maybe a good starting point for c++ test could be the POC of benchmark made on top of gtest ( see https://gist.github.com/anko-intel/b9c4ab9d0eefd2fd496a8ea38a9a11db ) which I use for measurements.
I could try also to modify python benchmark to check performance of all algorithm versions.

src/array/cpu/spmat_op_impl_coo.cc

tests/cpp/test_spmat_coo.cc

src/array/cpu/spmat_op_impl_coo.cc

tests/cpp/test_spmat_coo.cc

frozenbugs · 2023-05-08T16:08:27Z

Thank you for your comments. According to test: @frozenbugs, I don't have access to https://github.com/dglai/DGL_scripts/tree/master/regression/benchmarks Maybe a good starting point for c++ test could be the POC of benchmark made on top of gtest ( see https://gist.github.com/anko-intel/b9c4ab9d0eefd2fd496a8ea38a9a11db ) which I use for measurements. I could try also to modify python benchmark to check performance of all algorithm versions.

Yes, please modify python benchmark to cover all cases.

dgl-bot · 2023-05-10T11:22:54Z

Commit ID: d4ed31cdc893390ee94ee702a9fd57a7137255f9

Build ID: 15

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

anko-intel · 2023-05-10T11:41:54Z

@frozenbugs, I modified the benchmark (b047751) to have results from all COOToCSR algorithm on CPU.

dataset	num nodes	num edges	alg on cpu	when number of available threads is >=
cora	2708	10556	small	14
pubmed	19717	88651	sparse	18
ogbn-arxiv	169343	1166243	sparse	28
livejournal	4847571	68993773	dense	2
friendster	65608366	1806067135	dense	2

On r6i.16xlarge (ami-0343fe6cfc8a09c18 - Ubuntu 20.04) the modified benchmark gives the following results:
(filtered to COO-> SCR|CSC only as these conversions use COOToCSR during time measurement)

dataset	sha:	`ea706ca`	`ccd463d`	before/after	algorithm
cora	('coo','csc')	0.000464917	0.000107054	4.34	small
cora	('coo','csr')	0.000449277	0.000108357	4.15	small
pubmed	('coo','csc')	0.000572143	0.000518983	1.10	sparse
pubmed	('coo','csr')	0.000416732	0.000460249	0.91	sparse
ogbn-arxiv	('coo','csc')	0.002656307	0.002500843	1.06	sparse
ogbn-arxiv	('coo','csr')	0.002499287	0.002125469	1.18	sparse
livejournal	('coo','csc')	0.368533011	0.370530003	0.99	dense
livejournal	('coo','csr')	0.222436191	0.21996336	1.01	dense
friendster	('coo','csc')	13.80159092	13.75633977	1.00	dense
friendster	('coo','csr')	1.096111846	1.089540845	1.01	sorted

Is this benchmark ok for you?

dgl-bot · 2023-05-10T12:04:28Z

Commit ID: d89cc24767e75ed1b5871d1a737d8db9adbb1fd1

Build ID: 16

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

benchmarks/benchmarks/api/bench_format_conversion.py

frozenbugs · 2023-05-10T17:04:42Z

@frozenbugs, I modified the benchmark (b047751) to have results from all COOToCSR algorithm on CPU.

dataset num nodes num edges alg on cpu when number of available threads is >=
cora 2708 10556 small 14
pubmed 19717 88651 sparse 18
ogbn-arxiv 169343 1166243 sparse 28
livejournal 4847571 68993773 dense 2
friendster 65608366 1806067135 dense 2
On r6i.16xlarge (ami-0343fe6cfc8a09c18 - Ubuntu 20.04) the modified benchmark gives the following results: (filtered to COO-> SCR|CSC only as these conversions use COOToCSR during time measurement)

dataset sha: ea706ca ccd463d before/after algorithm
cora ('coo','csc') 0.000464917 0.000107054 4.34 small
cora ('coo','csr') 0.000449277 0.000108357 4.15 small
pubmed ('coo','csc') 0.000572143 0.000518983 1.10 sparse
pubmed ('coo','csr') 0.000416732 0.000460249 0.91 sparse
ogbn-arxiv ('coo','csc') 0.002656307 0.002500843 1.06 sparse
ogbn-arxiv ('coo','csr') 0.002499287 0.002125469 1.18 sparse
livejournal ('coo','csc') 0.368533011 0.370530003 0.99 dense
livejournal ('coo','csr') 0.222436191 0.21996336 1.01 dense
friendster ('coo','csc') 13.80159092 13.75633977 1.00 dense
friendster ('coo','csr') 1.096111846 1.089540845 1.01 sorted
Is this benchmark ok for you?

Thanks! the result looks great!

Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>

anko-intel force-pushed the dev/anko_coo2csr_6 branch from cf8f18b to 53f3b18 Compare March 29, 2023 12:00

anko-intel force-pushed the dev/anko_coo2csr_6 branch from 8e21e6a to ae93c02 Compare March 30, 2023 11:08

anko-intel force-pushed the dev/anko_coo2csr_6 branch from 0dce646 to 4b342f5 Compare April 4, 2023 10:41

frozenbugs requested review from frozenbugs and czkkkkkk April 6, 2023 02:26

nv-dlasalle reviewed Apr 8, 2023

View reviewed changes

src/array/cpu/spmat_op_impl_coo.cc Show resolved Hide resolved

bgawrych approved these changes Apr 26, 2023

View reviewed changes

src/array/cpu/spmat_op_impl_coo.cc Show resolved Hide resolved

anko-intel force-pushed the dev/anko_coo2csr_6 branch from 389924a to 6a8ae32 Compare April 28, 2023 17:03

anko-intel added 5 commits May 4, 2023 13:28

[TESTS] Add test comparing all COOToCSR CPU alg versions

a33d9bb

+ remove COOToCSR checks which are intended to ensure that each version of algorythm is tested. But it doesn't check it

Fix issue with OMP on Windows

cca8887

Apply review suggestion

2e904e4

Fix potential OMP issue also on Linux

b94f8bc

frozenbugs reviewed May 5, 2023

View reviewed changes

Aplly review comments

ccd463d

anko-intel commented May 8, 2023

View reviewed changes

frozenbugs closed this May 8, 2023

frozenbugs reopened this May 8, 2023

Modyfy benchmark to cover all CPU versions of COOToCSR algorithm

b047751

anko-intel force-pushed the dev/anko_coo2csr_6 branch from d9082ec to b047751 Compare May 10, 2023 11:21

frozenbugs reviewed May 10, 2023

View reviewed changes

benchmarks/benchmarks/api/bench_format_conversion.py Outdated Show resolved Hide resolved

frozenbugs approved these changes May 10, 2023

View reviewed changes

frozenbugs added 2 commits May 11, 2023 01:05

Update benchmarks/benchmarks/api/bench_format_conversion.py

7509609

Merge branch 'master' into dev/anko_coo2csr_6

614a25c

frozenbugs merged commit e0d2250 into dmlc:master May 10, 2023

anko-intel deleted the dev/anko_coo2csr_6 branch May 11, 2023 06:32

Rhett-Ying pushed a commit that referenced this pull request Jun 20, 2023

[Performance] Improve COOToCSR implementation (#5508)

5619dfc

Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>

DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024

[Performance] Improve COOToCSR implementation (dmlc#5508)

aa4796b

Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>

[Performance] Improve COOToCSR implementation #5508

[Performance] Improve COOToCSR implementation #5508

Conversation

anko-intel commented Mar 29, 2023 • edited Loading

Description

Checklist

Changes

dgl-bot commented Mar 29, 2023

dgl-bot commented Mar 29, 2023

dgl-bot commented Mar 29, 2023

anko-intel commented Mar 29, 2023

dgl-bot commented Mar 29, 2023

dgl-bot commented Mar 30, 2023

anko-intel commented Mar 30, 2023

anko-intel commented Mar 30, 2023

dgl-bot commented Mar 30, 2023

dgl-bot commented Apr 3, 2023

dgl-bot commented Apr 4, 2023

dgl-bot commented Apr 4, 2023

frozenbugs commented Apr 6, 2023

dgl-bot commented Apr 26, 2023

bgawrych left a comment

Choose a reason for hiding this comment

dgl-bot commented Apr 28, 2023

dgl-bot commented Apr 28, 2023

anko-intel commented Apr 28, 2023

dgl-bot commented May 4, 2023

anko-intel commented May 5, 2023

frozenbugs left a comment

Choose a reason for hiding this comment

Rhett-Ying commented May 6, 2023

Rhett-Ying commented May 6, 2023

dgl-bot commented May 6, 2023

frozenbugs commented May 6, 2023

Rhett-Ying commented May 6, 2023 • edited Loading

frozenbugs commented May 6, 2023 • edited Loading

dgl-bot commented May 6, 2023

dgl-bot commented May 6, 2023

anko-intel left a comment • edited Loading

Choose a reason for hiding this comment

frozenbugs commented May 8, 2023

dgl-bot commented May 10, 2023

anko-intel commented May 10, 2023 • edited Loading

dgl-bot commented May 10, 2023

frozenbugs commented May 10, 2023

anko-intel commented Mar 29, 2023 •

edited

Loading

Rhett-Ying commented May 6, 2023 •

edited

Loading

frozenbugs commented May 6, 2023 •

edited

Loading

anko-intel left a comment •

edited

Loading

anko-intel commented May 10, 2023 •

edited

Loading