## Using [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl)
The function `pmapreduce` performs a parallel `mapreduce`. This is primarily useful when the mapped function performs an expensive calculation, that is, when the evaluation time per core exceeds the setup and communication time. It is also useful when the arrays involved would not collectively fit into the memory of a single node, so that each worker operates only on its local part, as is often the case on a cluster.
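As a quick, hypothetical illustration of the call signature (this toy snippet is not part of the original walkthrough, and the workload is far too cheap to actually benefit from parallelism): `pmapreduce(f, op, iterator)` maps `f` over the iterator on the available workers and reduces the mapped values with `op`.

```julia
using Distributed
addprocs(2)                     # in practice workers are added through a cluster manager
@everywhere using ParallelUtilities

# Each worker maps its part of the range; the partial results are reduced
# with + and returned to the calling process.
s = pmapreduce(x -> ones(Int, 3) .* x, +, 1:10)
s == sum(x -> ones(Int, 3) .* x, 1:10)    # true
```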
We walk through an example where we initialize and concatenate arrays in serial and in parallel.
```julia
using ParallelUtilities
using Distributed
```
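The walkthrough assumes that worker processes spanning the cluster have already been added. A hedged sketch of one way to do this with `ClusterManagers.jl` on a Slurm cluster follows; the worker count and the use of `addprocs_slurm` here are illustrative assumptions, not necessarily how the timed runs were launched.

```julia
using ClusterManagers

# Request 3 nodes x 28 cores = 84 workers through Slurm, matching the setup
# used for the timings below.
addprocs_slurm(3 * 28)

# Load the package on every worker.
@everywhere using ParallelUtilities
```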
We define the function that performs the initialization on each core. This step is embarrassingly parallel, as no communication happens between workers.
```julia
function initialize(x, n)
    # Split the n indices evenly among the workers; the first r workers
    # receive one extra index each.
    inds = 1:n
    d, r = divrem(length(inds), nworkers())
    ninds_local = d + (x <= r)

    # Initialize the local block of values.
    A = zeros(Int, 50, ninds_local)
    for ind in eachindex(A)
        A[ind] = ind
    end
    return A
end
```
Next we define the function that calls `pmapreduce`:
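The definition is not reproduced here; a minimal sketch of the parallel caller, together with the serial counterpart used for comparison, might look like the following (the names `mapreduce_serial` and `mapreduce_parallel` are taken from the timing function below, while the bodies are assumptions):

```julia
# Serial baseline: each "rank" x in 1:nworkers() initializes its block on the
# calling process, and the blocks are concatenated with hcat.
mapreduce_serial(n) = mapreduce(x -> initialize(x, n), hcat, 1:nworkers())

# Parallel version: the same map is split across the workers by pmapreduce,
# and the hcat reduction is performed in parallel.
mapreduce_parallel(n) = pmapreduce(x -> initialize(x, n), hcat, 1:nworkers())
```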
We compare the performance of the distributed `for` loop and the parallel mapreduce using `3` nodes with `28` cores on each node.

We define a caller function first:
```julia
function compare_with_serial()
    # precompile
    mapreduce_serial(1)
    mapreduce_parallel(nworkers())

    # time
    n = 2_000_000
    println("Testing serial mapreduce")
    A = @time mapreduce_serial(n)
    println("Testing pmapreduce")
    B = @time mapreduce_parallel(n)

    # check results
    println("Results match : ", A == B)
end
```
We run this caller on the cluster:
```console
Testing serial mapreduce
23.986976 seconds (8.26 k allocations: 30.166 GiB, 11.71% gc time, 0.02% compilation time)
Testing pmapreduce
7.465366 seconds (29.55 k allocations: 764.166 MiB)
Results match : true
```
In this case the overall gain is only around a factor of `3`. In general a parallel mapreduce is advantageous if the time required to evaluate the function far exceeds that required to communicate across workers.
The time required for a `@distributed` `for` loop is unfortunately too high for it to be practical here.
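For reference, a hypothetical sketch of such a `@distributed` variant (the exact form used for the comparison is in the full script):

```julia
function mapreduce_distributed_for(n)
    # Each iteration runs on some worker; the resulting blocks are reduced with hcat.
    @distributed (hcat) for x in 1:nworkers()
        initialize(x, n)
    end
end
```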
The full script may be found in the examples directory.
## Using [MPIClusterManagers.jl](https://github.com/JuliaParallel/MPIClusterManagers.jl)
The same script may also be run by initiating an MPI cluster (in this case with 77 workers + 1 master process). This leads to the timings shown below.
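A hedged sketch of how such an MPI-backed cluster might be initiated with `MPIClusterManagers.jl` when the script is launched under `mpirun` (the surrounding details are assumptions):

```julia
using MPIClusterManagers, Distributed

# All MPI ranks run this script; rank 0 becomes the Julia master process and
# the remaining ranks become workers that communicate over MPI.
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

@everywhere using ParallelUtilities
compare_with_serial()    # the caller defined above

MPIClusterManagers.stop_main_loop(manager)
```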
```console
Using MPI_TRANSPORT_ALL
Testing serial mapreduce
22.263389 seconds (8.07 k allocations: 29.793 GiB, 11.70% gc time, 0.02% compilation time)
Testing pmapreduce
11.374551 seconds (65.92 k allocations: 2.237 GiB, 0.46% gc time)
Results match : true
```
The performance is worse in this case than that obtained using `ClusterManagers.jl`.
## `docs/src/examples/sharedarrays.md`
```julia
using SharedArrays
using Distributed
```
We create a function to initialize the local part on each worker. In this case we simulate a heavy workload by adding a `sleep` period; in other words, we assume that the individual elements of the array are expensive to evaluate. We obtain the local indices of the `SharedArray` through the function `localindices`, which lets us split the load among the workers.
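A hedged sketch of such an initialization function (the name, array size, and sleep interval below are assumptions):

```julia
using Distributed, SharedArrays

# Fill the part of the shared array owned by the calling worker. Each element
# is made artificially expensive to compute through a sleep.
function initialize_localpart!(s::SharedArray, sleeptime)
    for ind in localindices(s)
        sleep(sleeptime)
        s[ind] = ind
    end
    return s
end

# Usage: every worker fills its own local block of the same shared array.
# s = SharedArray{Int}(10_000)
# @sync for p in workers()
#     @async remotecall_wait(initialize_localpart!, p, s, 1e-3)
# end
```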
## `docs/src/examples/threads.md`
We create a function to initialize the local part on each worker. In this case we simulate a heavy workload by adding a `sleep` period for each element.
```julia
function initializenode_threads(sleeptime)
    s = zeros(Int, 5_000)
    # Multithreaded loop over the array; each element involves an artificial
    # sleep to mimic an expensive evaluation.
    Threads.@threads for ind in eachindex(s)
        sleep(sleeptime)
        s[ind] = ind
    end
    return s
end
```
We create a main function that runs on the calling process and launches the array initialization task on each node. The array creation step on each node is followed by an eventual concatenation.
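The definitions are not reproduced here. A hedged sketch, reusing the one-worker-per-node pool returned by `ParallelUtilities.workerpool_nodes()` and the names that appear in the timing function below (the exact bodies are assumptions), might be:

```julia
using Distributed, ParallelUtilities

# Baseline: run the threaded initialization once per node, serially on the
# calling process, and concatenate the blocks.
function mapreduce_threads(sleeptime)
    pool = ParallelUtilities.workerpool_nodes()    # one worker per node
    mapreduce(x -> initializenode_threads(sleeptime), hcat, 1:nworkers(pool))
end

# @distributed reduction: each iteration runs the threaded initialization on
# some worker, and the resulting blocks are reduced with hcat.
function mapreduce_distributed_threads(sleeptime)
    nw_nodes = nworkers(ParallelUtilities.workerpool_nodes())
    @distributed (hcat) for x in 1:nw_nodes
        initializenode_threads(sleeptime)
    end
end

# pmapreduce over the node-level pool.
function pmapreduce_threads(sleeptime)
    pool = ParallelUtilities.workerpool_nodes()
    pmapreduce(x -> initializenode_threads(sleeptime), hcat, pool, 1:nworkers(pool))
end
```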
We create a function to compare the performance of these approaches. We start with a precompilation run before timing the evaluations:
```julia
function compare_with_serial()
    # precompile
    mapreduce_threads(0)
    mapreduce_distributed_threads(0)
    pmapreduce_threads(0)

    # time
    sleeptime = 1e-2
    println("Testing threaded mapreduce")
    A = @time mapreduce_threads(sleeptime);
    println("Testing threaded+distributed mapreduce")
    B = @time mapreduce_distributed_threads(sleeptime);
    println("Testing threaded pmapreduce")
    C = @time pmapreduce_threads(sleeptime);

    println("Results match : ", A == B == C)
end
```
We run this script on a Slurm cluster across 2 nodes with 28 cores on each node. The results are:
```console
Testing threaded mapreduce
4.161118 seconds (66.27 k allocations: 2.552 MiB, 0.95% compilation time)
Testing threaded+distributed mapreduce
2.232924 seconds (48.64 k allocations: 2.745 MiB, 3.20% compilation time)
Testing threaded pmapreduce
2.432104 seconds (6.79 k allocations: 463.788 KiB, 0.44% compilation time)
Results match : true
```
We see that there is little difference in evaluation times between the `@distributed` reduction and `pmapreduce`, both of which are roughly twice as fast as the one-node evaluation.
The full script along with the Slurm jobscript may be found in the examples directory.