cksun-usc/benchm-dplyr-dt
Simple/basic/limited/incomplete benchmark for dplyr and data.table

For parameters n = 10M, 100M and m = 100, 10K, 1M, create data.frames

d <- data.frame(x = sample(m, n, replace=TRUE), y = runif(n))
dm <- data.frame(x = sample(m))

and corresponding data.tables with and without a key on x (d's size in RAM is around 100MB and 1GB for the two values of n, respectively).
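For reference, a minimal sketch of how the corresponding data.tables could be built (the exact code is in bm.Rmd; dtk is an illustrative name for the keyed copy):

library(data.table)
dt  <- as.data.table(d)     # data.table version of d, no key
dtk <- as.data.table(d)
setkey(dtk, x)              # keyed version: pre-sorted in place by x
dtm <- as.data.table(dm)    # data.table version of the small lookup table dm
setkey(dtm, x)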

The basic tabular operations (filter, aggregate, join etc.) are applied using base, dplyr (with data.frame and data.table backends, with and without key for data.table) and standard data.table (with and without key).
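A rough sketch of how the dplyr sources could be set up (assuming the tbl_df()/tbl_dt() wrappers shipped with dplyr 0.3; d_df and d_dt are illustrative names, the actual setup is in bm.Rmd):

library(dplyr)
d_df <- tbl_df(d)    # dplyr on a data.frame source
d_dt <- tbl_dt(dt)   # dplyr on a data.table source (keyed or unkeyed)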

This is just a simple/basic/limited/incomplete benchmark; more could be done with various data types (e.g. character), several grouping variables (x1, x2, ...), more values for the size parameters (n, m), different distributions of values in the data.frames, etc. (or with real-world datasets).
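For example, a variation with a character key and a second grouping variable might look like this (purely illustrative, not part of the benchmark):

d2 <- data.frame(x1 = sample(letters, n, replace = TRUE),   # character grouping variable
                 x2 = sample(m, n, replace = TRUE),          # second (integer) grouping variable
                 y  = runif(n),
                 stringsAsFactors = FALSE)                   # keep x1 as character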

Filter
d[d$x>=10 & d$x<20,]
d %>% filter(x>=10, x<20)
dt[x>=10 & x<20]
Sort
d[order(d$x),]
d %>% arrange(x)
dt[order(x)]
New column
d$y2 <- 2*d$y
d %>% mutate(y2 = 2*y)
dt[,y2 := 2*y]
Aggregation
tapply(d$y, d$x, mean)
d %>% group_by(x) %>% summarize(ym = mean(y))
dt[, mean(y), by=x]
Join
merge(d, dm, by="x")
d %>% inner_join(dm, by="x")
dt[dtm, nomatch=0]
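The timings come from bm.Rmd; as a minimal, hypothetical sketch of the kind of measurement involved (with dplyr and data.table loaded: time each variant once, then normalize to the fastest):

times <- c(base  = system.time(d[d$x >= 10 & d$x < 20, ])["elapsed"],
           dplyr = system.time(d %>% filter(x >= 10, x < 20))["elapsed"],
           dt    = system.time(dt[x >= 10 & x < 20])["elapsed"])
round(times / min(times), 1)   # relative running times, lower is better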

Results

Full code is in bm.Rmd, and results for each n,m are in the bm-nxx-mxx.md files in the repo. The latest CRAN versions of R, dplyr and data.table were used (R 3.1.1, dplyr 0.3.0.2 and data.table 1.9.4). A summary of the results (relative running times, lower is better):

             base    dplyr-df   dplyr-dt   dplyr-dt-k   dt       dt-k
Filter       2       1          1          1            1        1
Sort         30-60   20-30      1.5-3      1            1.5-3    1
New column   1       1          6          4            4        1
Aggregation  8-100   4-30       4-6        1.5          1.5-5    1
Join         >100    4-15       4-6        1.5-2.5      -        1

(the larger numbers are usually for larger m, i.e. lots of small groups)

Discussion:
  • Having a key (which for data.table means having the data pre-sorted in place) obviously helps with sorting, aggregation and joins (though depending on the use case, the time to generate the key should be added to the timings; see the sketch after this list)

  • dplyr with a data.table backend/source is almost as fast as plain data.table (because in this case dplyr acts as a wrapper and calls data.table functions behind the scenes) - so you can kind of have both: the dplyr API (my personal preference) and the speed

  • dplyr with a data.frame source is slower than data.table for sort, aggregation and joins. Some of this apparently has to do with radix sort and binary-search joins (data.table) being faster than hash-table based joins (dplyr), as described here, but some of it is likely to improve, as Hadley said here.

  • Defining a new column in data.table (or in dplyr with the data.table backend) is slower. I pointed this out to the data.table developers Matt and Arun, and it can be fixed. The extra slowdown in creating a new column with dplyr on a data.table source (vs plain data.table) can also be fixed.
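On the first point, a minimal sketch of measuring the key-creation cost separately (dtk2 is an illustrative fresh copy; dtm is the data.table version of dm as in the join above):

dtk2 <- as.data.table(d)             # fresh unkeyed copy
system.time(setkey(dtk2, x))         # time to generate the key: pre-sorts in place by x
system.time(dtk2[dtm, nomatch = 0])  # keyed join; add the setkey time if the key exists only for this step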

More info:

I gave a short 15-min talk at the LA R meetup about dplyr, and I talked about these results as well; slides are here.

There are several other benchmarks, for example Matt's benchmark of group-by, or Brodie Gaslam's benchmark of group-by and mutate. My goal was to look at a wider range of operations (while keeping the work minimal, so I had to concentrate on a few cases), and I also wanted to understand the reasons for the performance differences; in this respect I'd like to thank the developers for the useful pointers.

Python's pandas:

Besides R, Python is almost as widely used for data analysis nowadays (see how the two dominate the DataScience.LA data science toolbox survey).

It looks like Python's pandas (0.15.1) is slower than data.table for both aggregates and joins (contrary to measurements/claims from almost 3 years ago). For example, for n = 10M and m = 1M the runtimes are (in seconds, lower is better):

                   pandas   data.table
Aggregate          1.5      1
Aggregate (keys)   0.4      0.2
Join               5.9      -
Join (keys)        2.1      0.5
Creating keys      3.7      0.7
