New data frame #2994

thomasp85 · 2018-11-12T14:25:23Z

This is a very sweeping PR that changes almost all data.frame() and as.data.frame() calls to use the included new_data_frame() constructor.

Unit tests and examples do not show any problems with this, but it is possible that a reverse dependency check will surface some places where this PR introduces errors, so we have to look out for that (or run a revdep check on this branch before merging).


clauswilke · 2018-11-12T15:18:14Z

I'm all on board with avoiding recycling in principle, but it seems to me that giving up on the ability to construct data frames from constants and vectors is a big loss and puts a lot of extra mental requirements on the code writer. I think this will almost certainly result in new bugs at some point.

Would it be possible/make sense to rewrite the constructor with Rcpp and do a basic sanity check (all input vectors are the same length or constants) and recycle constants only? Would that generate a lot of overhead?

I did an experiment a while back and if I remember correctly I could construct basic R objects such as structures quite a bit faster using Rcpp than using structure() and list().

thomasp85 · 2018-11-12T15:23:20Z

structure() carries its own overhead so I'm not surprised you could beat that...

I've toyed with the idea of a recycling list constructor so that you could do something like this:

new_data_frame(list_recycle(
  letters = c('a', 'b', 'c'),
  one_integer = 1L
))

This would keep the minimal constructor but make it easier to create data frames from a mix of constants and vectors.
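For illustration only, here is a minimal sketch of what such a recycling helper could look like. `list_recycle()` is a hypothetical name taken from the snippet above; it is not an existing function, and the rule used here (only length-1 elements are recycled, anything else must match the longest element) is an assumption.

```r
# Hypothetical helper: recycle length-1 elements to the length of the
# longest element and refuse any other length mismatch.
list_recycle <- function(...) {
  x <- list(...)
  lengths_x <- lengths(x)
  n <- if (length(x) == 0) 0L else max(lengths_x)

  # Anything that is neither a scalar nor of full length is an error
  bad <- lengths_x != 1L & lengths_x != n
  if (any(bad)) {
    stop("Elements must have length 1 or ", n, call. = FALSE)
  }

  # Expand the scalars; full-length elements are left untouched
  scalars <- lengths_x == 1L
  x[scalars] <- lapply(x[scalars], rep, times = n)
  x
}

cols <- list_recycle(
  letters = c("a", "b", "c"),
  one_integer = 1L
)
```

The result is a plain named list with equal-length elements, which is the shape a minimal constructor can turn into a data frame without any further checks.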

thomasp85 · 2018-11-12T15:25:02Z

I would generally not recommend going into Rcpp just for something like this - the overhead of .Call() would very likely offset any gain

hadley

I think new_data_frame() should probably check the lengths; I think you should be able to do that without too much of a performance hit. Otherwise, I think the chances of accidentally creating an invalid data frame are high. It might also be worth considering:

data_frame <- function(...) new_data_frame(list(...))
data.frame <- function(...) stop("Use `data_frame()` for better performance")

Or maybe data_frame() should check lengths and then we can use it most of the time, reserving new_data_frame() for the most performance-critical applications.
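A rough sketch of that split, assuming a bare-bones `new_data_frame()` that trusts its input and a `data_frame()` wrapper that validates lengths first. Both bodies are illustrative assumptions, not the code in this PR.

```r
# Assumed minimal constructor: no validation, just set the attributes
# that make a list a data frame.
new_data_frame <- function(x = list(), n = NULL) {
  if (is.null(n)) {
    n <- if (length(x) == 0) 0L else length(x[[1]])
  }
  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n)
  x
}

# Assumed checked wrapper: all columns must have the same length.
data_frame <- function(...) {
  x <- list(...)
  if (length(unique(lengths(x))) > 1L) {
    stop("All columns must have the same length", call. = FALSE)
  }
  new_data_frame(x)
}

df <- data_frame(x = 1:3, y = c("a", "b", "c"))
```

With this layout, everyday code pays for one `lengths()` call, while hot paths that already know their input is valid can call `new_data_frame()` directly.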

R/axis-secondary.R

hadley · 2018-11-12T15:54:47Z

I think the long term performance plan is to rely on vctrs or tibble for the low-level code.

@thomasp85 for performance PRs like this, could you please include a small summary of the change in the performance in the PR?

clauswilke · 2018-11-12T16:49:36Z

One other issue: If we're starting to introduce performant replacements for widely used R functions then I think recommended use, best practices, and potential pitfalls need to be clearly documented. Not sure where this would go, but maybe it's time to start a vignette that describes internal coding practices for ggplot2?

hadley · 2018-11-12T20:39:37Z

@clauswilke that's what I'm thinking with the data.frame() shim I described above: it makes a good place to include documentation, because you're forced to read it when you accidentally use the other function. In my experience, project documentation alone is not sufficient (even when I'm the only person contributing code 🤣)

clauswilke · 2018-11-12T23:01:15Z

@hadley I think we need all of the above. The proposed plan for data.frame() is good, but I don't think it'll make sense to hide all relevant documentation in that one error message. I'd propose something like the following:

data.frame <- function(...) stop('Please use `data_frame()` instead of `data.frame()` for better performance. See the vignette "ggplot2 internal programming guidelines" for details.')

There are also things that I can think of adding to such a vignette where one wouldn't be able to create such a check, such as best practices on working with aesthetics.
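To make the masking idea concrete, the sketch below simulates a package namespace with a plain environment. `pkg_env` and `make_df()` are hypothetical names used only for this illustration; in a real package the shim would simply be defined in the package's own R code.

```r
# Simulate a package namespace: functions defined here look up
# `data.frame` in pkg_env first and only then fall back to base.
pkg_env <- new.env()

local({
  # The shim shadows base::data.frame() for code defined in this environment
  data.frame <- function(...) {
    stop(
      'Please use `data_frame()` instead of `data.frame()` for better ',
      'performance. See the vignette "ggplot2 internal programming ',
      'guidelines" for details.',
      call. = FALSE
    )
  }

  # Hypothetical internal function that accidentally uses data.frame()
  make_df <- function() data.frame(x = 1:3)
}, envir = pkg_env)

# The accidental call now fails with the guidance message
msg <- tryCatch(pkg_env$make_df(), error = function(e) conditionMessage(e))

# The base function is still reachable when referenced explicitly
ok <- base::data.frame(x = 1:3)
```

Because the shim is found by ordinary lexical lookup, it affects every unqualified `data.frame()` call evaluated inside the package, while explicit `base::data.frame()` calls are unaffected.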


thomasp85 · 2018-11-13T11:06:20Z

new_data_frame now does recycling and takes n as the longest of the elements in x if not given. data_frame is simply new_data_frame(list(...)), and data.frame has been masked with the error message provided by @clauswilke (we'll need to write that vignette obviously).

The overhead of recycling was minor (~1µs) so I think the safety of it makes it fair to have in both constructors.

One unexpected side effect of masking data.frame() was that it also affected all test code - it has been updated accordingly.

The performance benefit of this depends heavily on the nature of the plot. In a standard scatter plot it is going to be non-existent (or at least hidden by more taxing operations), as there are only 5 data frames being constructed. The moment you begin to have faceting along with stats implementing compute_group and geoms implementing draw_group it becomes significant, and I could easily get a 20% reduction in ggplotGrob() time on a faceted boxplot of the diamonds dataset.
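As a rough way to see where the per-call saving comes from, the sketch below times repeated construction of a small data frame with base `data.frame()` against a bare attribute-setting constructor. This is not the benchmark behind the 20% figure above; `fast_df()` and the repetition count are assumptions made for the illustration, and the timings will vary by machine.

```r
# Assumed bare constructor: sets class and row names, nothing else
fast_df <- function(x) {
  n <- length(x[[1]])
  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n)
  x
}

cols <- list(x = 1:10, y = runif(10))
n_rep <- 2000

# Time many small constructions, as happens per group and per panel
t_base <- system.time(
  for (i in seq_len(n_rep)) data.frame(x = cols$x, y = cols$y)
)[["elapsed"]]

t_fast <- system.time(
  for (i in seq_len(n_rep)) fast_df(cols)
)[["elapsed"]]

# Both approaches should produce equivalent data frames
a <- fast_df(cols)
b <- data.frame(x = cols$x, y = cols$y)
```

Comparing `t_base` and `t_fast` gives a feel for the constructor overhead alone; the effect on a full plot depends on how many data frames the stats and geoms create.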

R/aaa-.r

thomasp85 · 2018-11-13T14:27:33Z

Recycling now only applies to scalars. Vectors of any length other than the final number of rows will throw an error.
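The behaviour described in this thread (n defaults to the longest element, only scalars are recycled, input must be named) could be sketched roughly as follows. This is an approximation written from the comments here, not a copy of the merged code in R/aaa-.r, and the error messages are made up for the example.

```r
new_data_frame <- function(x = list(), n = NULL) {
  # Unnamed input would give a data frame without column names
  if (length(x) != 0 && is.null(names(x))) {
    stop("Elements must be named", call. = FALSE)
  }

  lengths_x <- lengths(x)
  if (is.null(n)) {
    # Take the number of rows from the longest element
    n <- if (length(x) == 0) 0L else max(lengths_x)
  }

  for (i in seq_along(x)) {
    if (lengths_x[i] == n) next
    # Only scalars are recycled; any other mismatch is an error
    if (lengths_x[i] != 1L) {
      stop("Elements must have length 1 or ", n, call. = FALSE)
    }
    x[[i]] <- rep(x[[i]], n)
  }

  class(x) <- "data.frame"
  attr(x, "row.names") <- .set_row_names(n)
  x
}

data_frame <- function(...) new_data_frame(list(...))

df <- data_frame(letters = c("a", "b", "c"), one_integer = 1L)
```

Restricting recycling to scalars keeps the convenient constant-plus-vector case while still catching the classic bug where a length-2 vector is silently repeated along a length-4 column.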

thomasp85 · 2018-11-15T09:24:14Z

@hadley can you look this through again and see if you are content with the implementation?

R/aaa-.r

R/annotation-logticks.r

thomasp85 · 2018-11-15T20:07:01Z

ok, @hadley everything should be fixed now

clauswilke · 2018-11-15T21:41:43Z

There's one test remaining that uses data.frame(). Looks like a race condition between your commit and @yutannihilation's recent commit that introduced that line.

ggplot2/tests/testthat/test-layer.r

Line 42 in 4380cb9

df <- data.frame(x = 1:10)

thomasp85 · 2018-11-15T21:58:22Z

yeah, I've fixed this in #3003

lock · 2019-05-14T22:30:39Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

thomasp85 added 22 commits January 24, 2016 22:08

Resolve data upon extraction within ggplot_build

c2b7e0c

Merge remote-tracking branch 'hadley/master'

3b5eed9

# Conflicts:

e789d26

# R/plot-build.r

Merge branch 'hadley-master'

5e7fe1e

Merge remote-tracking branch 'origin/master'

eb24025

Merge remote-tracking branch 'hadley/master'

6287025

Merge branch 'tidyverse/master'

90c5da0

Merge remote-tracking branch 'tidyverse/master'

39dfb1f

Memoize calls to descentDetails()

d10d2e8

Merge remote-tracking branch 'upstream/master'

0eff97a

Merge branch 'memoise-descent' into new_data_frame

7c7d492

sub data.frame with new_data_frame in backbone functions

1476c6f

Merge branch 'master' of https://github.com/tidyverse/ggplot2 into ne…

44f2d9a

…w_data_frame

Update constructor API

4af7f83

Remove data.frame calls in favour of new_data_frame

5a92b7a

Last effort to squash data.frame()

05d0484

memoise by the current device as well

47ef11d

import dev.cur

72c351f

Merge branch 'memoise-descent' into new_data_frame

e14055c

Remove tibble() where relevant

a3bee4d

Merge branch 'master' of https://github.com/tidyverse/ggplot2 into ne…

a5a4bdd

…w_data_frame

Add description to vignette

50d1a7d

thomasp85 added the performance label Nov 12, 2018

thomasp85 requested a review from hadley November 12, 2018 14:25

hadley reviewed Nov 12, 2018

R/axis-secondary.R (outdated)

thomasp85 added 2 commits November 13, 2018 11:12

Change data.frame constructor to do automatic recycling. Add data_fra…

8606fff

…me version and guard data.frame

Update tests to use data_frame instead of data.frame

d3ccd4c

hadley reviewed Nov 13, 2018

R/aaa-.r (outdated)

R/aaa-.r (outdated)

R/aaa-.r (outdated)

More strict recycling. Check for named input

dae8d4c

hadley reviewed Nov 15, 2018

R/aaa-.r (outdated)

R/aaa-.r

R/aaa-.r (outdated)

R/annotation-logticks.r (outdated)

thomasp85 added 3 commits November 15, 2018 20:56

cleaner mat_2_col implementation

57ffd8e

Removed unnessecary rep()

eb82d19

Remove stray stringAsFactors

9efcffb

hadley approved these changes Nov 15, 2018

thomasp85 merged commit 92d2777 into tidyverse:master Nov 15, 2018

karawoo mentioned this pull request Jan 15, 2019

data.frame override hinders local development #3067

Closed

lock bot locked and limited conversation to collaborators May 14, 2019
