Skip to content

New data frame #2994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Nov 15, 2018
Merged

New data frame #2994

merged 28 commits into from
Nov 15, 2018

Conversation

thomasp85
Copy link
Member

This is a very sweeping PR, that changes almost all data.frame() and as.data.frame() calls to use the included new_data_frame() constructor.

unit tests and examples does not show any problems with this, but it is possible that reverse dependency check will surface some places where this PR introduces errors, so we have to look out for that (or run a revdep check on this branch before merging)

@clauswilke
Copy link
Member

I'm all on board with avoiding recycling in principle, but it seems to me that giving up on the ability to construct data frames from constants and vectors is a big loss and puts a lot of extra mental requirements on the code writer. I think this will almost certainly result in new bugs at some point.

Would it be possible/make sense to rewrite the constructor with Rcpp and do a basic sanity check (all input vectors are the same length or constants) and recycle constants only? Would that generate a lot of overhead?

I did an experiment a while back and if I remember correctly I could construct basic R objects such as structures quite a bit faster using Rcpp than using structure() and list().

@thomasp85
Copy link
Member Author

structure() carries its own overhead so I'm not surprised you could beat that...

I've toyed with the idea of a recycling list constructor so that you could do something like this:

new_data_frame(list_recycle(
  letters = c('a', 'b', 'c'),
  one_integer = 1L
))

this would keep the minimal constructor but allow some ease when creating data frames from a mix of constants and vectors

@thomasp85
Copy link
Member Author

I would generally not recommend going into Rcpp just for something like this - the overhead of .Call() would very likely offset any gain

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think new_data_frame() should probably check the lengths — I think you should be able to do that without too much of a performance hit. Otherwise, I think the chances of accidentally creating an invalid data frame are highly. It might also be worth considering:

data_frame <- function(...) new_data_frame(list(...))
data.frame <- function(...) stop("Use `data_frame()` for better performance")

Or maybe data_frame() should check lengths and then we can it use most of the time, reserving new_data_frame() for the most performance critical applications.

@hadley
Copy link
Member

hadley commented Nov 12, 2018

I think the long term performance plan is to rely on vctrs or tibble for the low-level code.

@thomasp85 for performance PRs like this, could you please include a small summary of the change in the performance in the PR?

@clauswilke
Copy link
Member

One other issue: If we're starting to introduce performant replacements for widely used R functions then I think recommended use, best practices, and potential pitfalls need to be clearly documented. Not sure where this would go, but maybe it's time to start a vignette that describes internal coding practices for ggplot2?

@hadley
Copy link
Member

hadley commented Nov 12, 2018

@clauswilke that what I'm thinking with the data.frame() shim I described above — that makes it a good place to include documentation, because you're forced to read it when you accidentally use the other function. In my experience, project documentation alone is not sufficient (even when I'm the only person contributing code 🤣)

@clauswilke
Copy link
Member

@hadley I think we need all of the above. The proposed plan for data.frame() is good, but I don't think it'll make sense to hide all relevant documentation in that one error message. I'd propose something like the following:

data.frame <- function(...) stop('Please use `data_frame()` instead of `data.frame()` for better performance. See the vignette "ggplot2 internal programming guidelines" for details.')

There are also things that I can think of adding to such a vignette where one wouldn't be able to create such a check, such as best practices on working with aesthetics.

@thomasp85
Copy link
Member Author

thomasp85 commented Nov 13, 2018

new_data_frame now does recycling and takes n as the longest of the elements in x if not given. data_frame is simply new_data_frame(list(...)), and data.frame has been masked with the error message provided by @clauswilke (we'll need to write that vignette obviously).

The overhead of recycling was minor (~1µs) so I think the safety of it makes it fair to have in both constructors.

One unexpected side effect of masking data.frame() was that it also affected all test code - it has been updated accordingly.

The performance benefit of this is very much related to the nature of the plot. In a standard scatter-plot it is going to be non-existing (or at least hidden by more taxing operations) as there is only 5 data frames being constructed. The moment you begin to have faceting along with stats implementing compute_group and geoms implementing draw_group it becomes significant, and I could easily get a 20% reduction in ggplotGrob() time on a facetted boxplot of the diamonds dataset

@thomasp85
Copy link
Member Author

Recycling now only applies to scalars. vectors of other length than the final number of rows will throw an error

@thomasp85
Copy link
Member Author

@hadley can you look this through again and see if you are content with the implementation?

@thomasp85
Copy link
Member Author

ok, @hadley everything should be fixed now

@thomasp85 thomasp85 merged commit 92d2777 into tidyverse:master Nov 15, 2018
@clauswilke
Copy link
Member

There's one test remaining that uses data.frame(). Looks like a race condition between your commit and @yutannihilation's recent commit that introduced that line.

df <- data.frame(x = 1:10)

@thomasp85
Copy link
Member Author

yeah, I've fixed this in #3003

@lock
Copy link

lock bot commented May 14, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators May 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants