-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New data frame #2994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New data frame #2994
Conversation
# R/plot-build.r
I'm all on board with avoiding recycling in principle, but it seems to me that giving up on the ability to construct data frames from constants and vectors is a big loss and puts a lot of extra mental requirements on the code writer. I think this will almost certainly result in new bugs at some point. Would it be possible/make sense to rewrite the constructor with Rcpp and do a basic sanity check (all input vectors are the same length or constants) and recycle constants only? Would that generate a lot of overhead? I did an experiment a while back and if I remember correctly I could construct basic R objects such as structures quite a bit faster using Rcpp than using |
I've toyed with the idea of a recycling list constructor so that you could do something like this: new_data_frame(list_recycle(
letters = c('a', 'b', 'c'),
one_integer = 1L
)) this would keep the minimal constructor but allow some ease when creating data frames from a mix of constants and vectors |
I would generally not recommend going into Rcpp just for something like this - the overhead of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think new_data_frame()
should probably check the lengths — I think you should be able to do that without too much of a performance hit. Otherwise, I think the chances of accidentally creating an invalid data frame are highly. It might also be worth considering:
data_frame <- function(...) new_data_frame(list(...))
data.frame <- function(...) stop("Use `data_frame()` for better performance")
Or maybe data_frame()
should check lengths and then we can it use most of the time, reserving new_data_frame()
for the most performance critical applications.
I think the long term performance plan is to rely on vctrs or tibble for the low-level code. @thomasp85 for performance PRs like this, could you please include a small summary of the change in the performance in the PR? |
One other issue: If we're starting to introduce performant replacements for widely used R functions then I think recommended use, best practices, and potential pitfalls need to be clearly documented. Not sure where this would go, but maybe it's time to start a vignette that describes internal coding practices for ggplot2? |
@clauswilke that what I'm thinking with the |
@hadley I think we need all of the above. The proposed plan for data.frame <- function(...) stop('Please use `data_frame()` instead of `data.frame()` for better performance. See the vignette "ggplot2 internal programming guidelines" for details.') There are also things that I can think of adding to such a vignette where one wouldn't be able to create such a check, such as best practices on working with aesthetics. |
…me version and guard data.frame
The overhead of recycling was minor (~1µs) so I think the safety of it makes it fair to have in both constructors. One unexpected side effect of masking The performance benefit of this is very much related to the nature of the plot. In a standard scatter-plot it is going to be non-existing (or at least hidden by more taxing operations) as there is only 5 data frames being constructed. The moment you begin to have faceting along with stats implementing |
Recycling now only applies to scalars. vectors of other length than the final number of rows will throw an error |
@hadley can you look this through again and see if you are content with the implementation? |
ok, @hadley everything should be fixed now |
There's one test remaining that uses ggplot2/tests/testthat/test-layer.r Line 42 in 4380cb9
|
yeah, I've fixed this in #3003 |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
This is a very sweeping PR, that changes almost all
data.frame()
andas.data.frame()
calls to use the includednew_data_frame()
constructor.unit tests and examples does not show any problems with this, but it is possible that reverse dependency check will surface some places where this PR introduces errors, so we have to look out for that (or run a revdep check on this branch before merging)