slow `as.data.frame()` conversion from `sf::st_distance()`
#361

I discovered what I believe is some kind of bug today while playing around with distance matrices output by `sf::st_distance()`. It seems as though retaining `units` massively slows down performance when attempting to convert to a `data.frame`.

I noticed this issue while attempting to calculate distances between a couple of points and ~40,000 polygons. I've been able to reproduce the issue with some sampled points below.

The actual calculation of the distance matrix is extremely fast, but in my use case I wanted to convert it to a `data.frame`. However, this takes much longer than it should. If I instead drop the units from the matrix, it runs quickly as expected.

I've been able to reproduce on both my Windows 11 machine and my macOS laptop.
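A minimal sketch of a reproduction along these lines, using random sample points instead of the original polygons (the coordinates and sizes here are illustrative assumptions, not the original example):

```r
library(sf)

# illustrative setup: 2 points against 40,000 sampled points
# (the original example used ~40,000 polygons; the effect is the same)
pts <- st_as_sf(
  data.frame(x = c(5.1, 5.2), y = c(52.0, 52.1)),
  coords = c("x", "y"), crs = 4326
)
smp <- st_as_sf(
  data.frame(x = runif(40000, 5, 6), y = runif(40000, 52, 53)),
  coords = c("x", "y"), crs = 4326
)

# computing the distance matrix itself is fast
mat <- st_distance(pts, smp)  # a 2 x 40000 units matrix (metres)

# slow: converting to data.frame while retaining units
system.time(as.data.frame(mat))

# fast: dropping units first
system.time(as.data.frame(units::drop_units(mat)))
```
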
Comments
Why would you want to convert a distance matrix to a data.frame, and why do you consider this some kind of bug?

I second the question. Your matrix is 2 × 40000, so basically you are creating 40 thousand columns of units objects. It takes what it takes, and we cannot do less than what's necessary.

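A small sketch of what that means in practice, assuming `mat` is the 2 × 40000 units matrix from the example above:

```r
df <- as.data.frame(mat)

# one column per polygon...
ncol(df)        # 40000

# ...and every column is its own units vector whose attribute
# had to be restored individually during the conversion
class(df[[1]])  # "units"
```
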
Hi there, thanks for both of your responses! I had just assumed there was some kind of bug since the difference in time between the two was that large; I thought I would bring it up in case that behavior wasn't expected! In my particular use case, I was trying to find the closest polygon to the 2 points, and I would've been much better off using the following:

```r
apply(mat, 1, which.min)
```

I almost always work with data.frames, though, and I remember converting a matrix to a data.frame like this in the past:

```r
# rename cols, rows
rownames(mat) <- seq_len(nrow(mat))
colnames(mat) <- seq_len(ncol(mat))

# calculate the minimum distance per row
mat |>
  as.data.frame.table(
    stringsAsFactors = FALSE, responseName = "Dist"
  ) |>
  dplyr::filter(Var1 != Var2) |>
  dplyr::group_by(Var1) |>
  dplyr::slice_min(order_by = Dist)
```

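Coming back to the `which.min` route above, a sketch of getting both the closest polygon and its distance straight from the units matrix, with no `data.frame` conversion (`mat` as in the example above):

```r
# index of the closest polygon for each of the 2 points
idx <- apply(mat, 1, which.min)

# the corresponding minimum distances
# (units should carry through the subset)
mat[cbind(seq_len(nrow(mat)), idx)]
```
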
Regardless, I did find a way to speed things up. In `as.data.frame.units`, the for loop that restores the units on each column seems to be the bottleneck; I suspect assigning to one column of a wide data.frame at a time is what costs so much. Replacing the loop with `lapply()` is much faster:

```r
# lapply-based variant of as.data.frame.units
as.data.frame.units.alt <- function(x, row.names = NULL, optional = FALSE, ...) {
  # unchanged from the original method
  df <- as.data.frame(unclass(x), row.names, optional)
  if (!optional && ncol(df) == 1) {
    colnames(df) <- deparse(substitute(x))
  }

  # the original for loop, replaced below:
  # for (i in seq_along(df)) {
  #   units(df[[i]]) <- units(x)
  # }

  # get units from the object
  u <- units(x)

  # get the column names (because as.data.frame won't listen)
  col_names <- colnames(df)

  # lapply instead of a for loop
  df <- lapply(
    df,
    FUN = function(x) units::as_units(x, value = u)
  ) |>
    as.data.frame()

  # assign names since the col.names arg in as.data.frame won't work
  colnames(df) <- col_names

  return(df)
}
```

Checking that the two versions agree:

```r
# normal
x <- mat |>
  as.data.frame()

# modified
y <- mat |>
  as.data.frame.units.alt()

# test if equal
all.equal(x, y)
```

Here are some benchmarks:

```r
# normal
to_df <- function(x) {
  x |> as.data.frame()
}

# alternate
to_df_alt <- function(x) {
  x |> as.data.frame.units.alt()
}

# benchmark normal
rbenchmark::benchmark(to_df(mat), replications = 5)

# benchmark alt
rbenchmark::benchmark(to_df_alt(mat), replications = 5)
```

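For anyone curious where the slow path spends its time, a sketch using base R's sampling profiler (`mat` as above; output will vary by machine):

```r
# profile the slow units-retaining conversion
Rprof(prof_file <- tempfile())
invisible(as.data.frame(mat))
Rprof(NULL)

# the heaviest calls by self time
head(summaryRprof(prof_file)$by.self)
```
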
Anyway, thanks for encouraging me to reflect on my insanity! I'll go ahead and close the issue; even though it does seem like there's speed to be gained, there doesn't seem to be much of a use case.

The difference is surprising! Thanks for taking the time to dig into this. In the commit above, I've created a small private helper.