Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow as.data.frame() conversion from sf::st_distance() #361

Closed
grcatlin opened this issue Dec 4, 2023 · 4 comments
Closed

slow as.data.frame() conversion from sf::st_distance() #361

grcatlin opened this issue Dec 4, 2023 · 4 comments

Comments

@grcatlin
Copy link

grcatlin commented Dec 4, 2023

I discovered what I believe is some kind of bug today while playing around with distance matrices output by sf::st_distance(). It seems as though retaining units massively slows down performance if attempting to convert to a data.frame.

I noticed this issue while attempting to calculate distances between a couple of points and ~40,000 polygons. I've been able to reproduce the issue with some sampled points below.

library(sf)
library(units)

# seed
set.seed(8675309)

# data
nc <- system.file("gpkg/nc.gpkg", package = "sf") |>
    sf::st_read()

# outline
outline <- nc |>
    sf::st_union() |>
    sf::st_as_sf()

# pts
pts <- outline |>
    sf::st_sample(40000)

# pts 2
pts_2 <- outline |>
    sf::st_sample(2)

# distance matrix
mat <- sf::st_distance(pts_2, pts)

The actual calculation of the distance matrix is extremely fast, but in my use case I wanted to convert it to a data.frame.

mat |> as.data.frame()

However, this takes much longer than it should.

# normal as.data.frame()
to_df <- function() {
    mat |> as.data.frame()
}

# benchmark
rbenchmark::benchmark(to_df(), replications = 5)
> rbenchmark::benchmark(to_df(), replications = 5)
     test replications elapsed relative user.self sys.self user.child sys.child
1 to_df()            5  132.03        1     72.72      1.7         NA        NA

If I, instead, drop the units from the matrix, it runs quickly as expected.

# drop units, then as.data.frame()
to_df_drop_units <- function() {
    mat |>
        units::drop_units() |>
        as.data.frame()
}

# benchmark
rbenchmark::benchmark(to_df_drop_units(), replications = 5)
> rbenchmark::benchmark(to_df_drop_units(), replications = 5)
                test replications elapsed relative user.self sys.self
1 to_df_drop_units()            5     0.2        1       0.2        0

I've been able to reproduce on both my Windows 11 machine and MacOS laptop.

> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)

other attached packages:
[1] units_0.8-5 sf_1.0-14

loaded via a namespace (and not attached):
 [1] s2_1.1.4           utf8_1.2.4         R6_2.5.1           tidyselect_1.2.0
 [5] e1071_1.7-13       magrittr_2.0.3     glue_1.6.2         tibble_3.2.1
 [9] KernSmooth_2.23-22 pkgconfig_2.0.3    generics_0.1.3     dplyr_1.1.4
[13] wk_0.9.1           lifecycle_1.0.4    classInt_0.4-10    cli_3.6.1
[17] fansi_1.0.5        grid_4.3.2         vctrs_0.6.5        DBI_1.1.3
[21] proxy_0.4-27       class_7.3-22       compiler_4.3.2     tools_4.3.2
[25] pillar_1.9.0       Rcpp_1.0.11        lwgeom_0.2-13      rlang_1.1.2
[29] jsonlite_1.8.8     rbenchmark_1.0.0
@edzer
Copy link
Member

edzer commented Dec 5, 2023

Why would you want to convert a distance matrix to a data.frame, and why do you consider this some kind of bug?

@Enchufa2
Copy link
Member

Enchufa2 commented Dec 5, 2023

I subscribe the question. Your matrix is 2x40000. So basically you are creating 40 thousand columns of units objects. It takes what it takes, and we cannot do less than what's necessary.

@grcatlin
Copy link
Author

grcatlin commented Dec 5, 2023

Hi there, thanks for both of your responses!

I had just assumed there was some kind bug since the difference in time between the two was that large - thought I would bring it up in case that behavior wasn't expected!

In my particular use case, I was trying to find the closest polygon to the 2 points. I would've been much better off to use the following.

apply(mat, 1, which.min)

I almost always work with data.frames and usually the cost of conversion doesn't make much of a practical difference which is why I thought something was wrong.

I remember converting to a data.frame for a slightly different problem months ago where I wanted the closest geometry to a point (excluding the point itself) in which case apply(mat, 1, which.min) wouldn't work. I turned into a data.frame and used reshape2::melt() but when reminding myself today, I found out reshape2 is going away. Turns out you can use as.data.frame.table() instead so that use-case ends up being moot. Would be curious to know if there's a better way to do this but that seems like more of an SO question.

# rename col, rows
rownames(mat) <- seq_len(nrow(mat))
colnames(mat) <- seq_len(ncol(mat))

# calculate min
mat |>
    as.data.frame.table(
        stringsAsFactors = FALSE, responseName = "Dist"
    ) |>
    dplyr::filter(Var1 != Var2) |>
    dplyr::group_by(Var1) |>
    dplyr::slice_min(order_by = Dist)
# A tibble: 2 × 3
# Groups:   Var1 [2]
  Var1  Var2   Dist
  <chr> <chr>   [m]
1 1     37394  650.
2 2     15514 2124.

Regardless, I did find a way to speed things up. In as.data.frame.units, if you lapply it seems speeds things up a bunch. I'm sure there are issues with how I did this, but more for proof-of-concept.

# lapply as.data.frame.units
as.data.frame.units.alt <- function(x, row.names = NULL, optional = FALSE, ...) {
    # unchanged
    df <- as.data.frame(unclass(x), row.names, optional)
    if (!optional && ncol(df) == 1) {
        colnames(df) <- deparse(substitute(x))
    }

    # for (i in seq_along(df)) {
    #     units(df[[i]]) <- units(x)
    # }

    # get units from obj
    u <- units(x)

    # get col names (because as.data.frame won't listen)
    col_names <- colnames(df)

    # lapply instead of for loop
    df <- lapply(
        df,
        FUN = function(x) units::as_units(x, value = u)
    ) |>
        as.data.frame()

    # assign names since col.names arg in as.data.frame won't work
    colnames(df) <- col_names

    # return
    return(df)
}
# normal
x <- mat |>
    as.data.frame()

# modified
y <- mat |>
    as.data.frame.units.alt()

# test if equal
all.equal(x, y)
> all.equal(x, y)
[1] TRUE

Here are some benchmarks:

# normal
to_df <- function(x) {
    mat |> as.data.frame()
}

# alternate
to_df_alt <- function(x) {
    mat |>
        as.data.frame.units.alt()
}

# benchmark normal
rbenchmark::benchmark(to_df(), replications = 5)

# benchmark alt
rbenchmark::benchmark(to_df_alt(), replications = 5)
> rbenchmark::benchmark(to_df(), replications = 5)
     test replications elapsed relative user.self sys.self user.child sys.child
1 to_df()            5  244.86        1    210.07     1.36         NA        NA
> rbenchmark::benchmark(to_df_alt(), replications = 5)
         test replications elapsed relative user.self sys.self user.child
1 to_df_alt()            5   13.52        1      9.89     0.23         NA
  sys.child
1        NA

Anyway, thanks for encouraging me to reflect on my insanity! I'll go ahead and close the issue though it does seem like there's speed to be gained since it seems like there's not much of a use-case.

@Enchufa2
Copy link
Member

Enchufa2 commented Dec 6, 2023

The difference is suprising! Thanks for taking the time to dig into this. In the commit above, I've created a small private helper dfapply and then applied the optimization to a couple of places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants