slow `as.data.frame()` conversion from `sf::st_distance()` #361

grcatlin · 2023-12-04T21:31:43Z

I discovered what I believe is some kind of bug today while playing around with distance matrices output by sf::st_distance(). It seems as though retaining units massively slows down performance if attempting to convert to a data.frame.

I noticed this issue while attempting to calculate distances between a couple of points and ~40,000 polygons. I've been able to reproduce the issue with some sampled points below.

library(sf)
library(units)

# seed
set.seed(8675309)

# data
nc <- system.file("gpkg/nc.gpkg", package = "sf") |>
    sf::st_read()

# outline
outline <- nc |>
    sf::st_union() |>
    sf::st_as_sf()

# pts
pts <- outline |>
    sf::st_sample(40000)

# pts 2
pts_2 <- outline |>
    sf::st_sample(2)

# distance matrix
mat <- sf::st_distance(pts_2, pts)

The actual calculation of the distance matrix is extremely fast, but in my use case I wanted to convert it to a data.frame.

mat |> as.data.frame()

However, this takes much longer than it should.

# normal as.data.frame()
to_df <- function() {
    mat |> as.data.frame()
}

# benchmark
rbenchmark::benchmark(to_df(), replications = 5)

> rbenchmark::benchmark(to_df(), replications = 5)
     test replications elapsed relative user.self sys.self user.child sys.child
1 to_df()            5  132.03        1     72.72      1.7         NA        NA

If I, instead, drop the units from the matrix, it runs quickly as expected.

# drop units, then as.data.frame()
to_df_drop_units <- function() {
    mat |>
        units::drop_units() |>
        as.data.frame()
}

# benchmark
rbenchmark::benchmark(to_df_drop_units(), replications = 5)

> rbenchmark::benchmark(to_df_drop_units(), replications = 5)
                test replications elapsed relative user.self sys.self
1 to_df_drop_units()            5     0.2        1       0.2        0

I've been able to reproduce on both my Windows 11 machine and MacOS laptop.

> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)

other attached packages:
[1] units_0.8-5 sf_1.0-14

loaded via a namespace (and not attached):
 [1] s2_1.1.4           utf8_1.2.4         R6_2.5.1           tidyselect_1.2.0
 [5] e1071_1.7-13       magrittr_2.0.3     glue_1.6.2         tibble_3.2.1
 [9] KernSmooth_2.23-22 pkgconfig_2.0.3    generics_0.1.3     dplyr_1.1.4
[13] wk_0.9.1           lifecycle_1.0.4    classInt_0.4-10    cli_3.6.1
[17] fansi_1.0.5        grid_4.3.2         vctrs_0.6.5        DBI_1.1.3
[21] proxy_0.4-27       class_7.3-22       compiler_4.3.2     tools_4.3.2
[25] pillar_1.9.0       Rcpp_1.0.11        lwgeom_0.2-13      rlang_1.1.2
[29] jsonlite_1.8.8     rbenchmark_1.0.0

The text was updated successfully, but these errors were encountered:

edzer · 2023-12-05T08:17:41Z

Why would you want to convert a distance matrix to a data.frame, and why do you consider this some kind of bug?

Enchufa2 · 2023-12-05T10:18:12Z

I subscribe the question. Your matrix is 2x40000. So basically you are creating 40 thousand columns of units objects. It takes what it takes, and we cannot do less than what's necessary.

grcatlin · 2023-12-05T19:11:22Z

Hi there, thanks for both of your responses!

I had just assumed there was some kind bug since the difference in time between the two was that large - thought I would bring it up in case that behavior wasn't expected!

In my particular use case, I was trying to find the closest polygon to the 2 points. I would've been much better off to use the following.

apply(mat, 1, which.min)

I almost always work with data.frames and usually the cost of conversion doesn't make much of a practical difference which is why I thought something was wrong.

I remember converting to a data.frame for a slightly different problem months ago where I wanted the closest geometry to a point (excluding the point itself) in which case apply(mat, 1, which.min) wouldn't work. I turned into a data.frame and used reshape2::melt() but when reminding myself today, I found out reshape2 is going away. Turns out you can use as.data.frame.table() instead so that use-case ends up being moot. Would be curious to know if there's a better way to do this but that seems like more of an SO question.

# rename col, rows
rownames(mat) <- seq_len(nrow(mat))
colnames(mat) <- seq_len(ncol(mat))

# calculate min
mat |>
    as.data.frame.table(
        stringsAsFactors = FALSE, responseName = "Dist"
    ) |>
    dplyr::filter(Var1 != Var2) |>
    dplyr::group_by(Var1) |>
    dplyr::slice_min(order_by = Dist)

# A tibble: 2 × 3
# Groups:   Var1 [2]
  Var1  Var2   Dist
  <chr> <chr>   [m]
1 1     37394  650.
2 2     15514 2124.

Regardless, I did find a way to speed things up. In as.data.frame.units, if you lapply it seems speeds things up a bunch. I'm sure there are issues with how I did this, but more for proof-of-concept.

# lapply as.data.frame.units
as.data.frame.units.alt <- function(x, row.names = NULL, optional = FALSE, ...) {
    # unchanged
    df <- as.data.frame(unclass(x), row.names, optional)
    if (!optional && ncol(df) == 1) {
        colnames(df) <- deparse(substitute(x))
    }

    # for (i in seq_along(df)) {
    #     units(df[[i]]) <- units(x)
    # }

    # get units from obj
    u <- units(x)

    # get col names (because as.data.frame won't listen)
    col_names <- colnames(df)

    # lapply instead of for loop
    df <- lapply(
        df,
        FUN = function(x) units::as_units(x, value = u)
    ) |>
        as.data.frame()

    # assign names since col.names arg in as.data.frame won't work
    colnames(df) <- col_names

    # return
    return(df)
}

# normal
x <- mat |>
    as.data.frame()

# modified
y <- mat |>
    as.data.frame.units.alt()

# test if equal
all.equal(x, y)

> all.equal(x, y)
[1] TRUE

Here are some benchmarks:

# normal
to_df <- function(x) {
    mat |> as.data.frame()
}

# alternate
to_df_alt <- function(x) {
    mat |>
        as.data.frame.units.alt()
}

# benchmark normal
rbenchmark::benchmark(to_df(), replications = 5)

# benchmark alt
rbenchmark::benchmark(to_df_alt(), replications = 5)

> rbenchmark::benchmark(to_df(), replications = 5)
     test replications elapsed relative user.self sys.self user.child sys.child
1 to_df()            5  244.86        1    210.07     1.36         NA        NA

> rbenchmark::benchmark(to_df_alt(), replications = 5)
         test replications elapsed relative user.self sys.self user.child
1 to_df_alt()            5   13.52        1      9.89     0.23         NA
  sys.child
1        NA

Anyway, thanks for encouraging me to reflect on my insanity! I'll go ahead and close the issue though it does seem like there's speed to be gained since it seems like there's not much of a use-case.

Enchufa2 · 2023-12-06T11:53:39Z

The difference is suprising! Thanks for taking the time to dig into this. In the commit above, I've created a small private helper dfapply and then applied the optimization to a couple of places.

grcatlin closed this as completed Dec 5, 2023

Enchufa2 added a commit that referenced this issue Dec 6, 2023

performance improvements in data.frame methods, as suggested in #361

7439c5c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow `as.data.frame()` conversion from `sf::st_distance()` #361

slow `as.data.frame()` conversion from `sf::st_distance()` #361

grcatlin commented Dec 4, 2023

edzer commented Dec 5, 2023

Enchufa2 commented Dec 5, 2023

grcatlin commented Dec 5, 2023

Enchufa2 commented Dec 6, 2023

slow as.data.frame() conversion from sf::st_distance() #361

slow as.data.frame() conversion from sf::st_distance() #361

Comments

grcatlin commented Dec 4, 2023

edzer commented Dec 5, 2023

Enchufa2 commented Dec 5, 2023

grcatlin commented Dec 5, 2023

Enchufa2 commented Dec 6, 2023

slow `as.data.frame()` conversion from `sf::st_distance()` #361

slow `as.data.frame()` conversion from `sf::st_distance()` #361