Expose sample on the rioxarray accessor #800

martinfleis · 2024-08-29T08:46:01Z

I was recently comparing the performance of point sampling from a raster in xvec (xarray-contrib/xvec#81) and learned that when using xarray's sel, the sampling is about 40x slower than if I use rasterio's sample method (see the notebook in the linked issue).

I can possibly use sample from the rasterio object available within the rio._manager as list(dtm_da.rio._manager.acquire().sample(list(zip(x, y)))) but that is using a private API of rioxarray.

Would it be in scope to expose sample directly on the rio accessor? We could then consume it in xvec when dealing with rioxarray-backed DataArrays.

The text was updated successfully, but these errors were encountered:

snowman2 · 2024-08-29T13:27:26Z

rio._manager is something that is easily lost after performing operations on the DataArray|Dataset and is only available when the DataArray|Dataset is opened with rioxarray.open_rasterio or when using engine="rasterio". Unfortunately, due to these limitations, exposing sample on the rio accessor would likely cause confusion as it would not always work as expected.

I would be interested to know if this method is any faster.

If it is, that may enable us to add rio.sample:

def sample(self, *args, **kwargs):
    try:
       return self._manager.acquire().sample(*args, **kwargs)
    except AttributeError:
        pass
    with MemoryFile() as memfile:
        self._xds.rio.to_raster(memfile.name)
        with memfile.open() as dst:
             return dst.sample(*args, **kwargs)

martinfleis · 2024-08-29T14:19:52Z

Using the DTM from the notebook above, the trip via MemoryFile takes 36s, about 4x more than using the built-in sel method of xarray, so there's no point in doing that.

It seems that exposing rasterio's sample is not a feasible option. Thanks anyway!

snowman2 · 2024-08-29T19:59:55Z

These observations are also useful to be aware of: xarray-contrib/xvec#81 (comment)

Thanks for the proposal!

scottyhq · 2025-01-06T13:40:57Z

I just looked into this a little bit, and notice that selecting points via xarray first loads the all pixels a buffered envelope around all the points, rather than the rasterio.sample approach of serially loading points as single pixel windowed reads. This is a big contrast if you open a large remote VRT!

import logging
import rasterio
import xarray as xr

logging.basicConfig(level=logging.DEBUG,
                    handlers=[logging.StreamHandler()]
                    )

# Size is 1296001, 417601
url = 'https://opentopography.s3.sdsc.edu/raster/SRTM_GL1_Ellip/SRTM_GL1_Ellip_srtm.vrt'

ENV = rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
                   CPL_DEBUG=True,
                   )

# Sample a few points spread far apart 
# https://docs.xarray.dev/en/latest/user-guide/interpolation.html#advanced-interpolation
points = [(-122.33, 47.60), (-74.00, 40.71), (-95.37, 29.76)]
xi = xr.DataArray([p[0] for p in points], dims="z")
yi = xr.DataArray([p[1] for p in points], dims="z")

with rasterio.open(url) as src:
    samples_rasterio = list(rasterio.sample.sample_gen(src, points))
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0
# IO window xoff=381600.0 yoff=69444.0 width=1.0 height=1.0
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0

# NOTE: xarray tries to load a large buffered window arround all points
# This will likely crash, requiring too much memory
daOT = xr.open_dataarray(url, engine='rasterio').squeeze()
samples_xarray = daOT.sel(x=xi, y=yi, method='nearest').compute()
# IO window xoff=207612.0 yoff=44640.0 width=173989.0 height=64225.0

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray? I admit I haven't looked at the code close enough to know where that envelope window is coming from...

scottyhq · 2025-01-06T13:52:10Z

I think it would be nice to have a built-in function to make this simple and efficient with rioxarray (or xvec) as it's a common use-case. The workaround I'm currently using is to compute windows ahead of time and use isel_window to extract values:

# Loop over custom windows 
# NOTE: could be various sizes to enable bilinear interpolation etc
# e.g. https://github.com/kylebarron/demquery/blob/5871ad22541c7645b456a8fb998dc2be57b5ef91/demquery/demquery.py#L43
from rasterio.windows import Window
values = []
with rasterio.open(url) as src:
    for point in points:
        row, col = src.index(point[0], point[1])
        window = Window(col, row, 1, 1)
        values.append(daOT.rio.isel_window(window).to_numpy())

Edit: actually, it's more straightforward to just loop & stick with da.sel (single points trigger IO windows with width=1.0 height=1.0):

pts_da = np.zeros_like(xi)
for i,p in enumerate(points):
    pts_da[i] = dtm_da.sel(x=p[0], y=p[1], method="nearest").squeeze().to_numpy()

benbovy · 2025-01-06T15:20:12Z

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray?

This seems to be handled by

rioxarray/rioxarray/_io.py

Line 291 in 4745aec

class RasterioArrayWrapper(BackendArray):

More specifically RasterioArrayWrapper._get_indexer(key) returns a single window regardless of the input key. For vectorized indexing this is likely not optimal in most cases and yields very poor performance in cases like sampling a huge DTM with some arbitrarily located points spread far apart.

Maybe RasterioArrayWrapper could be refactored so it separately handles orthogonal vs. vectorized indexing? (see, e.g., the ZarrArrayWrapper, which is another example of a BackendArray subclass, although I don't how this should be properly implemented in 3rd-party IO Xarray backends in the context of the NamedArray refactor). Maybe vectorized indexing could be implemented there by relying on rasterio's sample? Not sure as I'm not familiar enough with rasterio / rioxarray. Computing a window for each input point could be an alternative although it may not be optimal (or even yield bad performance) in many cases too?

dcherian · 2025-01-06T17:30:21Z

rioxarray/rioxarray/_io.py

Line 457 in 4745aec

key, self.shape, indexing.IndexingSupport.OUTER, self._getitem

You'll have to chnage this to IndexingSupport.Vectorized to receive the right inputs to do what you need IIUC. Currently Xarray is rewriting the user query to only use OuterIndexer as requested by this line.

benbovy · 2025-01-06T19:36:08Z

Is it (already) recommended for 3rd-party backend array classes to migrate to .oindex and .vindex?

dcherian · 2025-01-06T21:13:14Z

no anderson and my work is pending a review from Stephan :(

martinfleis added the proposal Idea for a new feature. label Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose sample on the rioxarray accessor #800

Expose sample on the rioxarray accessor #800

martinfleis commented Aug 29, 2024

snowman2 commented Aug 29, 2024

martinfleis commented Aug 29, 2024

snowman2 commented Aug 29, 2024

scottyhq commented Jan 6, 2025 •

edited

Loading

scottyhq commented Jan 6, 2025 •

edited

Loading

benbovy commented Jan 6, 2025 •

edited

Loading

dcherian commented Jan 6, 2025

benbovy commented Jan 6, 2025

dcherian commented Jan 6, 2025

Expose sample on the rioxarray accessor #800

Expose sample on the rioxarray accessor #800

Comments

martinfleis commented Aug 29, 2024

snowman2 commented Aug 29, 2024

martinfleis commented Aug 29, 2024

snowman2 commented Aug 29, 2024

scottyhq commented Jan 6, 2025 • edited Loading

scottyhq commented Jan 6, 2025 • edited Loading

benbovy commented Jan 6, 2025 • edited Loading

dcherian commented Jan 6, 2025

benbovy commented Jan 6, 2025

dcherian commented Jan 6, 2025

scottyhq commented Jan 6, 2025 •

edited

Loading

scottyhq commented Jan 6, 2025 •

edited

Loading

benbovy commented Jan 6, 2025 •

edited

Loading