Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose sample on the rioxarray accessor #800

Open
martinfleis opened this issue Aug 29, 2024 · 9 comments
Open

Expose sample on the rioxarray accessor #800

martinfleis opened this issue Aug 29, 2024 · 9 comments
Labels
proposal Idea for a new feature.

Comments

@martinfleis
Copy link

I was recently comparing the performance of point sampling from a raster in xvec (xarray-contrib/xvec#81) and learned that when using xarray's sel, the sampling is about 40x slower than if I use rasterio's sample method (see the notebook in the linked issue).

I can possibly use sample from the rasterio object available within the rio._manager as list(dtm_da.rio._manager.acquire().sample(list(zip(x, y)))) but that is using a private API of rioxarray.

Would it be in scope to expose sample directly on the rio accessor? We could then consume it in xvec when dealing with rioxarray-backed DataArrays.

@martinfleis martinfleis added the proposal Idea for a new feature. label Aug 29, 2024
@snowman2
Copy link
Member

rio._manager is something that is easily lost after performing operations on the DataArray|Dataset and is only available when the DataArray|Dataset is opened with rioxarray.open_rasterio or when using engine="rasterio". Unfortunately, due to these limitations, exposing sample on the rio accessor would likely cause confusion as it would not always work as expected.

I would be interested to know if this method is any faster.

If it is, that may enable us to add rio.sample:

def sample(self, *args, **kwargs):
    try:
       return self._manager.acquire().sample(*args, **kwargs)
    except AttributeError:
        pass
    with MemoryFile() as memfile:
        self._xds.rio.to_raster(memfile.name)
        with memfile.open() as dst:
             return dst.sample(*args, **kwargs)

@martinfleis
Copy link
Author

Using the DTM from the notebook above, the trip via MemoryFile takes 36s, about 4x more than using the built-in sel method of xarray, so there's no point in doing that.

It seems that exposing rasterio's sample is not a feasible option. Thanks anyway!

@snowman2
Copy link
Member

These observations are also useful to be aware of: xarray-contrib/xvec#81 (comment)

Thanks for the proposal!

@scottyhq
Copy link
Contributor

scottyhq commented Jan 6, 2025

I just looked into this a little bit, and notice that selecting points via xarray first loads the all pixels a buffered envelope around all the points, rather than the rasterio.sample approach of serially loading points as single pixel windowed reads. This is a big contrast if you open a large remote VRT!

import logging
import rasterio
import xarray as xr

logging.basicConfig(level=logging.DEBUG,
                    handlers=[logging.StreamHandler()]
                    )

# Size is 1296001, 417601
url = 'https://opentopography.s3.sdsc.edu/raster/SRTM_GL1_Ellip/SRTM_GL1_Ellip_srtm.vrt'

ENV = rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
                   CPL_DEBUG=True,
                   )

# Sample a few points spread far apart 
# https://docs.xarray.dev/en/latest/user-guide/interpolation.html#advanced-interpolation
points = [(-122.33, 47.60), (-74.00, 40.71), (-95.37, 29.76)]
xi = xr.DataArray([p[0] for p in points], dims="z")
yi = xr.DataArray([p[1] for p in points], dims="z")

with rasterio.open(url) as src:
    samples_rasterio = list(rasterio.sample.sample_gen(src, points))
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0
# IO window xoff=381600.0 yoff=69444.0 width=1.0 height=1.0
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0

# NOTE: xarray tries to load a large buffered window arround all points
# This will likely crash, requiring too much memory
daOT = xr.open_dataarray(url, engine='rasterio').squeeze()
samples_xarray = daOT.sel(x=xi, y=yi, method='nearest').compute()
# IO window xoff=207612.0 yoff=44640.0 width=173989.0 height=64225.0

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray? I admit I haven't looked at the code close enough to know where that envelope window is coming from...

@scottyhq
Copy link
Contributor

scottyhq commented Jan 6, 2025

I think it would be nice to have a built-in function to make this simple and efficient with rioxarray (or xvec) as it's a common use-case. The workaround I'm currently using is to compute windows ahead of time and use isel_window to extract values:

# Loop over custom windows 
# NOTE: could be various sizes to enable bilinear interpolation etc
# e.g. https://github.com/kylebarron/demquery/blob/5871ad22541c7645b456a8fb998dc2be57b5ef91/demquery/demquery.py#L43
from rasterio.windows import Window
values = []
with rasterio.open(url) as src:
    for point in points:
        row, col = src.index(point[0], point[1])
        window = Window(col, row, 1, 1)
        values.append(daOT.rio.isel_window(window).to_numpy())

Edit: actually, it's more straightforward to just loop & stick with da.sel (single points trigger IO windows with width=1.0 height=1.0):

pts_da = np.zeros_like(xi)
for i,p in enumerate(points):
    pts_da[i] = dtm_da.sel(x=p[0], y=p[1], method="nearest").squeeze().to_numpy()

@benbovy
Copy link

benbovy commented Jan 6, 2025

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray?

This seems to be handled by

class RasterioArrayWrapper(BackendArray):

More specifically RasterioArrayWrapper._get_indexer(key) returns a single window regardless of the input key. For vectorized indexing this is likely not optimal in most cases and yields very poor performance in cases like sampling a huge DTM with some arbitrarily located points spread far apart.

Maybe RasterioArrayWrapper could be refactored so it separately handles orthogonal vs. vectorized indexing? (see, e.g., the ZarrArrayWrapper, which is another example of a BackendArray subclass, although I don't how this should be properly implemented in 3rd-party IO Xarray backends in the context of the NamedArray refactor). Maybe vectorized indexing could be implemented there by relying on rasterio's sample? Not sure as I'm not familiar enough with rasterio / rioxarray. Computing a window for each input point could be an alternative although it may not be optimal (or even yield bad performance) in many cases too?

@dcherian
Copy link

dcherian commented Jan 6, 2025

key, self.shape, indexing.IndexingSupport.OUTER, self._getitem

You'll have to chnage this to IndexingSupport.Vectorized to receive the right inputs to do what you need IIUC. Currently Xarray is rewriting the user query to only use OuterIndexer as requested by this line.

@benbovy
Copy link

benbovy commented Jan 6, 2025

Is it (already) recommended for 3rd-party backend array classes to migrate to .oindex and .vindex?

@dcherian
Copy link

dcherian commented Jan 6, 2025

no anderson and my work is pending a review from Stephan :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Idea for a new feature.
Projects
None yet
Development

No branches or pull requests

5 participants