Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: convenience functions to access variable coords #200

Closed
durack1 opened this issue Mar 18, 2022 · 11 comments
Closed

[FEATURE]: convenience functions to access variable coords #200

durack1 opened this issue Mar 18, 2022 · 11 comments
Labels
type: enhancement New enhancement request

Comments

@durack1
Copy link
Collaborator

durack1 commented Mar 18, 2022

Is your feature request related to a problem?

This somewhat expands on #199 (which could be closed as a dupe - feel free to do this). Accessing information from loaded variables appears to be a little inconsistent, so for e.g.

>>> from xcdat import dataset
>>> f = "/p/css03/esgf_publish/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/ta/gn/v20191115/ta_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-194912.nc"
>>> ta = dataset.open_dataset(f)
>>> ta._coord_names
{'lon', 'time', 'plev', 'lat'}
>>> ta
<xarray.Dataset>
Dimensions:    (time: 1200, bnds: 2, plev: 19, lat: 144, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 1949-12-16T12:00:00
  * plev       (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
...

So to get the names of the axes I need to call _coord_names, which returns the axes in an order that doesn't reflect the dimension ordering which is [time, plev, lat, lon] (above) rather than {'lon', 'time', 'plev', 'lat'} (returned by _coord_names). If we had a ,getTime, .getLevel,.getLatitude, and .getLongitude function (or alternatively another function that could return all dimensions, their names and their order).

Such functionality doesn't appear to exist inxarray at least from what I have found, the closest is https://docs.xarray.dev/en/latest/generated/xarray.DataArray.get_axis_num.html

Describe the solution you'd like

Something like:

>>> from xcdat import dataset
>>> f = "/p/css03/esgf_publish/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/ta/gn/v20191115/ta_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-194912.nc"
>>> ta = dataset.open_dataset(f)
>>> ta.getDimensions
{'time': 1200, 'plev': 19, 'lat': 144, 'lon': 192}
>>> ta.getLevel  # would return any vertical coordinate, so depth, height, lev, level, plev, ... (we could pull all instances of level coord names from the CMIP6/5/3 archive)
<xarray.DataArray 'plev' (plev: 19)>
array([100000.,  92500.,  85000.,  70000.,  60000.,  50000.,  40000.,  30000.,
        25000.,  20000.,  15000.,  10000.,   7000.,   5000.,   3000.,   2000.,
         1000.,    500.,    100.])
Coordinates:
  * plev     (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
Attributes:
    bounds:         plev_bnds
    units:          Pa
    axis:           Z
    positive:       down
    long_name:      pressure
    standard_name:  air_pressure

And we could implement a similar example for ta.getTime, ta.getLatitude, ta.getLongitude

Describe alternatives you've considered

NA

Additional context

NA

@durack1
Copy link
Collaborator Author

durack1 commented Mar 18, 2022

It appears that somewhat of the functionality described above exists, but not as well expanded (and useful) as the example above. ta.coords returns the coordinate names, but not the arrays, so is a partial step toward implementing ta.getLevel as described above

>>> ta.coords
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 1949-12-16T12:00:00
  * plev     (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1

And this is also available

>>> ta.dims
Frozen({'time': 1200, 'bnds': 2, 'plev': 19, 'lat': 144, 'lon': 192})

@tomvothecoder
Copy link
Collaborator

Hi @durack1, to access the coordinates (stored as xarray DataArrays) of ta (also an DataArray):

ta.time   # or ta["time"]

If you want to limit your search to just the coordinates:

ta.coords.time   # or ta.coords["time"]

If you want to access the underlying numpy array of the coordinates, rather than the entire xarray DataArray:

ta.time.values

@durack1
Copy link
Collaborator Author

durack1 commented Mar 21, 2022

@tomvothecoder thanks, my motivation was more to have a standard function that dealt with the various names that certain coordinates may have. time is a fairly standard one, however lat vs latitude vs Y etc was the motivation. This is also true for the level coordinate which has various names - in order to access the plev coordinate info, you need to know that it's called plev and not lev, level, height, depth, alt16, alt40, alternate_hybrid_sigma, etc, etc

@pochedls
Copy link
Collaborator

pochedls commented Mar 21, 2022

I just wanted to note that xcdat uses cf_xarray. I didn't see a getLevel function (though there may be something like this), but I did see that you could do something like this:

ta[ta.cf.axes['Z']]

<xarray.DataArray 'plev' (plev: 19)>
array([100000., 92500., 85000., 70000., 60000., 50000., 40000., 30000.,
25000., 20000., 15000., 10000., 7000., 5000., 3000., 2000.,
1000., 500., 100.])
Coordinates:

  • plev (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
    Attributes:
    bounds: plev_bnds
    units: Pa
    axis: Z
    positive: down
    long_name: pressure
    standard_name: air_pressure

@durack1
Copy link
Collaborator Author

durack1 commented Mar 21, 2022

@pochedls this looks like it might get me half way or most of the way there, I presume it works for T, Z, Y, X? Yep, looks like it https://cf-xarray.readthedocs.io/en/latest/coord_axes.html

@tomvothecoder
Copy link
Collaborator

Thanks for chiming in @pochedls. I was also going to mention cf_xarray, but you beat me to it.

@durack1
Copy link
Collaborator Author

durack1 commented Mar 21, 2022

@tomvothecoder @pochedls out of curiosity have you been running timing benchmarks, or is it too early days? In my ~week long dabble in xcdat it seems far slower with IO than what I was getting with cdms2 - should I open another issue, and close this (presuming my *.cf.axes investigations yield the functionality that this feature request addresses)

@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Mar 22, 2022

@tomvothecoder @pochedls out of curiosity have you been running timing benchmarks, or is it too early days? In my ~week long dabble in xcdat it seems far slower with IO than what I was getting with cdms2

We haven't benchmarked the xcdat.open_dataset()/xcdat.open_mfdataset() APIs against the cdms2.open()/cdms2.open_dataset() APIs yet.

Note, xcdat.open_dataset()/xcdat.open_mfdataset() performs additional operations on top of xarray.open_dataset()/xarray.open_mfdataset() based on the values of the function arguments.

From the docstrings (https://xcdat.readthedocs.io/en/latest/generated/xcdat.dataset.open_dataset.html):

def open_dataset(
    path: str,
    data_var: Optional[str] = None,
    decode_times: bool = True,
    lon_orient: Optional[Tuple[float, float]] = None,
    **kwargs: Dict[str, Any],
) -> xr.Dataset:
    """Wrapper for ``xarray.open_dataset()`` that applies common operations.

    Operations include:

    - Optional decoding of time coordinates with CF or non-CF compliant units if
      the Dataset has a time dimension
    - Add missing bounds for supported axis
    - Option to limit the Dataset to a single regular (non-bounds) data
      variable, while retaining any bounds data variables
    - Option to swap the longitude axis orientation and sort in ascending order
      if the axis exists in the Dataset.

I would expect a performance hit when opening a dataset with non-CF compliant time coordinates (e.g., months/years) and decode_times=True (default behavior). This case will result in a call to xcdat.dataset.decode_non_cf_time() to decode non-CF time coordinates as datetime objects.
Xarray only supports decoding CF-compliant time coordinates (also decode_times=True by default), and cdms2 doesn't decode time coordinates at all.

The xcdat APIs also automatically generates bounds for supported axes (lat, lon, time) if they don't exist, since most xcdat features require bounds. This is another operation that xarray and cdms2 do not support.


If we perform performance benchmarks, we should either:

  1. Disable the additional xcdat operations by passing decode_times=False and not automatically generating bounds, then compare to cdms2
    • Disabling these operations in the xcdat APIs would essentially mimic xarray's API (point below)
  2. Compare xarray's APIs against cdms2 directly
    • Make sure to pass decode_times=False here as well

Afterwards, we can independently benchmark each xcdat helper function such as xcdat.dataset.decode_non_cf_time() to see if we can optimize performance there.


should I open another issue, and close this (presuming my *.cf.axes investigations yield the functionality that this feature request addresses)

I'll open up an issue for performance benchmarking with the aforementioned details and you can close this issue based on your investigation.

@durack1
Copy link
Collaborator Author

durack1 commented Mar 23, 2022

@tomvothecoder thanks for the #200 (comment). I note that cdms2 doesn't obviously have an equivalent to decode_times=False (this has been an issue before regarding IO slow timings when you only want to use indexed, not time etc coordinate data), so in theory xcdat/xarray should be able to systematically beat cdms2

@durack1
Copy link
Collaborator Author

durack1 commented Mar 24, 2022

I note that the file open error in #183 and the correct accessing of axes is related, so worth noting these open issues are related.

I also note that I have been hitting many issues in both cdms2 and xcdat/xarray.open_dataset when reading out native CMIP6 holdings, so this is an option for testing out any prototype solutions against a very large (10x millions) of netcdf files that are all meant to be CF- and CMIP6 format compliant

@tomvothecoder
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New enhancement request
Projects
None yet
Development

No branches or pull requests

3 participants