Update open_mfdataset() to avoid data vars dim concatenation #143

Merged 6 commits on Nov 15, 2021

Conversation

tomvothecoder
Collaborator

@tomvothecoder tomvothecoder commented Nov 9, 2021

Description

This PR updates xcdat.open_mfdataset() to use data_vars="minimal" by default. This avoids concatenating data variables along a dimension they do not have, which breaks the reduction in spatial averaging.

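The default xr.open_mfdataset() behavior (data_vars="all") is to concatenate every data variable along the concatenation dimension, including variables that do not have that dimension. Passing data_vars="minimal" restricts concatenation to data variables that already contain the dimension. Notice below that the time dimension is no longer added to the dims of lat_bnds and lon_bnds when this kwarg is passed.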

>>> import os
>>> import xarray as xr
>>>
>>> dir = "/p/user_pub/PCMDIobs/PCMDIobs2/atmos/3hr/pr/TRMM-3B43v-7/gn/v20200707"
>>> file_pattern = "pr_3hr_TRMM-3B43v-7_BE_gn_v20200707_1998*.nc"
>>> files = os.path.join(dir, file_pattern)

>>> ds_xr = xr.open_mfdataset(files)
>>> ds_xr2 = xr.open_mfdataset(files, data_vars="minimal")

>>> print(ds_xr)
<xarray.Dataset>
Dimensions:    (time: 2920, bnds: 2, lat: 400, lon: 1440)
Coordinates:
  * time       (time) datetime64[ns] 1998-01-01 ... 1998-12-31T21:00:00
  * lat        (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
  * lon        (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(248, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(248, 400, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(248, 1440, 2), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(248, 400, 1440), meta=np.ndarray>

>>> print(ds_xr2)
<xarray.Dataset>
Dimensions:    (time: 2920, bnds: 2, lat: 400, lon: 1440)
Coordinates:
  * time       (time) datetime64[ns] 1998-01-01 ... 1998-12-31T21:00:00
  * lat        (lat) float64 -49.88 -49.62 -49.38 -49.12 ... 49.38 49.62 49.88
  * lon        (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(248, 2), meta=np.ndarray>
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(400, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(1440, 2), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(248, 400, 1440), meta=np.ndarray>
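
The same behavior can be reproduced without access to the TRMM files. The sketch below is an illustration only: it uses xr.concat(), which accepts the same data_vars option, on two small in-memory datasets. The make_file() helper and its values are made up for this example and are not part of xcdat.

```python
import numpy as np
import xarray as xr


def make_file(times):
    """Build a small in-memory stand-in for one file (hypothetical helper)."""
    lat = np.array([-45.0, 45.0])
    return xr.Dataset(
        data_vars={
            # `pr` has a time dimension, so it is always concatenated.
            "pr": (("time", "lat"), np.ones((len(times), lat.size))),
            # `lat_bnds` has no time dimension and is identical in every file.
            "lat_bnds": (("lat", "bnds"), np.array([[-90.0, 0.0], [0.0, 90.0]])),
        },
        coords={"time": times, "lat": lat},
    )


parts = [make_file([0, 1]), make_file([2, 3])]

# "all": every data variable is concatenated along time, including lat_bnds.
ds_all = xr.concat(parts, dim="time", data_vars="all")

# "minimal": only data variables that already have a time dimension are concatenated.
ds_min = xr.concat(parts, dim="time", data_vars="minimal")

print(ds_all["lat_bnds"].dims)
print(ds_min["lat_bnds"].dims)
```

With "all", lat_bnds picks up the time dimension; with "minimal", it keeps only (lat, bnds), matching the two printouts above.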

In #136, the time dimension was being added to lat_bnds and lon_bnds when opening multiple files.
As a result, the lat and lon dims were reduced away, leaving only the time dimension. This produced the error KeyError: 'Check weights DataArray includes lat dimension.'
https://github.com/XCDAT/xcdat/blob/ca9dee4390cad2ad9856f9959e75981575cc69d7/xcdat/spatial_avg.py#L592-L596
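
For readers who do not want to follow the link, here is a rough sketch of the kind of check that produces this error. It is not the code in spatial_avg.py: the function name check_weights_dims() and its signature are assumptions made for illustration, and only the error message is taken from the traceback above.

```python
import numpy as np
import xarray as xr


def check_weights_dims(weights, axis):
    """Hypothetical check: every averaging axis must be a dim of `weights`."""
    for key in axis:
        if key not in weights.dims:
            raise KeyError(f"Check weights DataArray includes {key} dimension.")


# Weights that kept their spatial dims pass the check.
good = xr.DataArray(np.ones((2, 3)), dims=("lat", "lon"))
check_weights_dims(good, ["lat", "lon"])

# Weights that were reduced down to only `time` fail it.
bad = xr.DataArray(np.ones(4), dims=("time",))
try:
    check_weights_dims(bad, ["lat", "lon"])
except KeyError as err:
    print(err)
```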

More info on xarray.open_mfdataset().

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@codecov-commenter

codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #143 (adb16c4) into main (68e8c00) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #143   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            7         7           
  Lines          345       346    +1     
=========================================
+ Hits           345       346    +1     
Impacted Files Coverage Δ
xcdat/dataset.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomvothecoder
Collaborator Author

Test script

# flake8:noqa F401

import os

import xarray as xr

import xcdat

dir = "/p/user_pub/PCMDIobs/PCMDIobs2/atmos/3hr/pr/TRMM-3B43v-7/gn/v20200707"
file_pattern = "pr_3hr_TRMM-3B43v-7_BE_gn_v20200707_1998*.nc"
files = os.path.join(dir, file_pattern)

#%%
# Test using `xr.open_mfdataset()` with default arguments
ds_xr1 = xr.open_mfdataset(files)
pr_global1 = ds_xr1.xcdat.spatial_avg(
    axis=["lat", "lon"], data_var="pr"
).pr  # raises error

#%%
# Test using `xr.open_mfdataset()` with `data_vars="minimal"`
ds_xr2 = xr.open_mfdataset(files, data_vars="minimal")
pr_global2 = ds_xr2.xcdat.spatial_avg(axis=["lat", "lon"], data_var="pr").pr  # no error

#%%
# Test using `xcdat.open_mfdataset()`, which now defaults to `data_vars="minimal"`
ds_xcdat = xcdat.open_mfdataset(files)
pr_global3 = ds_xcdat.xcdat.spatial_avg(axis=["lat", "lon"]).pr  # no error
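
The third case needs no extra keyword because the wrapper supplies it. A minimal sketch of that idea follows, assuming a thin wrapper around xr.open_mfdataset(). The name open_mfdataset_minimal() is hypothetical, and this is not the actual xcdat.open_mfdataset() implementation, which does more than this.

```python
import xarray as xr


def open_mfdataset_minimal(paths, **kwargs):
    """Hypothetical wrapper: default to data_vars="minimal" unless overridden."""
    # setdefault keeps any value the caller passed explicitly.
    kwargs.setdefault("data_vars", "minimal")
    return xr.open_mfdataset(paths, **kwargs)
```

A caller can still pass data_vars="all" to get the original xarray behavior.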

@tomvothecoder tomvothecoder force-pushed the bugfix/136-mfdataset-keyerror branch from 0706b47 to cf434ae Compare November 9, 2021 22:22
@tomvothecoder tomvothecoder added the Priority: High and type: bug labels Nov 9, 2021
@tomvothecoder tomvothecoder self-assigned this Nov 9, 2021
@tomvothecoder
Collaborator Author

Hey Steve, here's the related xarray issue with the data_vars="minimal" fix: pydata/xarray#438

@tomvothecoder tomvothecoder force-pushed the bugfix/136-mfdataset-keyerror branch from cf434ae to 11af299 Compare November 11, 2021 19:12
@tomvothecoder tomvothecoder force-pushed the bugfix/136-mfdataset-keyerror branch from 74b2a81 to 1d4bd98 Compare November 12, 2021 16:19
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Development

Successfully merging this pull request may close these issues.

KeyError in spatial_average when open multiple datasets
2 participants