Calling Dataset.mean() drops coordinates #3510

Closed
fergu opened this issue Nov 11, 2019 · 7 comments · Fixed by #4940
@fergu
fergu commented Nov 11, 2019

This is a similar issue to bug #1470.

MCVE Code Sample

import xarray as xr
import numpy as np

x = np.linspace(0,1,5)
y = np.linspace(-1,0,5)
t = np.linspace(0,10,10)

dataArray1 = xr.DataArray(np.random.random((5,5,10)),
                            dims=('x','y','t'),
                            coords={'x':x,'y':y,'t':t})

dataArray2 = xr.DataArray(np.random.random((5,5,10)),
                            dims=('x','y','t'),
                            coords={'x':x,'y':y,'t':t})

dataset = xr.Dataset({'a':dataArray1,'b':dataArray2})

datasetWithCoords = xr.Dataset({'a':dataArray1,'b':dataArray2},coords={'x':x,'y':y,'t':t})

print("datarray1:")
print(dataArray1)

print("dataArray1 after mean")
print(dataArray1.mean(axis=0))

print("dataset:")
print(dataset)

print("dataset after mean")
print(dataset.mean(axis=0))

print("dataset with coords:")
print(datasetWithCoords)

print("dataset with coords after mean")
print(datasetWithCoords.mean(axis=0))

Output (with extra stuff snipped for brevity):

datarray1:
<xarray.DataArray (x: 5, y: 5, t: 10)>
<array printout>
Coordinates:
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
  * y        (y) float64 -1.0 -0.75 -0.5 -0.25 0.0
  * t        (t) float64 0.0 1.111 2.222 3.333 4.444 ... 6.667 7.778 8.889 10.0
dataArray1 after mean
<xarray.DataArray (y: 5, t: 10)>
<array printout>
Coordinates:
  * y        (y) float64 -1.0 -0.75 -0.5 -0.25 0.0
  * t        (t) float64 0.0 1.111 2.222 3.333 4.444 ... 6.667 7.778 8.889 10.0
### Note that coordinates are kept after the mean operation when performed just on an array

dataset:
<xarray.Dataset>
Dimensions:  (t: 10, x: 5, y: 5)
Coordinates:
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
  * y        (y) float64 -1.0 -0.75 -0.5 -0.25 0.0
  * t        (t) float64 0.0 1.111 2.222 3.333 4.444 ... 6.667 7.778 8.889 10.0
Data variables:
    a        (x, y, t) float64 0.1664 0.8147 0.5346 ... 0.2241 0.9872 0.9351
    b        (x, y, t) float64 0.6135 0.2305 0.8146 ... 0.6323 0.5638 0.9762
dataset after mean
<xarray.Dataset>
Dimensions:  (t: 10, y: 5)
Dimensions without coordinates: t, y
Data variables:
    a        (y, t) float64 0.2006 0.6135 0.6345 0.2415 ... 0.3047 0.4983 0.4734
    b        (y, t) float64 0.3459 0.4361 0.7502 0.508 ... 0.6943 0.4702 0.4284
dataset with coords:
<xarray.Dataset>
Dimensions:  (t: 10, x: 5, y: 5)
Coordinates:
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
  * y        (y) float64 -1.0 -0.75 -0.5 -0.25 0.0
  * t        (t) float64 0.0 1.111 2.222 3.333 4.444 ... 6.667 7.778 8.889 10.0
Data variables:
    a        (x, y, t) float64 0.1664 0.8147 0.5346 ... 0.2241 0.9872 0.9351
    b        (x, y, t) float64 0.6135 0.2305 0.8146 ... 0.6323 0.5638 0.9762
dataset with coords after mean
<xarray.Dataset>
Dimensions:  (t: 10, y: 5)
Dimensions without coordinates: t, y
Data variables:
    a        (y, t) float64 0.2006 0.6135 0.6345 0.2415 ... 0.3047 0.4983 0.4734
    b        (y, t) float64 0.3459 0.4361 0.7502 0.508 ... 0.6943 0.4702 0.4284

It's also worth mentioning that the data arrays contained in the dataset lose their coordinates during this operation as well, i.e.:

>>> print(dataset.mean(axis=0)['a'])
<xarray.DataArray 'a' (y: 5, t: 10)>
array([[0.4974686 , 0.44360968, 0.62252578, 0.56453058, 0.45996295,
        0.51323367, 0.54304355, 0.64448021, 0.50438884, 0.37762424],
       [0.43043363, 0.47008095, 0.23738985, 0.58194424, 0.50207939,
        0.45236528, 0.45457519, 0.67353014, 0.54388373, 0.52579016],
       [0.42944067, 0.51871646, 0.28812999, 0.53518657, 0.57115733,
        0.62391936, 0.40276949, 0.2385865 , 0.6050159 , 0.56724394],
       [0.43676851, 0.43539912, 0.30910443, 0.45708179, 0.44772562,
        0.58081722, 0.3608285 , 0.69107338, 0.37702932, 0.34231931],
       [0.56137156, 0.62710607, 0.77171961, 0.58043904, 0.80014925,
        0.67720374, 0.73277691, 0.85934107, 0.53542093, 0.3573311 ]])
Dimensions without coordinates: y, t

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 11, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.2

xarray: 0.14.0
pandas: 0.25.2
numpy: 1.17.3
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.7.0
distributed: None
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 40.8.0
pip: 19.3
conda: None
pytest: None
IPython: 7.8.0
sphinx: None
None

@max-sixty
Collaborator

max-sixty commented Nov 11, 2019

Thanks for the issue and clear example, @kjfergu. Agree that's not robust.

Is this specific case fixed by passing dim= rather than axis=?

dim gives you all the advantages of xarray. Is there a reason the example passes axis?

That said, we should be consistent, or raise on passing axis...

@fergu
Author

fergu commented Nov 11, 2019

@max-sixty I can confirm that using dim='x' (for example) does work as expected.

Honestly, I just used axis because that's the form I'm used to passing to the various numpy operations that act on an array. I admittedly did not look closely at the docs for that particular case and just assumed that, because axis worked, I was doing it correctly.

Interestingly, as a side note, I did initially try axis='x' (which didn't work) before submitting this bug. So - maybe it was an education thing on my part. The fact that dim was a possible argument didn't occur to me, but that I could maybe use labeled coordinates did.
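
For reference, a minimal sketch of the dim= form confirmed above. It reuses the coordinate values from the MCVE; the dataset is rebuilt from tuples here purely for brevity, which is an assumption of this sketch rather than part of the original report.

```python
import numpy as np
import xarray as xr

x = np.linspace(0, 1, 5)
y = np.linspace(-1, 0, 5)
t = np.linspace(0, 10, 10)

dataset = xr.Dataset(
    {
        "a": (("x", "y", "t"), np.random.random((5, 5, 10))),
        "b": (("x", "y", "t"), np.random.random((5, 5, 10))),
    },
    coords={"x": x, "y": y, "t": t},
)

# Reduce over the named dimension instead of a positional axis.
reduced = dataset.mean(dim="x")

# The remaining dimensions (y and t) keep their coordinate labels.
print(reduced.coords)
```

Only the reduced dimension's coordinate (x) is dropped; y and t stay attached to both data variables.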

@max-sixty
Collaborator

Thanks @kjfergu

I think that's a fairly common 'misuse'. Solving the coords problem here doesn't solve the larger problem of an easier learning curve where xarray is helpful quickly—which likely involves using dim rather than axis.

We could raise on axis but potentially that's too restrictive: any thoughts from anyone? What are good reasons to use axis?

@dcherian
Contributor

> We could raise on axis but potentially that's too restrictive: any thoughts from anyone? What are good reasons to use axis?

I don't know of any. I didn't know it was possible!

But since we don't guarantee that all DataArrays in a Dataset have the same order-of-dimensions, this is just asking for trouble.
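
To illustrate the dimension-order point, here is a small sketch with made-up data in which two variables share the same dimensions but store them in opposite orders. The positional axis is applied per DataArray here, so the example does not depend on how Dataset itself handles axis.

```python
import numpy as np
import xarray as xr

values = np.arange(6.0).reshape(2, 3)

# Two variables over the same dimensions, stored in opposite orders.
ds = xr.Dataset(
    {
        "a": (("x", "y"), values),
        "b": (("y", "x"), values.T),
    }
)

# Positional axis 0 refers to a different dimension in each variable.
a_axis0 = ds["a"].mean(axis=0)  # reduces over "x"
b_axis0 = ds["b"].mean(axis=0)  # reduces over "y"
print(a_axis0.dims, b_axis0.dims)

# A named dimension is unambiguous regardless of storage order.
by_name = ds.mean(dim="x")
print(by_name["a"].dims, by_name["b"].dims)
```

With axis=0 the two variables end up reduced over different dimensions, while dim="x" reduces both over the same one.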

@fergu
Author

fergu commented Nov 12, 2019

What about raising a warning when axis is used? Something like "Using the axis keyword results in lost information about coordinates, etc. Use the dim keyword to suppress this warning"?
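
A rough sketch of what such a warning could look like from user code. The helper name mean_with_axis_warning is hypothetical and not part of xarray; it wraps a single DataArray, where a positional axis maps to exactly one dimension name.

```python
import warnings

import numpy as np
import xarray as xr


def mean_with_axis_warning(data_array, dim=None, axis=None):
    """Hypothetical helper: warn on axis= and translate it to dim=."""
    if axis is not None:
        warnings.warn(
            "Using the axis keyword results in lost information about "
            "coordinates, etc. Use the dim keyword to suppress this warning.",
            UserWarning,
            stacklevel=2,
        )
        # For a single DataArray the positional axis maps to one name.
        dim = data_array.dims[axis]
    return data_array.mean(dim=dim)


da = xr.DataArray(
    np.random.random((5, 10)),
    dims=("y", "t"),
    coords={"y": np.linspace(-1, 0, 5), "t": np.linspace(0, 10, 10)},
)

# Record warnings so the message can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = mean_with_axis_warning(da, axis=0)

print(result.dims)
print([str(w.message) for w in caught])
```

Translating axis to a name only works for one array at a time; for a whole Dataset the mapping can differ between variables.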

@TomNicholas
Member

Can someone explain why we even allow axis as a valid argument? I can't think of any good reasons why we should keep it, and it's inconsistent with the API in most places I think.

@dcherian
Contributor

dcherian commented Apr 9, 2020

I agree with removing. We could raise an error saying "please use 'dim' instead".
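
A minimal sketch of the "raise and point to dim" behaviour, written as a user-level wrapper. The function dataset_mean and its error message are hypothetical, and this is not necessarily how the linked fix is implemented.

```python
import numpy as np
import xarray as xr


def dataset_mean(dataset, dim=None, axis=None):
    """Hypothetical wrapper that rejects axis= for Dataset reductions."""
    if axis is not None:
        raise ValueError(
            "passing 'axis' is not supported; please use 'dim' instead"
        )
    return dataset.mean(dim=dim)


ds = xr.Dataset(
    {"a": (("x", "y"), np.ones((2, 3)))},
    coords={"y": [10, 20, 30]},
)

# Passing axis= fails loudly instead of silently dropping coordinates.
try:
    dataset_mean(ds, axis=0)
except ValueError as err:
    message = str(err)
    print(message)

# Passing dim= reduces by name and keeps the remaining coordinates.
reduced = dataset_mean(ds, dim="x")
print(reduced["a"].dims)
```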
