Skip to content

Iteration of sea-ice observational time series appears incorrect #69

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
milenaveneziani opened this issue Jan 6, 2017 · 43 comments
Closed
Assignees

Comments

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Jan 6, 2017

When plotting sea-ice aggregate area and volume time series, we also plot the climatological cycle of observational values. This cycle is repeated to cover the full extent of the model time series.
Something must be slightly off when setting the repeated cycle for the observational data, as the model and obs time series appear shifted by quite a bit after many cycles. For example, here is a plot of SH sea-ice area after about 200 years of one of the ACME beta0 simulations:

iceareacellnh_20161117 beta0 a_wcycl1850s ne30_oec_icg edison

@xylar
Copy link
Collaborator

xylar commented Jan 6, 2017

@milenaveneziani, this bug was probably introduced by fe893b6. I'll put it on my list to look into in a couple of weeks when I have time. If it's urgent, take a look at what that commit did and see if you can figure out how I might have got the length of the repeated cycle wrong.

@milenaveneziani
Copy link
Collaborator Author

@xylar: as an FYI, I am working on trying to understand this issue. I would like to include it into a more general enhancement of sea_ice/timeseries.py, where I want to split long time series in two separate panels.
What do you think? (pinging @pwolfram as well)

@pwolfram
Copy link
Contributor

@milenaveneziani, I would prefer we have a bug-fix PR and then an enhancement PR. This helps keep things clean and easier to review. Is there an advantage to having a single PR that I'm missing?

@milenaveneziani
Copy link
Collaborator Author

no, no worries, I am fine with 2 PR's. I was just being lazy :)

@xylar
Copy link
Collaborator

xylar commented Jan 20, 2017

Sorry, this has been on my list for the last week but I haven't got to it yet. I can try to fix it tonight.

I suspect the problem might be that shiftT designed to actually be slightly longer than the period of the cycle:
fe893b6#diff-8b7c98f5c257dcda60ae69424ef73bf2L308
This may have been right for the old way of shifting the cycle (though I don't see why off the top of my head) but it's probably wrong for the new way. Anyway, the first thing I would try is removing the:

(dsshift.Time.isel(Time=1) - dsshift.Time.isel(Time=0))

@milenaveneziani
Copy link
Collaborator Author

ok, I will try that.

One other solution that I was thinking to try is to not shift the dsshift Time at all, but simply equal it to ds.Time. I tried it and the two Times are obviously the same, but I am getting a different error at plotting time, which I was about to investigate.

@xylar
Copy link
Collaborator

xylar commented Jan 20, 2017

If you're working on it, I'll leave it to you. If you don't come up with a fix, let me know and I'll give it another go.

@milenaveneziani
Copy link
Collaborator Author

OK, I have not been very successful, but at least now I see where the problem arises.
We define startIndex from this quantity: ds.Time.min()-ds_toreplicate.Time.min(). When printing the two terms I get:
ds.Time.min= 2050-01-16T13:19:41.133264000-0800
ds_toreplicate.Time.min= 1851-01-15T00:00:00.000000000Z
From this, I would expect that their difference is 199 years and 1 day, but when you print the python result you actually get 6280060781133264000 nanoseconds, which converted to years/days is 199 years and ~51 days. Why do you think there is this discrepancy?
51 days is similar to the shift I see in the plots, so I expect that if we remove this shift we should be able to solve the problem.

@milenaveneziani
Copy link
Collaborator Author

Also, the branch I am using to work on this is fixReplicateCycle, in case it helps.

@xylar
Copy link
Collaborator

xylar commented Jan 21, 2017

@milenaveneziani, I'm using a beta0 run (20161117.beta0.A_WCYCL1850.ne30_oEC.edison) with:

yr_offset = 1849
# start and end years for timeseries analysis
timeseries_yr1 = 41
timeseries_yr2 = 50

and I'm not seeing this problem. When i print the dates, I see:

ds.Time.min = 1890-01-14T04:58:13.242529000Z
ds_toreplicate.Time.min = 1850-01-15T00:00:00.000000000Z
ds.Time.min - ds_toreplicate.Time.min = 40.0252249252 # years

First, it's weird that I see ds_toreplicte starting a year earlier than you do, but maybe your yr_offset is 1850?
Second, I'm seeing times with a common formatting (ending in Z) whereas you see two different time formats, where your data in ds.Time appears to have a time zone of -0800 (Pacific time?). I'm not sure how this is happening but I suspect it is somehow related to the incorrect time difference.

To debug further, we would need to track through mpas_xarray to see what time format it thinks it's getting and why it's doing the time conversion incorrectly.

Could you send me a gist (or whatever link is convenient) containing your config file? That would let me debug with exactly the same run as you.

Example images from my run:
iceareacellnh_20161117 beta0 a_wcycl1850 ne30_oec edison_b1850c5_ne30_v0 4
icevolumecellnh_20161117 beta0 a_wcycl1850 ne30_oec edison_b1850c5_ne30_v0 4

@milenaveneziani
Copy link
Collaborator Author

@xylar, first of all, here is my config file:
https://gist.github.com/milenaveneziani/7504699434c352a99b63bfacd4d1b3cb
As you can see, it now uses year_offset=1849, but indeed I was using 1850 before, so that is explained.
I have also tried the same years as yours (41 and 50) and I too get both times ending in Z:
ds.Time.min = 1890-01-14T04:58:13.242529000Z
ds_toreplicate.Time.min = 1850-01-15T00:00:00.000000000Z
ds.Time.min - ds_toreplicate.Time.min = 1262235493242529000 nanoseconds
So it seems that we get different formats depending on what year we use, which is really strange.. (before I was using 200 and 203).
One final comment is that, even though the formats are the same for your choice of years, the difference is still 40 years and 9 days, which is puzzling.
Is there something strange happening with the xarray dates?

and PS: how do you tell the script to print times in days rather than nanoseconds?
thanks.

@xylar
Copy link
Collaborator

xylar commented Jan 21, 2017 via email

@milenaveneziani
Copy link
Collaborator Author

ah, that's a good point about leap years! Model years do not have leap years. Maybe python/xarray default calendar is gregorian? There should be a way to define the calendar, right?

and thanks about the tip to print days (I actually meant 'years'): I was missing the part to convert to float so I was just getting the integer year value.

@xylar
Copy link
Collaborator

xylar commented Jan 22, 2017

@milenaveneziani, looking into this further, it does appear that xarray currently only supports the standard Gregorian calendar (with leap years):
pydata/xarray#1084
I didn't read this issue in detail yet, but it may be possible (and even desirable) to force xarray to use netcdftime objects rather than numpy.datetime64. This might have some unintended consequences, though.

An alternative would be to convert to netcdftime or a standard python datetime on the 365-day calendar only in replicate_cycle, since that seems to be the only place we're running into this problem so far.

Anyway, this doesn't seem like it'll be a super simple fix...

@xylar
Copy link
Collaborator

xylar commented Jan 22, 2017

Reading a bit more, I don't think forcing xarray to read times as netcdftime objects will fix our problems. Instead, I think the solution here is likely to be to convert locally in replicate_cycle using netCDF4.date2num (http://unidata.github.io/netcdf4-python/#netCDF4.date2num) and then to use those numbers to do our math.

I'll give this a try tomorrow.

@pwolfram
Copy link
Contributor

@xylar, can you please clarify your understanding that xarray only supports Gregorian (leap) calendars? I took a quick look at pydata/xarray#1084, but based on my quick read of http://xarray.pydata.org/en/stable/io.html?highlight=gregorian#time-units it appears that other calendars should be supported.

@shoyer, is it true xarray only supports Gregorian (leap) calendars currently? If this is the case we need to edit the docs for xarray so that this is clearer. It appears this is the case because we are currently locked into datetime64[ns] (pydata/xarray#789) but I just wanted to double-check with you both. I'm hoping there is an optional flag we have missed or a lack of proper time conversion in our preprocessing step but am suspicious this is pointing again to the structural datetime issue we are working to surmount in xarray, e.g., in pydata/xarray#1084.

Any recommendation or insights you can provide @shoyer on non-Gregorian calendar options in xarray would be really helpful.

@shoyer
Copy link

shoyer commented Jan 23, 2017

I have not used non-Gregorian calendars personally, but I'm pretty sure they should work at least in a minimal fashion (e.g., for loading data, but not for dedicated datetime routines). Certainly @jhamman put in a lot of work to ensure that we fall back to loading dates at netcdftime for non-standard calendars. The result should be an array with dtype=object filled with netcdftime.datetime instances, rather than dtype=datetime64[ns].

@xylar
Copy link
Collaborator

xylar commented Jan 23, 2017

@pwolfram, I could be wrong but I think a lot of things we do with xarray won't work correctly with netcdftime.datetime times instead of numpy.datetime64 times. We do a lot of math in various places with times.

In any case, our times are being read presently as if they are Gregorian even though they are, in fact, on a 365-day calendar, presumably because MPAS output is not CF compliant and no calendar is specified.

@shoyer
Copy link

shoyer commented Jan 23, 2017

Someone recently added arithmetic support to netcdftime.datetime, so that may work better now.

@xylar
Copy link
Collaborator

xylar commented Jan 23, 2017

@shoyer, great to know. I just started looking into netcdftime.datetime over the weekend so I'm not fully up to date on what it can do.

@milenaveneziani and @pwolfram, it seems like we should make a new issue for looking into switching to netcdftime.datetime instead of numpy.datetime64. Presumably this will involve some changes in mpas_xarray and perhaps elsewhere. This issue would then be a subset of that larger issue.

@milenaveneziani
Copy link
Collaborator Author

@xylar, @pwolfram: I am looking through all of our routines where we use/process times, and I don't see a place when we actually make use of numpy.datetime. We use datetime on its own a lot, and panda.to_timedelta, and I think that's it.
Now, having said that, if I look through the datetime doc pages (https://docs.python.org/2/library/datetime.html) I do seem to understand that the only supported calendar is gregorian. So, is this the problem we are having, that 'datetime' does not support anything other than gregorian?
Not that it changes anything, since I don't see a choose-calendar option in numpy.datetime64 either, but just to understand..

@xylar
Copy link
Collaborator

xylar commented Jan 24, 2017

@milenaveneziani, here is one place we handle Times from xarray datasets assuming they might be numpy.datetime64 objects:
https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/mpas_analysis/shared/mpas_xarray/mpas_xarray.py#L183

You are right that we often convert to datetime to do our math, and that is almost certainly not what we want to be doing.

In the present case (replicate_cycle) we are doing math directly on Time from an xarray dataset:
https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/mpas_analysis/sea_ice/timeseries.py#L328
which is going to be of type numpy.datetime64 (since this is what pandas builds on top of and what xarray has primarily supported up until recently).

@pwolfram
Copy link
Contributor

@xylar, since we are basically just extending a yearly climatology could we try a different approach for replicate_cycle that would "inject" monthly results for the climatology into new years? I think this would help avoid the issue with leap-years and datetime library choices and it would be conceptually cleaner anyway because the monthly climatologies should be more or less insensitive to the leap-year issue. I'm happy to help on this if you think this is a reasonable idea.

@milenaveneziani
Copy link
Collaborator Author

@pwolfram: I worry that, if we do not clearly specify the calendar we use in the model, we may run into other similar issues in the future.
Still, it is not clear to me how much of a change we'd have to introduce if we went with netcdf.datetime. I could give it a try..

@xylar
Copy link
Collaborator

xylar commented Jan 25, 2017

@pwolfram, I agree that we could probably come up with an alternative way of handling replicate_cycle, though it would not be as general. That being said, I agree with @milenaveneziani that this would be a workaround to fix only this problem and we would do better to address the overarching problem of having the wrong calendar.

Part of the reason for the problem is that xarray defaults to the Gregorian calendar if no calendar is specified in the NetCDF files (in a CF-compliant way). Since MPAS files are not CF-compliant, at least not in this way, it has no way of knowing that we want a 365-day calendar. I haven't (so far) found a way of explicitly overriding the calendar either, which might leave us with the option of handling this ourselves in mpas_xarray.

I think maybe this could be handled here:
https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/mpas_analysis/shared/mpas_xarray/mpas_xarray.py#L144

I'll try this out on a branch and let you know how it goes.

@xylar
Copy link
Collaborator

xylar commented Jan 25, 2017

My latest findings from this evening's research:

  • netcdftime currently has no documentation I could find, so I'm left reading the docstrings directly from the file.
  • It seems like the version netcdftime in newer versions of netcdf4 or in the standalone repository does, indeed, support various math operations and might be what we need.
  • These versions are not yet supported by conda (or have dependency conflicts with other libraries such as cartopy).

So I will keep working toward a test branch to see if this will be a workable solution for us. It seems very promising. But I think, at least for a little while, it may be a big pain because we won't be able to get the library versions we need with conda.

@milenaveneziani
Copy link
Collaborator Author

@xylar: I guess I was looking at something different, ie the num2date and date2num functions within netCDF4 (http://unidata.github.io/netcdf4-python/#section7).
I wondered whether we could use datetime to get the correct format for date2num, and then date2num with the correct calendar to get the correct time in days.
Is this too simplistic?

@xylar
Copy link
Collaborator

xylar commented Jan 25, 2017

@milenaveneziani, for purposes of solving this specific issue, I agree that netcdf4.date2num could be a reasonable solution. But it's a really cumbersome thing to have to do in every case where we want to work with dates. What is more, things like time averages within xarray may not act as desired because of leap years. I'm somewhat more worried about the problems we don't yet realize we have.

So I think switching over to netcdftime.datetime would be preferable as an overall solution to our calendar and year restrictions.

@pwolfram
Copy link
Contributor

Based on taking another look at this thread @milenaveneziani and @xylar, I think the best solution is to use netcdftime.datetime and adapt xarray if needed, potentially working with @jhamman if goals are coincident. I've monkey-patched the date issue in the past and agree this is not a long term solution. However, we need to be absolutely sure that this change doesn't break existing or future xarray functionality that we will need. In the worst case, we make the change, everything works, and something subtle changes in xarray that breaks our code unexpectedly following an xarray version change. This would be really bad and implies that we should integrate a dataset with alternative calendar choices within xarray's testing framework to make sure that this change doesn't artificially break key xarray functionality. If it does, I would argue that this would need fixed prior to widespread use of this method to keep our code robust for the present and future.

cc @shoyer for additional comment if needed.

@shoyer
Copy link

shoyer commented Jan 26, 2017

I don't know if this helps much, but it's pretty straightforward to inject attributes (like calendar) into xarray variables between decoding into datetime objects. This could ensure you end up with netcdftime objects, e.g.,

ds = xr.open_dataset(path, decode_times=False)
ds['time'].attrs['calendar'] = '365_day'
ds2 = xr.decode_cf(ds)
# ds2['time'] is guaranteed to consistent of netcdftime objects

@xylar
Copy link
Collaborator

xylar commented Jan 26, 2017

@shoyer, that's very helpful, indeed!

@pwolfram, I think this means that all our existing calls to xr.open_mfdataset would need to pass on decode_times=False. Then, we would construct a Time array similarly to how we do it now in mpas_xarray. But where we currently make datetime.datetime objects, we would presumably insead be making date/time strings that xarray can decode for us. We would also add the calendar attribute (365_day) to the Time array before make the call to xr.decode_cf.

@pwolfram
Copy link
Contributor

Agreed @xylar, we would just need to set this up for xarray to decode. I still think it may be useful to add some testing code into xarray to make sure that the functionality we are leveraging remains unbroken for future versions.

@xylar and @milenaveneziani, if you can point me to an example I can use to reproduce the error in this issue I can probably make the needed changes for the fix.

@milenaveneziani
Copy link
Collaborator Author

@pwolfram, you can use any run that is long enough to show the shift clearly. The beta0 simulation is more than 250 years long, so you could choose that and timeseries_y1=200, timeseries_yr2=205.

@xylar
Copy link
Collaborator

xylar commented Jan 26, 2017

@pwolfram, I'm working on this in a branch, so far without luck. Here are my findings today:

  • I checked out the current master branches for netcdftime and xarray to make sure all functionality is as cutting edge as possible.
  • Even with these versions, I couldn't get the Time array to behave as I wanted.
  • I was able to construct an array of cf-compliant Time values with units attribute of either days since 1850-01-01 or seconds since 1850-01-01 and with calendar attribute 365_day. When I run xr.decode_cf, these are converted to numpy.datetime64, not netcdftime.datetime as expected.
  • I was also able to construct an array of cf-compliant Time values with units attribute of either days since 0001-01-01 or seconds since 0001-01-01 and with calendar attribute 365_day. When I run xr.decode_cf, these are converted to netcdftime.datetime as expected. But netcdftime.datetime does not support addition and subtraction, so a new method will be needed to find the midpoint between xtime_start and xtime_end.

I'll keep working on this, but it's not looking terribly easy so far.

@xylar
Copy link
Collaborator

xylar commented Jan 26, 2017

@pwolfram, are you interested in fixing replicate_cycle or the broader problem? If just replicate_cycle, go for it. If the broader problem, maybe let me tinker a bit so we don't replicate efforts.

@xylar
Copy link
Collaborator

xylar commented Jan 26, 2017

Looking at the netcdftime code, it seems like math operations should work as we want them to, at least with the latest version. I'll see if I can get them to work as I expect outside of xarray. Then, maybe we can figure out whether they will work for us in xarray.

@pwolfram
Copy link
Contributor

@xylar, I'm assuming were you able to get netcdftime math operations to work as expected, correct?

@pwolfram
Copy link
Contributor

pwolfram commented Jan 27, 2017

Note, this obviously implies we should be testing some dataset math following our custom mpas_xarray preprocessor in the testing framework so that as we make changes in the future simple arithmetic operations are as we would expect.

@xylar
Copy link
Collaborator

xylar commented Jan 28, 2017

@pwolfram asked:

I'm assuming were you able to get netcdftime math operations to work as expected, correct?

I will comment more in #82. The basic answer is that I was not able to get the expected types from xarray. When I supply a calendar manually as an attribute, I still get either a numpy.datetime64 or a netceftime.datetime object, neither of which correspond with the 365_day calendar I'm asking for. So it seems like there's a problem with how xarray.decode_cf is handling the calendar attribute (or I made a mistake). I would welcome having you set up a simple example and seeing if you have better luck.

@xylar
Copy link
Collaborator

xylar commented Jan 28, 2017

Note, this obviously implies we should be testing some dataset math following our custom mpas_xarray preprocessor in the testing framework so that as we make changes in the future simple arithmetic operations are as we would expect.

Yes, I agree. I will make this a priority for inclusion in #82.

@jhamman
Copy link

jhamman commented Jan 28, 2017

Let me jump in here and point out a few things...

  1. Back in 2014, we made a decision to return numpy datetime objects for the noleap calendar so that we could get the groupby/resample functionality to work on the time indexes. Basically, every date in the noleap calendar is valid in the standard calendar so we can safely cast netcdftime.datetime objects to datetime64 objects. The math on these indexes that you are trying to perform will certainly not work since pandas/numpy/xarray basically ignores the fact that these timestamps are on a different calendar.
  2. We're working to correct this in both xarray and netcdftime. @spencerkclark is working on NetCDFTimeIndex that will start to allow for some of these sort of operations. More importantly, it seems this is the only short-medium term solution to properly handle dates for non-standard calendars in xarray/pandas. Numpy is also (maybe) moving towards a broader solution/refactor of the datetime64 object but we're not really holding our breath waiting for that to be implemented.
  3. Assuming you probably want a shorter term work-around for the problem you've run into, I would think it makes sense to force xarray to let you stick with netcdftime. One way to do this in the very short term is:
In [28]: import xarray as xr

In [29]: import netCDF4 as nc4

In [30]: ds = xr.tutorial.load_dataset('rasm')  # this dataset uses a noleap calenar

In [31]: ds.time.dtype
Out[31]: dtype('<M8[ns]')

In [32]: ds = xr.tutorial.load_dataset('rasm', decode_cf=False)

In [33]: ds.time.dtype
Out[33]: dtype('float64')

In [34]: ds['time'] = nc4.num2date(ds.time, ds.time.attrs['units'], ds.time.attrs['calendar'])

In [35]: ds.time.dtype
Out[35]: dtype('O')

In [36]: ds.time
Out[36]: 
<xarray.DataArray 'time' (time: 36)>
array([netcdftime._netcdftime.DatetimeNoLeap(1980, 9, 16, 12, 0, 0, 0, 0, 259),
       netcdftime._netcdftime.DatetimeNoLeap(1980, 10, 17, 0, 0, 0, 0, 3, 290),
...

By doing this, you'll loose some of the nice parts of using the numpy datetimes but your math will work.

Hopefully, some of this is useful to you all. We keep seeing issues related to calendars pop up and we're working to fix those. Of course, if anyone is keen to help work on the larger problems, there's plenty to do!

@xylar
Copy link
Collaborator

xylar commented Jan 28, 2017

Thanks @jhamman! This is very helpful. I now understand why what I tried didn't work as I expected.

Your suggested solution is essentially what I have implemented. We have the additional complication that we sometimes (typically, actually) have a date as a string, not as a float64. It is easy for us to parse out years, months, days, hours, minutes, seconds. But it doesn't seem like netcdftime has an easy way to construct the correct netcdftime.datetime subclass for a given calendar from this information. I had manually created a dictionary to map from calendar name to the appropriate class and then to construct the corresponding object:

    NetCDFDatetimes = {
        'standard': datetime.datetime,
        'gregorian': datetime.datetime,
        'proleptic_gregorian': netcdftime.DatetimeProlepticGregorian,
        'noleap': netcdftime.DatetimeNoLeap,
        'julian': netcdftime.DatetimeJulian,
        'all_leap': netcdftime.DatetimeAllLeap,
        '365_day': netcdftime.DatetimeNoLeap,
        '366_day': netcdftime.DatetimeAllLeap,
        '360_day': netcdftime.Datetime360Day}

    NetCDFDatetime = NetCDFDatetimes[calendar]
   ...
   date = NetCDFDatetime(year=year,  month=month, day=day, hour=hour, minute=minute, second=second)

Is there a cleverer way to do this?

The only option I've come up with is making a datetime.datetime and then doing:

date = datetime.datetime(year=year,  month=month, day=day, hour=hour, minute=minute, second=second)
nc4.num2date(nc4.date2num(date, units, calendar))

My main beef with that approach is the unnecessary rounding and extra math that has to happen, and also that it involves datetime.datetime in a somewhat confusing way.

cc @pwolfram

@xylar xylar self-assigned this Feb 10, 2017
@milenaveneziani
Copy link
Collaborator Author

Issue addressed by #111. Problem will completely go away when we adopt the newer MPAS-Seaice timeSeriesStats am.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants