This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223

Closed
12 tasks done
Tracked by #213
JackKelly opened this issue Oct 12, 2021 · 3 comments · Fixed by #220, #256 or #274
Assignees
Labels
data New data source or feature; or modification of existing data source enhancement New feature or request

Comments

@JackKelly
Member

JackKelly commented Oct 12, 2021

Detailed Description

At present, when using GSP-region PV data, nowcasting_dataset produces ML training examples with t0 datetimes at 0 and 30 minutes past the hour.

We should experiment with enabling nowcasting_dataset to set t0 datetimes to 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 minutes past the hour. This would increase the number of training examples by 6x (which might be quite a big deal!). Also, when we run a live service, we will probably want to be able to update our PV forecasts every 5 minutes (whenever we get new satellite data & PVOutput.org data).

Possible Implementation

Self-attention models don't need different modalities to be perfectly aligned in space and time. So we could produce examples where t0 falls on any 5-minute increment, and where there are always, say, exactly 2 historical timesteps and 4 forecast timesteps of GSP PV data. For example, an example with t0 = 12:15, history_minutes = 60, and forecast_minutes = 120 would look something like this:

  • satellite data at 5-minute steps from 11:15 to 14:15
  • GSP PV data:
    • history: 11:30, 12:00
    • forecast: 12:30, 13:00, 13:30, and 14:00
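The example above can be reproduced with a few lines of pandas. This is only an illustrative sketch: the variable names are made up for this comment, and it assumes GSP PV data sits on a fixed 30-minute grid and satellite data on a 5-minute grid.

```python
import pandas as pd

# Values from the example above (the date is arbitrary).
t0 = pd.Timestamp("2021-01-01 12:15")
history = pd.Timedelta(minutes=60)
forecast = pd.Timedelta(minutes=120)
start, end = t0 - history, t0 + forecast

# Satellite data: every 5 minutes from start to end, inclusive.
satellite = pd.date_range(start, end, freq="5min")

# GSP PV data: only the half-hourly timestamps that fall inside [start, end].
gsp = pd.date_range(start.ceil("30min"), end.floor("30min"), freq="30min")
gsp_history = gsp[gsp <= t0]
gsp_forecast = gsp[gsp > t0]

print(list(gsp_history.strftime("%H:%M")))   # ['11:30', '12:00']
print(list(gsp_forecast.strftime("%H:%M")))  # ['12:30', '13:00', '13:30', '14:00']
```

Note that t0 itself (12:15) is not a GSP timestamp, which is the whole point: the modalities no longer have to line up.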

I think DataSource.get_example should work as required already, although it'll need testing.

The trickier bit is changing DataModule._get_t0_datetimes() to be independent of the sample period of the various DataSources. I think the way forwards here is to have each DataSource emit a list of periods for which it has contiguous data. In particular, emit a pd.DataFrame with two columns: start_dt and end_dt; where each row represents a contiguous period of data. But, before implementing our own Period class with a Period.intersection method, we should check if pd.Period can handle arbitrary periods and/or if we can re-use pd.PeriodIndex.intersection().
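To make the "list of contiguous periods" idea concrete, here is a rough sketch of how a DatetimeIndex could be turned into a start_dt / end_dt DataFrame. The function name contiguous_periods and the max_gap parameter are hypothetical, chosen for illustration only; this is not the code in the repository.

```python
import pandas as pd


def contiguous_periods(datetimes: pd.DatetimeIndex, max_gap: pd.Timedelta) -> pd.DataFrame:
    """Group timestamps into contiguous periods (hypothetical helper).

    A new period starts wherever the gap to the previous timestamp
    is larger than max_gap.
    """
    s = pd.Series(datetimes.sort_values())
    # The first diff is NaT, and NaT > max_gap is False, so the first
    # timestamp always belongs to period 0.
    starts_new_period = s.diff() > max_gap
    period_id = starts_new_period.cumsum()
    grouped = s.groupby(period_id)
    return pd.DataFrame(
        {"start_dt": grouped.min(), "end_dt": grouped.max()}
    ).reset_index(drop=True)


datetimes = pd.DatetimeIndex(
    [
        "2021-01-01 12:00",
        "2021-01-01 12:05",
        "2021-01-01 12:10",
        # 20-minute gap here: the next timestamp starts a new period.
        "2021-01-01 12:30",
        "2021-01-01 12:35",
    ]
)
periods = contiguous_periods(datetimes, max_gap=pd.Timedelta(minutes=5))
print(periods)
```

With a 5-minute max_gap this gives two rows: 12:00 to 12:10, and 12:30 to 12:35.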

This should also enable the implementation of #135

This might be important for WP1. I'll add it to the WP1 project for now, and we'll see how ML training goes with half-hour data.

Sub-tasks

  • Override DataSource.get_contiguous_time_periods() in SatelliteDataSource (?) to remove nighttime. UPDATE: This is already implemented by SatelliteDataSource.datetime_index()!

Done in PR #220:

  • Split the modified README into a separate PR.
  • Implement a nd_time.get_contiguous_time_periods() -> pd.DataFrame
  • Write test for nd_time.get_contiguous_time_periods()
  • Add more test cases for the intersection function, where the two periods are the same

Done in PR #256

  • Implement a DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods. Use nd_time.get_contiguous_time_periods().
  • Write test(s) for DataSource.get_contiguous_time_periods()

PR #274

  • Implement a DataSource.get_contigous_t0_time_periods() -> pd.DataFrame which goes through each period and chops off history_duration from the beginning of the period, and chops off forecast_duration from the end of the period.
  • Enable NowcastingDataModule to compute the intersection of all the lists of t0 time periods from each DataSource. Use nd_time.intersection_of_2_dataframes_of_periods().
  • Compute t0 datetimes across all those periods (using a user-specified frequency, e.g. '5 minutes').
  • As before, split those t0 datetimes into train, valid, test
  • Remove nd_time.get_start_datetimes(), nd_time.intersection_of_datetimeindexes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.
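
The "chop off history and forecast, then compute t0 datetimes" steps listed above could look roughly like this. It is a sketch under assumptions: the helper names t0_periods and t0_datetimes are invented here, periods are treated as closed intervals, and periods that are too short to hold a full example are simply dropped.

```python
import pandas as pd


def t0_periods(periods: pd.DataFrame, history: pd.Timedelta, forecast: pd.Timedelta) -> pd.DataFrame:
    """Shrink each period so that every t0 inside it has full history and forecast."""
    out = pd.DataFrame(
        {
            "start_dt": periods["start_dt"] + history,
            "end_dt": periods["end_dt"] - forecast,
        }
    )
    # Drop periods too short to contain even one valid t0.
    return out[out["start_dt"] <= out["end_dt"]].reset_index(drop=True)


def t0_datetimes(periods: pd.DataFrame, freq: str) -> pd.DatetimeIndex:
    """Enumerate t0 datetimes at the given frequency across all periods."""
    stamps = [
        ts
        for p in periods.itertuples()
        # ceil() aligns the first t0 to the requested frequency grid.
        for ts in pd.date_range(p.start_dt.ceil(freq), p.end_dt, freq=freq)
    ]
    return pd.DatetimeIndex(stamps)


data_periods = pd.DataFrame(
    {
        "start_dt": pd.to_datetime(["2021-01-01 09:00", "2021-01-01 16:00"]),
        "end_dt": pd.to_datetime(["2021-01-01 15:00", "2021-01-01 17:00"]),
    }
)
valid = t0_periods(
    data_periods,
    history=pd.Timedelta(minutes=60),
    forecast=pd.Timedelta(minutes=120),
)
t0 = t0_datetimes(valid, freq="5min")
print(valid)
print(len(t0), t0[0], t0[-1])
```

Here the 09:00 to 15:00 period shrinks to 10:00 to 13:00, and the one-hour period starting at 16:00 is dropped because it cannot hold 60 minutes of history plus 120 minutes of forecast.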
@JackKelly JackKelly added enhancement New feature or request data New data source or feature; or modification of existing data source labels Oct 12, 2021
@JackKelly
Member Author

JackKelly commented Oct 13, 2021

I've been investigating using pd.Period and pd.PeriodIndex.intersection. The good news is that we can instantiate pd.Period objects with arbitrary start times and end times (by passing in an arbitrary pd.Timedelta as freq). The bad news is that pd.PeriodIndex.intersection doesn't actually find the intersection of all the periods!

In [62]: period = pd.Period(pd.Timestamp("2021-01-01T12:34"), freq=pd.Timedelta(123, unit="seconds"))

In [63]: period2 = pd.Period(pd.Timestamp("2021-01-01T12:35"), freq=pd.Timedelta(10, unit="minutes"))

In [64]: pi = pd.PeriodIndex([period])

In [65]: pi2 = pd.PeriodIndex([period2])

In [66]: pi
Out[66]: PeriodIndex(['2021-01-01 12:34:00'], dtype='period[123S]')

In [67]: pi2
Out[67]: PeriodIndex(['2021-01-01 12:35'], dtype='period[10T]')

In [68]: pi.intersection(pi2)
Out[70]: Index([], dtype='object')

In [71]: pi2.intersection(pi)
Out[73]: Index([], dtype='object')

In [74]: period.start_time
Out[76]: Timestamp('2021-01-01 12:34:00')

In [77]: period.end_time
Out[79]: Timestamp('2021-01-01 12:36:02.999999999')

# And, worse, it's not possible to create a PeriodIndex with multiple arbitrary-length Periods:
In [153]: pd.PeriodIndex([period, period2])
---------------------------------------------------------------------------
IncompatibleFrequency                     Traceback (most recent call last)
<ipython-input-153-01722ae41681> in <module>
----> 1 pd.PeriodIndex([period, period2])

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/indexes/period.py in __new__(cls, data, ordinal, freq, dtype, copy, name, **fields)
    259             else:
    260                 # don't pass copy here, since we copy later.
--> 261                 data = period_array(data=data, freq=freq)
    262 
    263         if copy:

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/arrays/period.py in period_array(data, freq, copy)
   1010     data = ensure_object(arrdata)
   1011 
-> 1012     return PeriodArray._from_sequence(data, dtype=dtype)
   1013 
   1014 

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/arrays/period.py in _from_sequence(cls, scalars, dtype, copy)
    248 
    249         freq = freq or libperiod.extract_freq(periods)
--> 250         ordinals = libperiod.extract_ordinals(periods, freq)
    251         return cls(ordinals, freq=freq)
    252 

~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/_libs/tslibs/period.pyx in pandas._libs.tslibs.period.extract_ordinals()

IncompatibleFrequency: Input has different freq=10T from PeriodIndex(freq=123S)

@JackKelly
Member Author

So I'll go back to the idea of implementing our own intersection_of_periods() function, using pd.Series to represent an ordered list of time periods from each DataSource.
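
For reference, a hand-rolled intersection only needs the max of the start times and the min of the end times for each pair of periods. The sketch below is an assumption-laden illustration rather than the repository's implementation: it uses the start_dt / end_dt DataFrame layout from the issue description, the name intersect_periods is hypothetical, and it does a simple pairwise loop rather than anything optimised.

```python
import pandas as pd


def intersect_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Return the overlaps between two lists of closed periods (hypothetical helper)."""
    rows = []
    for a_p in a.itertuples():
        for b_p in b.itertuples():
            start = max(a_p.start_dt, b_p.start_dt)
            end = min(a_p.end_dt, b_p.end_dt)
            if start <= end:  # the two periods overlap
                rows.append((start, end))
    return pd.DataFrame(rows, columns=["start_dt", "end_dt"])


# Roughly the two periods from the IPython session above, written as
# closed intervals: 123 seconds from 12:34, and 10 minutes from 12:35.
a = pd.DataFrame(
    {
        "start_dt": [pd.Timestamp("2021-01-01 12:34:00")],
        "end_dt": [pd.Timestamp("2021-01-01 12:36:03")],
    }
)
b = pd.DataFrame(
    {
        "start_dt": [pd.Timestamp("2021-01-01 12:35"), pd.Timestamp("2021-01-01 13:00")],
        "end_dt": [pd.Timestamp("2021-01-01 12:45"), pd.Timestamp("2021-01-01 13:30")],
    }
)
overlap = intersect_periods(a, b)
print(overlap)
```

This finds the overlap from 12:35:00 to 12:36:03 that pd.PeriodIndex.intersection missed. The nested loop is O(n*m); sorted inputs would allow a single merge-style pass if the lists get long.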

@JackKelly
Member Author

JackKelly commented Oct 13, 2021

OK, I've implemented intersection_of_2_datetimes_of_periods() in commit 25ca0d0

@JackKelly JackKelly self-assigned this Oct 14, 2021
@flowirtz flowirtz moved this to Todo in Nowcasting Oct 15, 2021
@JackKelly JackKelly changed the title Sample t0 datetimes at 5 minute intervals instead of 30-minute intervals Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. Oct 18, 2021
@JackKelly JackKelly moved this from Todo to In Progress in Nowcasting Oct 18, 2021
@JackKelly JackKelly linked a pull request Oct 18, 2021 that will close this issue
20 tasks
@JackKelly JackKelly linked a pull request Oct 20, 2021 that will close this issue
7 tasks
Repository owner moved this from In Progress to Done in Nowcasting Oct 21, 2021
@JackKelly JackKelly moved this from Done to In Progress in Nowcasting Oct 22, 2021
@JackKelly JackKelly reopened this Oct 22, 2021
JackKelly added a commit that referenced this issue Oct 22, 2021
Repository owner moved this from In Progress to Done in Nowcasting Oct 25, 2021
Status: Done
1 participant