This repository was archived by the owner on Sep 11, 2023. It is now read-only.
Get intersection of _Periods_ of available data across all DataSources, instead of using datetimes. #223
Labels
data
New data source or feature; or modification of existing data source
enhancement
New feature or request
Comments
I've been investigating using
In [62]: period = pd.Period(pd.Timestamp("2021-01-01T12:34"), freq=pd.Timedelta(123, unit="seconds"))
In [63]: period2 = pd.Period(pd.Timestamp("2021-01-01T12:35"), freq=pd.Timedelta(10, unit="minutes"))
In [64]: pi = pd.PeriodIndex([period])
In [65]: pi2 = pd.PeriodIndex([period2])
In [66]: pi
Out[66]: PeriodIndex(['2021-01-01 12:34:00'], dtype='period[123S]')
In [67]: pi2
Out[67]: PeriodIndex(['2021-01-01 12:35'], dtype='period[10T]')
In [68]: pi.intersection(pi2)
Out[70]: Index([], dtype='object')
In [71]: pi2.intersection(pi)
Out[73]: Index([], dtype='object')
In [74]: period.start_time
Out[76]: Timestamp('2021-01-01 12:34:00')
In [77]: period.end_time
Out[79]: Timestamp('2021-01-01 12:36:02.999999999')
# And, worse, it's not possible to create a PeriodIndex with multiple arbitrary-length Periods:
In [153]: pd.PeriodIndex([period, period2])
---------------------------------------------------------------------------
IncompatibleFrequency Traceback (most recent call last)
<ipython-input-153-01722ae41681> in <module>
----> 1 pd.PeriodIndex([period, period2])
~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/indexes/period.py in __new__(cls, data, ordinal, freq, dtype, copy, name, **fields)
259 else:
260 # don't pass copy here, since we copy later.
--> 261 data = period_array(data=data, freq=freq)
262
263 if copy:
~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/arrays/period.py in period_array(data, freq, copy)
1010 data = ensure_object(arrdata)
1011
-> 1012 return PeriodArray._from_sequence(data, dtype=dtype)
1013
1014
~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/core/arrays/period.py in _from_sequence(cls, scalars, dtype, copy)
248
249 freq = freq or libperiod.extract_freq(periods)
--> 250 ordinals = libperiod.extract_ordinals(periods, freq)
251 return cls(ordinals, freq=freq)
252
~/miniconda3/envs/nowcasting_dataset/lib/python3.9/site-packages/pandas/_libs/tslibs/period.pyx in pandas._libs.tslibs.period.extract_ordinals()
IncompatibleFrequency: Input has different freq=10T from PeriodIndex(freq=123S)
So I'll go back to the idea of implementing our own
OK, I've implemented
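Since pd.PeriodIndex cannot hold periods of mixed length, one alternative is to keep each period as a plain row of start and end datetimes and intersect the rows directly. The following is a minimal sketch under that assumption, using the start_dt and end_dt column names proposed in the issue description. The function name intersection_of_periods is hypothetical and this is not the repository's implementation.

```python
import pandas as pd


def intersection_of_periods(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Return the overlaps between two tables of time periods.

    Each input has columns start_dt and end_dt, one row per period.
    Hypothetical helper for illustration only.
    """
    rows = []
    # Simple O(len(a) * len(b)) pairwise comparison; fine for a sketch.
    for a_row in a.itertuples(index=False):
        for b_row in b.itertuples(index=False):
            start = max(a_row.start_dt, b_row.start_dt)
            end = min(a_row.end_dt, b_row.end_dt)
            # Assumption: a zero-length overlap (start == end) still counts,
            # because it can hold a single valid datetime.
            if start <= end:
                rows.append({"start_dt": start, "end_dt": end})
    return pd.DataFrame(rows, columns=["start_dt", "end_dt"])


periods_a = pd.DataFrame({
    "start_dt": pd.to_datetime(["2021-01-01 12:00", "2021-01-01 15:00"]),
    "end_dt": pd.to_datetime(["2021-01-01 13:00", "2021-01-01 16:00"]),
})
periods_b = pd.DataFrame({
    "start_dt": pd.to_datetime(["2021-01-01 12:30"]),
    "end_dt": pd.to_datetime(["2021-01-01 15:30"]),
})
overlap = intersection_of_periods(periods_a, periods_b)
print(overlap)
```

Unlike pd.PeriodIndex, this representation does not care whether the two sources have the same sample period, because only the start and end of each period are compared.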
JackKelly added a commit that referenced this issue on Oct 13, 2021
JackKelly added a commit that referenced this issue on Oct 13, 2021
JackKelly added a commit that referenced this issue on Oct 14, 2021
JackKelly added a commit that referenced this issue on Oct 20, 2021
Repository owner moved this from In Progress to Done in Nowcasting on Oct 21, 2021
JackKelly added a commit that referenced this issue on Oct 22, 2021
…at might be a separate issue, see #276
Repository owner moved this from In Progress to Done in Nowcasting on Oct 25, 2021
Detailed Description
At present, when using GSP-region PV data, nowcasting_dataset produces ML training examples with t0 datetimes at 0 and 30 minutes past the hour.
We should experiment with enabling nowcasting_dataset to set t0 datetimes to 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55 minutes past the hour. This would increase the number of training examples by 6x (which might be quite a big deal!). Also, when we run a live service, we probably want to be able to update our PV forecasts every 5 minutes (when we get new satellite data & PVOutput.org data).
Possible Implementation
Self-attention models don't need different modalities to be perfectly aligned in space and time. So, we could produce examples where t0 could be at any 5-minute increment, and there's always, say, exactly 2 historical timesteps of GSP PV data; and 4 forecast timesteps of GSP PV data. For example, an example with t0 = 12:15, history_minutes = 60, and forecast_minutes=120 would look something like this:
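The timestep counts in this example can be checked with a short sketch. It assumes half-hourly GSP timestamps and half-open windows, (t0 - history, t0] for history and (t0, t0 + forecast] for the forecast. That window convention and the helper name gsp_timesteps are assumptions made for illustration, not taken from nowcasting_dataset.

```python
import pandas as pd


def gsp_timesteps(t0, history, forecast, gsp_index):
    """Split half-hourly GSP timestamps into history and forecast around t0.

    Hypothetical helper. Assumes half-open windows:
    history is (t0 - history, t0], forecast is (t0, t0 + forecast].
    """
    history_steps = gsp_index[(gsp_index > t0 - history) & (gsp_index <= t0)]
    forecast_steps = gsp_index[(gsp_index > t0) & (gsp_index <= t0 + forecast)]
    return history_steps, forecast_steps


# Illustrative half-hourly GSP timestamps covering one day.
gsp_index = pd.date_range("2021-01-01 00:00", "2021-01-02 00:00", freq="30min")

t0 = pd.Timestamp("2021-01-01 12:15")
history_steps, forecast_steps = gsp_timesteps(
    t0,
    history=pd.Timedelta(minutes=60),
    forecast=pd.Timedelta(minutes=120),
    gsp_index=gsp_index,
)
print(history_steps)   # 11:30 and 12:00
print(forecast_steps)  # 12:30, 13:00, 13:30 and 14:00
```

With these window definitions, every 5-minute t0 that is far enough from the ends of the data yields the same counts, which is the property the proposal relies on.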
I think DataSource.get_example should work as required already, although it'll need testing.
The trickier bit is changing DataModule._get_t0_datetimes() to be independent of the sample period of the various DataSources. I think the way forwards here is to have each DataSource emit a list of periods for which it has contiguous data. In particular, emit a pd.DataFrame with two columns: start_dt and end_dt; where each row represents a contiguous period of data. But, before implementing our own Period class with a Period.intersection method, we should check if pd.Period can handle arbitrary periods and/or if we can re-use pd.PeriodIndex.intersection().
This should also enable the implementation of #135
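The idea of emitting contiguous periods as a two-column pd.DataFrame can be sketched as follows. This assumes a period ends wherever the gap between consecutive timestamps exceeds a chosen threshold. The function name contiguous_periods and the max_gap parameter are hypothetical and are not the repository's API.

```python
import pandas as pd


def contiguous_periods(datetimes: pd.DatetimeIndex, max_gap: pd.Timedelta) -> pd.DataFrame:
    """Group timestamps into contiguous periods.

    Returns a DataFrame with columns start_dt and end_dt, one row per period.
    Hypothetical helper for illustration only.
    """
    if len(datetimes) == 0:
        return pd.DataFrame(columns=["start_dt", "end_dt"])
    series = pd.Series(datetimes.sort_values())
    # A new period starts wherever the gap to the previous timestamp
    # exceeds max_gap. The first diff is NaT, which compares as False.
    period_id = (series.diff() > max_gap).cumsum()
    grouped = series.groupby(period_id)
    return pd.DataFrame(
        {"start_dt": grouped.min(), "end_dt": grouped.max()}
    ).reset_index(drop=True)


# Five-minutely data with a gap between 12:10 and 13:00.
index = pd.DatetimeIndex([
    "2021-01-01 12:00", "2021-01-01 12:05", "2021-01-01 12:10",
    "2021-01-01 13:00", "2021-01-01 13:05",
])
periods = contiguous_periods(index, max_gap=pd.Timedelta(minutes=5))
print(periods)
```

Each DataSource could pass its own sample period as the gap threshold, so the resulting table no longer depends on how often that source is sampled.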
This might be important for WP1. I'll add it to the WP1 project for now, and we'll see how ML training goes with half-hour data.
Sub-tasks
- Override DataSource.get_contiguous_time_periods() in SatelliteDataSource (?) to remove nighttime. UPDATE: This is already implemented by SatelliteDataSource.datetime_index()!
Done in PR #220:
- nd_time.get_contiguous_time_periods() -> pd.DataFrame
- nd_time.get_contiguous_time_periods()
Done in PR #256
- DataSource.get_contiguous_time_periods() -> pd.DataFrame to emit a list of valid time periods. Use nd_time.get_contiguous_time_periods().
- DataSource.get_contiguous_time_periods()
PR #274
- DataSource.get_contigous_t0_time_periods() -> pd.DataFrame which goes through each period and chops off history_duration from the beginning of the period, and chops off forecast_duration from the end of the period.
- NowcastingDataModule to compute the intersection of all the lists of t0 time periods from each DataSource. Use nd_time.intersection_of_2_dataframes_of_periods().
- nd_time.get_start_datetimes(), nd_time.intersection_of_datetimeindexes(), DataSource.get_t0_datetimes(), and their tests, and use grep to check they're not called from anywhere I've missed.
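The step that turns contiguous data periods into t0 periods can be sketched as below. This is a minimal illustration, assuming the start_dt and end_dt columns described above. The function name t0_periods is hypothetical, and dropping periods that are too short is an assumption about the desired behaviour rather than something stated in the sub-tasks.

```python
import pandas as pd


def t0_periods(
    periods: pd.DataFrame,
    history_duration: pd.Timedelta,
    forecast_duration: pd.Timedelta,
) -> pd.DataFrame:
    """Shrink each contiguous data period to the range of valid t0 datetimes.

    Hypothetical helper for illustration only.
    """
    out = pd.DataFrame({
        # t0 needs history_duration of data before it...
        "start_dt": periods["start_dt"] + history_duration,
        # ...and forecast_duration of data after it.
        "end_dt": periods["end_dt"] - forecast_duration,
    })
    # Assumption: periods too short to hold one full example are dropped.
    return out[out["start_dt"] <= out["end_dt"]].reset_index(drop=True)


data_periods = pd.DataFrame({
    "start_dt": pd.to_datetime(["2021-01-01 12:00", "2021-01-01 18:00"]),
    "end_dt": pd.to_datetime(["2021-01-01 16:00", "2021-01-01 19:00"]),
})
valid_t0 = t0_periods(
    data_periods,
    history_duration=pd.Timedelta(hours=1),
    forecast_duration=pd.Timedelta(hours=2),
)
print(valid_t0)  # one row: 13:00 to 14:00; the one-hour period is dropped
```

Once every DataSource produces a table like this, the tables can be intersected pairwise, and the surviving periods can be expanded into 5-minute t0 datetimes, for example with pd.date_range(start_dt, end_dt, freq="5min").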