-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Support NDFrame.shift with EAs #22387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Uses take internally. Closes pandas-dev#22386
i think it’s worth adding to the interface itself |
Wouldn't it be possible to do this without take, but with slicing? Something like (very crude implementation for positive shifts):
Speed wise I would expect this to be very similar to the current |
Codecov Report
@@ Coverage Diff @@
## master #22387 +/- ##
==========================================
- Coverage 92.05% 92.05% -0.01%
==========================================
Files 169 169
Lines 50733 50743 +10
==========================================
+ Hits 46702 46711 +9
- Misses 4031 4032 +1
Continue to review full report at Codecov.
|
Categorical was apparently a very bad example as there it is like 5x slower than the existing |
That seems preferable to me, as it avoids allocating the |
pandas/core/arrays/base.py
Outdated
shifted : ExtensionArray | ||
""" | ||
if periods == 0: | ||
return self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on this? non-zero periods will necessarily be a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although Series.shift(0)
doesn't make a copy, so let's follow that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find that strange behaviour to not copy then (it would be more logical to know that the result is always a copy, regardless the value of periods IMO)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #22397 for that. Will leave this as is then.
Just as a note, I couldn't quite use this When The root issue is that SparseArray stores fill_value on the array, rather than on the dtype. I haven't looked into whether it should go on the dtype or not, it's not clear to me. Regardless, I think this is good to go, if we're happy with the slice and concat approach. |
From that it seems to makes sense to have this on the dtype then (but not that familiar with the rest of the code) |
Indeed... On the other hand, it complicates things like equality a bit... I'll push it onto the dtype and see what breaks. |
Any objections to merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we be able to change the impl of interval / categorical shift for this?
@@ -400,6 +400,36 @@ def dropna(self): | |||
|
|||
return self[~self.isna()] | |||
|
|||
def shift(self, periods=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think update in the ExtensionArray doc-string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, in the class docstring we only mention the methods that either needs to implemented (because they raise AbstractMethodError otherwise) or either have a suboptimal implementation because it does the object ndarray roundtrip.
This is not the case here (which is not saying we couldn't also list other methods that can be overriden for specific reasons)
As I mentioned above (#22387 (comment)), if we implement Categorical.shift as is done here (eg by simply inheriting this base method), it gets much slower because the creating of the NaNs and concatting same typed-categoricals is very slow (at least slower as it should be). |
(0, [0, 1, 2, 3, 4]), | ||
(2, [-1, -1, 0, 1, 2]), | ||
]) | ||
def test_container_shift_negative(self, data, frame, periods, indices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor comment, but is there a reason this is called test_container_shift_negative
? (because it does not only test negative periods, and 'container' seems a bit superfluous). I would simply do test_shift
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from a merger of multiple tests. Will fix the name.
I do test_container_shift
, rather than just test_shift
, as this test goes through Series(array).shift
, rather than directly doing array.shift
.
It seems that In [22]: idx = IntervalArray.from_breaks(pd.date_range('2017', periods=10))
In [23]: idx.shift()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-1b2c2192e1e6> in <module>()
----> 1 idx.shift()
~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
422 return self.copy()
423 empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424 dtype=self.dtype)
425 if periods > 0:
426 a = empty
~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
193 @classmethod
194 def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195 return cls(scalars, dtype=dtype, copy=copy)
196
197 @classmethod
~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
138
139 return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140 verify_integrity=verify_integrity)
141
142 @classmethod
~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
156 raise TypeError(msg.format(dtype=dtype))
157 elif dtype.subtype is not None:
--> 158 left = left.astype(dtype.subtype)
159 right = right.astype(dtype.subtype)
160
~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
313 msg = ('Cannot convert Float64Index to dtype {dtype}; integer '
314 'values are required for conversion').format(dtype=dtype)
--> 315 raise TypeError(msg)
316 elif is_integer_dtype(dtype) and self.hasnans:
317 # GH 13149
TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion
In [24]: idx = IntervalArray.from_breaks(range(10))
In [25]: idx.shift()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-1b2c2192e1e6> in <module>()
----> 1 idx.shift()
~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
422 return self.copy()
423 empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424 dtype=self.dtype)
425 if periods > 0:
426 a = empty
~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
193 @classmethod
194 def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195 return cls(scalars, dtype=dtype, copy=copy)
196
197 @classmethod
~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
138
139 return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140 verify_integrity=verify_integrity)
141
142 @classmethod
~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
156 raise TypeError(msg.format(dtype=dtype))
157 elif dtype.subtype is not None:
--> 158 left = left.astype(dtype.subtype)
159 right = right.astype(dtype.subtype)
160
~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
316 elif is_integer_dtype(dtype) and self.hasnans:
317 # GH 13149
--> 318 raise ValueError('Cannot convert NA to integer')
319 return super(Float64Index, self).astype(dtype, copy=copy)
320
ValueError: Cannot convert NA to integer The basic issue is
The container (ndarray[self.dtype]) can't actually hold Does that change our opinion on this implementation? |
I think that is rather a bug in IntervalArray? (or maybe better a current limitation) Anyway, if the point is that it (currently) cannot hold NaNs, it can also not implement |
Well, it can implement |
Basically, the problem is that for IntervalArray, the data is stored in For the datetime error above, that is actually solvable with the current model, because if the data is not all-NaN it works:
|
Ah, yes, the same happens with I think ideally, we should improve the missing values handling on the IntervalArray level (since it's a composite array, we have the ability to do it properly, not requiring such upcasts), but that's certainly out of scope for this PR :-) |
So using the I don't really care much to whether to switch back to that or leave the implementation now as is (with the consequence shift doesn't work for Interval; although given the |
That's my preference. Opened #22428 for that. |
CI is passing now (last one was a random failure). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor doc-comments.
Shift values by desired number. | ||
|
||
Newly introduced missing values are filled with | ||
``self.dtype.na_value``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a versionadded tag
pandas/core/internals/blocks.py
Outdated
@@ -2068,6 +2068,12 @@ def interpolate(self, method='pad', axis=0, inplace=False, limit=None, | |||
limit=limit), | |||
placement=self.mgr_locs) | |||
|
|||
def shift(self, periods, axis=0, mgr=None): | |||
# type: (int, Optional[BlockPlacement]) -> List[ExtensionBlock] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
[ci skip]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comments. merge away
|
||
Dispatches to underlying ExtensionArray and re-boxes in an | ||
ExtensionBlock. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bonus points for the Parameters :> (future PR ok too)
thanks @TomAugspurger |
Uses take internally.
Closes #22386
One question: do we want to add
shift
to the interface, and dispatch to that? This just does the right thing at the block level, by writing ashift
that'll work with an ExtensionArray.CategoricalBlock still overrides
.shift
to callCategorical.shift
, which uses a slightly faster implementation based around the codes andnp.roll
(~1.5-2x faster based on microbenchmarks)`.We could add move
ExtensionBlock.shift
toExtensionArray.shift
, and just call that from ExtensionBlock. Then we can delete CategoricalBlock.shift.I'm 50/50 on whether it's worth adding to the interface.