IntervalArray.shift and missing value handling #22428

Closed
TomAugspurger opened this issue Aug 20, 2018 · 5 comments
Labels
Bug · ExtensionArray (Extending pandas with custom dtypes or arrays.) · Interval (Interval data type) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@TomAugspurger
Contributor

Followup to #22387

The default implementation of shift fails when dtype.na_value can't be stored in an array of that dtype (e.g. an int array can't hold NA).

In [24]: idx = IntervalArray.from_breaks(range(10))

In [25]: idx.shift()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149
--> 318             raise ValueError('Cannot convert NA to integer')
    319         return super(Float64Index, self).astype(dtype, copy=copy)
    320

ValueError: Cannot convert NA to integer

Perhaps we can investigate using our IntegerNA extension array for the storage of int-dtyped IntervalArrays?
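
For illustration, the failure can be reproduced in miniature without pandas. The sketch below is not pandas code: `default_shift` and `from_sequence_int` are hypothetical names that only mimic the shape of the default shift (pad with the NA value, then concatenate) on plain lists, assuming NaN as the NA value and an integer-only constructor.

```python
import math


def from_sequence_int(scalars):
    """Hypothetical stand-in for an int-backed constructor: every scalar must cast to int."""
    return [int(x) for x in scalars]


def default_shift(values, periods=1, na_value=math.nan, from_sequence=from_sequence_int):
    """Mimic the default shift on a list: build NA padding, then concatenate."""
    if periods == 0 or not values:
        return list(values)
    # The padding is built through the same constructor as the data,
    # so the NA value has to be representable in the storage type.
    empty = from_sequence([na_value] * min(abs(periods), len(values)))
    if periods > 0:
        return empty + values[:-periods]
    return values[abs(periods):] + empty


try:
    default_shift([0, 1, 2])
except ValueError as err:
    # int(nan) is undefined, so integer storage cannot hold the padding.
    print("shift failed:", err)

# With float storage (an assumption for this sketch) the same logic succeeds.
print(default_shift([0.0, 1.0, 2.0], from_sequence=lambda s: [float(x) for x in s]))
```

With integer storage the call raises `ValueError`, which is the same kind of cast that fails at the bottom of the traceback above. With float storage the padding is stored as NaN and the shift goes through.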

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Interval Interval data type ExtensionArray Extending pandas with custom dtypes or arrays. labels Aug 20, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Aug 22, 2018
@jorisvandenbossche
Member

> Perhaps we can investigate using our IntegerNA extension array for the storage of int-dtyped IntervalArrays?

Alternatively, we could also think about using a mask-based approach for missing values (similar to IntegerArray).
I don't know how much extra complexity it would cause (I would think mainly changes in isna and in the construction methods), and it would certainly increase memory usage, but it would ensure consistent handling of missing values regardless of the dtype of the breaks (whether or not that dtype supports missing values).
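
As a rough sketch of what a mask-based approach could look like, the toy class below keeps integer endpoints in plain lists next to a boolean mask. `MaskedIntervals`, its methods, and the `fill` placeholder are all hypothetical and are not part of pandas. The sketch only shows that the endpoints can stay integers while `isna` reads the mask.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MaskedIntervals:
    """Toy interval container: integer endpoints plus a missing-value mask."""

    left: List[int]
    right: List[int]
    mask: List[bool]  # True marks a missing interval

    @classmethod
    def from_breaks(cls, breaks):
        breaks = list(breaks)
        left, right = breaks[:-1], breaks[1:]
        return cls(left, right, [False] * len(left))

    def isna(self):
        # Missingness comes from the mask, never from the endpoint values.
        return list(self.mask)

    def shift(self, periods=1, fill=0):
        n = len(self.mask)
        k = min(abs(periods), n)
        if k == 0:
            return MaskedIntervals(list(self.left), list(self.right), list(self.mask))
        # `fill` is an arbitrary integer placeholder hidden behind the mask.
        pad, pad_mask = [fill] * k, [True] * k
        if periods > 0:
            return MaskedIntervals(
                pad + self.left[: n - k],
                pad + self.right[: n - k],
                pad_mask + self.mask[: n - k],
            )
        return MaskedIntervals(
            self.left[k:] + pad, self.right[k:] + pad, self.mask[k:] + pad_mask
        )

    def to_list(self):
        return [None if m else (l, r) for l, r, m in zip(self.left, self.right, self.mask)]


shifted = MaskedIntervals.from_breaks(range(4)).shift()
print(shifted.to_list())  # masked view of the intervals
print(shifted.left)       # raw endpoints, including the placeholder
```

In this sketch the raw `left` and `right` lists still contain the placeholder at masked positions, so code that reads them directly would see a value that is not a real endpoint.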

@jorisvandenbossche
Member

cc @jschendel

@jorisvandenbossche
Member

Of course, if people directly access and use .left and .right, this might give unexpected results.

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Nov 6, 2018
@jreback jreback modified the milestones: 0.25.0, Contributions Welcome Apr 20, 2019
@mroeschke
Member

This looks close to solved, though it would probably be better to use IntegerNA instead of converting to float:

In [52]: pandas.arrays.IntervalArray.from_breaks(range(10)).shift()
Out[52]:
<IntervalArray>
[nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0], (5.0, 6.0], (6.0, 7.0], (7.0, 8.0]]
Length: 9, closed: right, dtype: interval[float64]

In [53]: pd.__version__
Out[53]: '1.1.0.dev0+1390.gf3fdab389'

@jbrockmendel
Member

Closed by #31502
