API: generalized check_array_indexer for validating array-like getitem indexers #31150
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,7 +42,7 @@ | |
from pandas.core.algorithms import checked_add_with_arr, take, unique1d, value_counts | ||
from pandas.core.arrays.base import ExtensionArray, ExtensionOpsMixin | ||
import pandas.core.common as com | ||
from pandas.core.indexers import check_bool_array_indexer | ||
from pandas.core.indexers import check_array_indexer | ||
from pandas.core.ops.common import unpack_zerodim_and_defer | ||
from pandas.core.ops.invalid import invalid_comparison, make_invalid_op | ||
|
||
|
@@ -517,8 +517,12 @@ def __getitem__(self, key): | |
return self._box_func(val) | ||
return type(self)(val, dtype=self.dtype) | ||
|
||
if is_list_like(key): | ||
key = check_array_indexer(self, key) | ||
|
||
|
||
if com.is_bool_indexer(key): | ||
key = check_bool_array_indexer(self, key) | ||
# can still have object dtype | ||
Are there any uses of is_bool_indexer left? Can you just get rid of them?
Yes, we still use it in several places. As long as we haven't deprecated and removed boolean indexing with object dtype (again, see the non-inline discussion in this PR, #31150 (comment)), we will need this.
Can you create an issue to rename / refactor / remove is_bool_indexer then? Now we have 2 ways of checking boolean indexers; either check_array_indexer should completely subsume it or it should be renamed.
Yes, because there are places we allow object dtype (for backwards compatibility), and there are places where we are more strict. Why would it need to be renamed? Or what name do you suggest?
I would just incorporate it in check_array_indexer. To be honest, I have to know that is_bool_indexer is something that doesn't check indexing except in object arrays.
||
key = np.asarray(key, dtype=bool) | ||
if key.all(): | ||
key = slice(0, None, None) | ||
else: | ||
|
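The `__getitem__` hunk above validates list-like keys first and then special-cases boolean masks, turning an all-True mask into a full slice. A minimal NumPy-only sketch of that flow follows. It assumes a hypothetical `normalize_key` helper (not a pandas function), with a plain length check standing in for `check_array_indexer`.

```python
import numpy as np


def normalize_key(values, key):
    """Hypothetical helper mirroring the order of checks in the hunk above."""
    if isinstance(key, (list, np.ndarray)):
        key = np.asarray(key)
        if key.dtype == bool:
            # Boolean masks must match the length of the indexed values.
            if len(key) != len(values):
                raise IndexError(
                    f"Item wrong length {len(key)} instead of {len(values)}."
                )
            # An all-True mask selects everything, so a full slice is equivalent.
            if key.all():
                return slice(0, None, None)
    return key


values = np.array([10, 20, 30])
full = normalize_key(values, [True, True, True])
partial = normalize_key(values, [True, False, True])
```

Returning a slice for the all-True case lets the caller take a cheap view instead of building a masked copy, which appears to be the motivation for the shortcut in the diff.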
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,7 @@ | |
is_datetime64_any_dtype, | ||
is_dtype_equal, | ||
is_integer, | ||
is_list_like, | ||
is_object_dtype, | ||
is_scalar, | ||
is_string_dtype, | ||
|
@@ -43,6 +44,7 @@ | |
from pandas.core.base import PandasObject | ||
import pandas.core.common as com | ||
from pandas.core.construction import sanitize_array | ||
from pandas.core.indexers import check_array_indexer | ||
from pandas.core.missing import interpolate_2d | ||
import pandas.core.ops as ops | ||
from pandas.core.ops.common import unpack_zerodim_and_defer | ||
|
@@ -768,6 +770,9 @@ def __getitem__(self, key): | |
else: | ||
key = np.asarray(key) | ||
|
||
if is_list_like(key): | ||
Is this repeated on purpose?
Repeated from where?
The next check, is_bool_indexer, is duplicative.
It's not fully duplicative, see my long explanation at #31150 (comment). It's mainly for dealing with object dtype.
||
key = check_array_indexer(self, key) | ||
|
||
if com.is_bool_indexer(key): | ||
key = check_bool_indexer(self, key) | ||
|
||
|
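The review thread above turns on object dtype: a mask stored with `dtype=object` does not report a boolean dtype, so a dtype-only check skips it and a value-based check is still needed. The following is a small NumPy illustration. `is_bool_like_object` is a hypothetical stand-in, not the pandas `is_bool_indexer` implementation.

```python
import numpy as np

strict_mask = np.array([True, False, True])
object_mask = np.array([True, False, True], dtype=object)

# A dtype-based test only recognises the first mask.
dtype_says_bool = [m.dtype == bool for m in (strict_mask, object_mask)]


def is_bool_like_object(key):
    """Hypothetical value-based check for object-dtype masks."""
    return key.dtype == object and all(
        isinstance(v, (bool, np.bool_)) for v in key
    )


object_is_bool_like = is_bool_like_object(object_mask)
# Once recognised, the object mask can be cast and used like a normal mask.
selected = np.array([1, 2, 3])[object_mask.astype(bool)]
```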
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,12 @@ | |
|
||
from pandas._typing import AnyArrayLike | ||
|
||
from pandas.core.dtypes.common import is_list_like | ||
from pandas.core.dtypes.common import ( | ||
is_array_like, | ||
is_bool_dtype, | ||
is_integer_dtype, | ||
is_list_like, | ||
) | ||
from pandas.core.dtypes.generic import ABCIndexClass, ABCSeries | ||
|
||
# ----------------------------------------------------------- | ||
|
@@ -307,3 +312,62 @@ def check_bool_array_indexer(array: AnyArrayLike, mask: AnyArrayLike) -> np.ndar | |
if len(result) != len(array): | ||
raise IndexError(f"Item wrong length {len(result)} instead of {len(array)}.") | ||
return result | ||
|
||
|
||
def check_array_indexer(array, indexer) -> np.ndarray: | ||
Can you type these at all? Shouldn't indexer -> key and be Label (or maybe something more sophisticated)? Not looking to solve this in this PR necessarily.
||
""" | ||
Check if `indexer` is a valid array indexer for `array`. | ||
|
||
`array` and `indexer` are checked to have the same length, and the | ||
dtype is validated. If it is an integer or boolean ExtensionArray, it is | ||
checked if there are missing values present, and it is converted to | ||
the appropriate numpy array. | ||
|
||
.. versionadded:: 1.0.0 | ||
1.0 or 1.1?
1.0 if we're planning to subsume check_bool_array_indexer.
Yes, this is replacing
||
|
||
Parameters | ||
---------- | ||
array : array | ||
Can this be made more specific, e.g. "np.ndarray or EA"?
It's only used to get the length, so made it "array-like" (can in principle also be a Series).
||
The array that's being indexed (only used for the length). | ||
indexer : array-like | ||
In a few places above, you've done Thoughts on what we want? Requiring an array is certainly easier, so that we don't have to infer the types. But users may be passing arbitrary objects to
We actually don't require an array with a dtype. The first thing that this function does is:
to deal with e.g. lists. So I probably meant to update the array into "list-like" instead of "array-like".
||
The array-like that's used to index. | ||
|
||
Returns | ||
------- | ||
numpy.ndarray | ||
The validated indexer. | ||
|
||
Raises | ||
------ | ||
IndexError | ||
When the lengths don't match. | ||
ValueError | ||
When `indexer` cannot be converted to a numpy ndarray. | ||
|
||
""" | ||
|
||
import pandas as pd | ||
|
||
if not is_array_like(indexer): | ||
|
||
indexer = pd.array(indexer) | ||
dtype = indexer.dtype | ||
if is_bool_dtype(dtype): | ||
try: | ||
indexer = np.asarray(indexer, dtype=bool) | ||
except ValueError: | ||
raise ValueError("Cannot mask with a boolean indexer containing NA values") | ||
|
||
# GH26658 | ||
if len(indexer) != len(array): | ||
raise IndexError( | ||
f"Item wrong length {len(indexer)} instead of {len(array)}." | ||
) | ||
|
||
elif is_integer_dtype(dtype): | ||
|
||
try: | ||
indexer = np.asarray(indexer, dtype=int) | ||
Does int vs np.int64 vs np.intp matter here? Are there failure modes other than the presence of NAs?
This does matter; indexers are intp.
Yes, that was on my todo to fix up. Need to figure out the easiest way to convert to numpy array preserving the bit-ness of the dtype (or can we always convert to intp?). Will update tomorrow.
OK, went with np.intp. From a quick test, when you pass non-intp integers to index with numpy, it's not slower to do the conversion to intp yourself beforehand (although while writing this, what happens if you try to index with a too large int64 that doesn't fit into int32 on a 32-bit platform?).
ensure_platform_int is a well established pattern.
Do you prefer to update
Either way, but it should be consistent and use only 1 pattern; ensure_platform_int is used extensively already.
||
except ValueError: | ||
raise ValueError( | ||
"Cannot index with an integer indexer containing NA values" | ||
) | ||
|
||
return indexer |
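The reviewers settle on `np.intp` for integer indexers, since that is the dtype NumPy itself uses for indexing. Below is a simplified sketch of the validation described in the docstring, using only NumPy. `validate_indexer` is a hypothetical name, `None` stands in for a missing value, and the NA check is an assumption made for illustration rather than what the PR does with extension arrays.

```python
import numpy as np


def validate_indexer(array, indexer):
    """Hypothetical, simplified take on the checks in check_array_indexer."""
    indexer = np.asarray(indexer)
    if indexer.dtype == object:
        # Assumption: None marks a missing value in a list-like indexer.
        if any(v is None for v in indexer):
            raise ValueError("Cannot index with an indexer containing NA values")
        raise IndexError("Unsupported object-dtype indexer")
    if indexer.dtype == bool:
        # Boolean masks must have the same length as the indexed array.
        if len(indexer) != len(array):
            raise IndexError(
                f"Item wrong length {len(indexer)} instead of {len(array)}."
            )
        return indexer
    if np.issubdtype(indexer.dtype, np.integer):
        # Convert to the platform indexing dtype up front.
        return indexer.astype(np.intp, copy=False)
    raise IndexError("Indexer must have integer or boolean dtype")


data = np.array(["a", "b", "c"])
positions = validate_indexer(data, np.array([0, 2], dtype=np.int32))
mask = validate_indexer(data, [True, False, True])
```

Here `array` is only consulted for its length, matching the docstring's note that the indexed array is "only used for the length".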
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -976,8 +976,9 @@ def test_engine_type(self, dtype, engine_type): | |
assert np.issubdtype(ci.codes.dtype, dtype) | ||
assert isinstance(ci._engine, engine_type) | ||
|
||
def test_getitem_2d_deprecated(self): | ||
def test_getitem_raise_2d(self): | ||
# GH#30588 multi-dim indexing is deprecated, but raising is also acceptable | ||
idx = self.create_index() | ||
with pytest.raises(ValueError, match="cannot mask with array containing NA"): | ||
msg = "Cannot user indexer with multiple dimensions" | ||
with pytest.raises(IndexError, match=msg): | ||
idx[:, None] | ||
This was removed because the base class version (which checks for the deprecation) now passes (since I added the deprecation warning).