Skip to content

BUG/TST: fix arrow roundtrip / parquet tests for recent pyarrow #30077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 18, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions pandas/core/arrays/boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,24 @@ def __repr__(self) -> str:
def _is_boolean(self) -> bool:
return True

def __from_arrow__(self, array):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be annotated? or at least have types in the docstring

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is mentioned in the docstring (just not in a parameters section, but can turn it into a more fully fledged docstring). No one except for pyarrow should be calling this method though.

Question for annotating: how does it work to annotate it with pyarrow objects that cannot necessarily be imported? (since it's an optional dependency)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea about the annotation?

"""Construct BooleanArray from passed pyarrow Array/ChunkedArray"""
import pyarrow

if isinstance(array, pyarrow.Array):
chunks = [array]
else:
# pyarrow.ChunkedArray
chunks = array.chunks

results = []
for arr in chunks:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u write this method generically and put on the base class or Arrow mxin class

as it already looks like it would work for any extension type (except the final use of BooleanArray)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IntegerDtype implementation is different, though. The implementation here is indeed more or less the same as for the StringArray, but once we we fix the mentioned TODO (to avoid going through object dtype), the Boolean one will also be custom.

The StringDtype one could still be put in a base mixin (it needs to be a mixin, and not directly in the base ExtensionDtype class, as for pyarrow the presence or absence of this method is relevant), but no one else would be using it for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah we want avoid repeating this code as much as possible, so having ageneric (but working) impl would be good, and using helper / properties to ease the burden on each dtype would also be great. sure this could be done later as well.

# TODO should optimize this without going through object array
bool_arr = BooleanArray._from_sequence(np.array(arr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool_arr = BooleanArray._from_sequence(np.array(arr))
bool_arr = BooleanArray._from_sequence(np.asarray(arr))

No?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way that a conversion from a pyarrow boolean array (which uses a bitmask) to a numpy array can be without a copy, so it shouldn't matter I think

results.append(bool_arr)

return BooleanArray._concat_same_type(results)


def coerce_to_array(values, mask=None, copy: bool = False):
"""
Expand Down
7 changes: 5 additions & 2 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ def __from_arrow__(self, array):

results = []
for arr in chunks:
# using _from_sequence to ensure None is convered to np.nan
# using _from_sequence to ensure None is convered to NA
str_arr = StringArray._from_sequence(np.array(arr))
results.append(str_arr)

Expand Down Expand Up @@ -208,7 +208,10 @@ def __arrow_array__(self, type=None):

if type is None:
type = pa.string()
return pa.array(self._ndarray, type=type, from_pandas=True)

values = self._ndarray.copy()
values[self.isna()] = None
return pa.array(values, type=type, from_pandas=True)

def _values_for_factorize(self):
arr = self._ndarray.copy()
Expand Down
4 changes: 2 additions & 2 deletions pandas/tests/arrays/string_/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -235,5 +235,5 @@ def test_arrow_roundtrip():
result = table.to_pandas()
assert isinstance(result["a"].dtype, pd.StringDtype)
tm.assert_frame_equal(result, df)
# ensure the missing value is represented by NaN and not None
assert np.isnan(result.loc[2, "a"])
# ensure the missing value is represented by NA and not np.nan or None
assert result.loc[2, "a"] is pd.NA
21 changes: 19 additions & 2 deletions pandas/tests/arrays/test_boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -757,12 +757,29 @@ def test_any_all(values, exp_any, exp_all, exp_any_noskip, exp_all_noskip):
# result = arr[mask]


@pytest.mark.skip(reason="broken test")
@td.skip_if_no("pyarrow", min_version="0.15.0")
def test_arrow_array(data):
# protocol added in 0.15.0
import pyarrow as pa

arr = pa.array(data)
expected = pa.array(np.array(data, dtype=object), type=pa.bool_(), from_pandas=True)

# TODO use to_numpy(na_value=None) here
data_object = np.array(data, dtype=object)
data_object[data.isna()] = None
expected = pa.array(data_object, type=pa.bool_(), from_pandas=True)
assert arr.equals(expected)


@td.skip_if_no("pyarrow", min_version="0.15.1.dev")
def test_arrow_roundtrip():
# roundtrip possible from arrow 1.0.0
import pyarrow as pa

data = pd.array([True, False, None], dtype="boolean")
df = pd.DataFrame({"a": data})
table = pa.table(df)
assert table.field("a").type == "bool"
result = table.to_pandas()
assert isinstance(result["a"].dtype, pd.BooleanDtype)
tm.assert_frame_equal(result, df)
1 change: 0 additions & 1 deletion pandas/tests/io/test_parquet.py
Original file line number Diff line number Diff line change
Expand Up @@ -525,7 +525,6 @@ def test_write_with_schema(self, pa):
out_df = df.astype(bool)
check_round_trip(df, pa, write_kwargs={"schema": schema}, expected=out_df)

@pytest.mark.skip(reason="broken test")
@td.skip_if_no("pyarrow", min_version="0.15.0")
def test_additional_extension_arrays(self, pa):
# test additional ExtensionArrays that are supported through the
Expand Down