API: Specify the dtype of new columns added in reindex #33586

burk · 2020-04-16T09:33:23Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.nan, 1., 2.]}).astype(pd.SparseDtype("float", np.nan))
df = df.reindex(['x', 'y'], axis='columns')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
x    -4 non-null Sparse[float64, nan]
y    -6 non-null float64
dtypes: Sparse[float64, nan](1), float64(1)
memory usage: 176.0 bytes

Problem description

When re-indexing the columns of a sparse dataframe, new columns are not sparse. This is problematic especially since the new columns would be completely sparse.
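To illustrate why a dense all-NaN column is wasteful here, the following minimal sketch compares a dense float64 Series of NaN values with the same data stored as `Sparse[float64, nan]`. The length of 1000 is an arbitrary choice for illustration and does not come from the report above.

```python
import numpy as np
import pandas as pd

# An all-NaN column, similar to what reindex creates for a new label.
# The length (1000) is arbitrary and chosen only for illustration.
dense = pd.Series([np.nan] * 1000, dtype="float64")

# The same data stored sparsely, with NaN as the fill value.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Number of explicitly stored (non-fill) values in the sparse column.
print("stored points:", sparse.sparse.npoints)

# Memory used by the values, excluding the index.
print("dense bytes: ", dense.memory_usage(index=False))
print("sparse bytes:", sparse.memory_usage(index=False))
```

Because every value equals the fill value, the sparse version stores no points at all, while the dense version allocates a full float64 per row.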

Expected Output

I'd expect the new column to also be of type Sparse[float64, nan].

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-45-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : 0.29.15
pytest : 5.3.5
hypothesis : None
sphinx : 2.4.1
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.13
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


TomAugspurger · 2020-04-16T14:03:48Z

I don't agree that the column y should automatically be sparse. That kind of implicit dependence on the dtypes of the other columns would lead to surprises.

What reindex lacks is a way to specify the dtype of the new columns. Something like

df.reindex(columns=['x', 'y'], dtype=pd.SparseDtype('float64'))

would be reasonable.
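Until something like that exists, the effect can be approximated in user code. The sketch below is only a workaround under stated assumptions: `reindex_columns_with_dtype` is a hypothetical helper name, not a pandas function, and `reindex` itself takes no `dtype` argument. It reindexes the columns and then casts only the newly created ones with `astype`.

```python
import numpy as np
import pandas as pd


def reindex_columns_with_dtype(df, columns, dtype):
    """Reindex columns and cast only the newly added ones to `dtype`.

    Hypothetical helper emulating the proposed `dtype=` keyword;
    it is not part of the pandas API.
    """
    # Labels that do not exist yet will be created by reindex as all-NaN.
    new_cols = [c for c in columns if c not in df.columns]
    out = df.reindex(columns=columns)
    if new_cols:
        # astype accepts a mapping of column label -> dtype.
        out = out.astype({c: dtype for c in new_cols})
    return out


df = pd.DataFrame({"x": [np.nan, 1.0, 2.0]}).astype(pd.SparseDtype("float", np.nan))
out = reindex_columns_with_dtype(df, ["x", "y"], pd.SparseDtype("float64", np.nan))
print(out.dtypes)
```

Note that this still materialises the new columns densely before converting them, so it fixes the resulting dtype but not the intermediate allocation.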

This is closely related to #31874, where the dtype would be specified by the other DataFrame introducing new columns.

burk · 2020-04-17T06:48:45Z

Thanks for having a look. I agree that specifying the dtype of the new columns would be reasonable and sufficient.

TomAugspurger · 2020-05-04T15:56:09Z

Looks like this overlaps with #20513.

burk added the Bug and Needs Triage labels Apr 16, 2020

TomAugspurger added the API Design, Indexing, and Reshaping labels and removed the Bug and Needs Triage labels Apr 16, 2020

TomAugspurger changed the title ~~BUG: Reindexing columns of sparse dataframe leads to new non-sparse columns~~ API: Specify the dtype of new columns added in reindex Apr 16, 2020

TomAugspurger closed this as completed May 4, 2020

TomAugspurger mentioned this issue May 4, 2020 in "Feature request: Let reindex fill with per-column methods" #20513 (open)
