BUG: DataFrameGroupBy.groups fails when Categorical indexer contains NaNs and dropna=False #61356

Closed
3 tasks done
tehunter opened this issue Apr 25, 2025 · 1 comment · Fixed by #61364
Labels: Bug, Categorical, Groupby, Missing-data
Milestone: 3.0


@tehunter
Contributor

tehunter commented Apr 25, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import numpy as np
>>> from pandas import Categorical, DataFrame
>>> df = DataFrame(
...     {
...         "cat": Categorical(["a", np.nan, "a"], categories=["a", "b", "d"]),
...         "vals": [1, 2, 3],
...     }
... )
>>> g = df.groupby("cat", observed=True, dropna=False)
>>> result = g.groups
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspaces/pandas/pandas/core/groupby/groupby.py", line 569, in groups
    return self._grouper.groups
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/workspaces/pandas/pandas/core/groupby/ops.py", line 710, in groups
    return self.groupings[0].groups
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/workspaces/pandas/pandas/core/groupby/grouper.py", line 711, in groups
    return codes, uniques
  File "/workspaces/pandas/pandas/core/arrays/categorical.py", line 745, in from_codes
    dtype = CategoricalDtype._from_values_or_dtype(
  File "/workspaces/pandas/pandas/core/dtypes/dtypes.py", line 347, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "/workspaces/pandas/pandas/core/dtypes/dtypes.py", line 230, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/workspaces/pandas/pandas/core/dtypes/dtypes.py", line 387, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/workspaces/pandas/pandas/core/dtypes/dtypes.py", line 585, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null
>>>

Issue Description

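Accessing .groups on df.groupby("cat", observed=True, dropna=False) raises "ValueError: Categorical categories cannot be null" when the Categorical grouping column contains NaN. This is counter-intuitive, because aggregations on the same groupby object work without issue and return NaN as a group key: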

>>> df = DataFrame(
...     {
...         "cat": Categorical(["a", np.nan, "a"], categories=["a", "b", "d"]),
...         "vals": [1, 2, 3],
...     }
... )
>>> g = df.groupby("cat", observed=True, dropna=False)
>>> g.sum()
     vals
cat      
a       4
NaN     2
>>> g.sum().index
CategoricalIndex(['a', nan], categories=['a', 'b', 'd'], ordered=False, dtype='category', name='cat')
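
The contrast described above can also be run as a standalone script. This is only a sketch: the variable names agg, groups, and groups_error are illustrative, and whether the .groups access raises depends on the installed pandas version (it should raise on versions affected by this issue and succeed once the fix is included).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", np.nan, "a"], categories=["a", "b", "d"]),
        "vals": [1, 2, 3],
    }
)
g = df.groupby("cat", observed=True, dropna=False)

# Aggregation works and keeps NaN as a group key.
agg = g.sum()
print(agg)

# Accessing .groups is the step reported to fail; capture either outcome.
try:
    groups = g.groups
    groups_error = None
except ValueError as err:
    groups = None
    groups_error = str(err)

print("groups:", groups)
print("error:", groups_error)
```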

Expected Behavior

.groups should return a dictionary whose last entry has NaN as its key and maps to the row labels of the rows where "cat" is NaN.
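
To make the expected mapping concrete, here is a minimal sketch that builds it by hand without going through the failing code path. The helper name expected_groups is hypothetical and not part of pandas, and it returns plain lists of row labels rather than the Index objects that .groups normally holds.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", np.nan, "a"], categories=["a", "b", "d"]),
        "vals": [1, 2, 3],
    }
)


def expected_groups(frame, key):
    """Hypothetical helper: map each observed category, then NaN, to row labels."""
    col = frame[key]
    out = {}
    # Observed categories first, in category order (mirrors observed=True).
    for cat in col.cat.categories:
        labels = frame.index[(col == cat).to_numpy()].tolist()
        if labels:
            out[cat] = labels
    # NaN group last (mirrors dropna=False).
    na_labels = frame.index[col.isna().to_numpy()].tolist()
    if na_labels:
        out[np.nan] = na_labels
    return out


result = expected_groups(df, "cat")
print(result)  # {'a': [0, 2], nan: [1]}
```

Categories "b" and "d" are absent because no rows carry them, which matches what observed=True implies for the aggregation output shown in the issue description.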

Installed Versions

INSTALLED VERSIONS

commit : 41131a1
python : 3.10.8
python-bits : 64
OS : Linux
OS-release : 5.15.167.4-microsoft-standard-WSL2
Version : #1 SMP Tue Nov 5 00:21:55 UTC 2024
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+2085.g41131a1432
numpy : 2.2.5
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.35.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.3.2
html5lib : 1.1
hypothesis : 6.131.8
gcsfs : 2025.3.2
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : 3.10.1
numba : 0.61.2
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.2
pyxlsb : 1.0.10
s3fs : 2025.3.2
scipy : 1.15.2
sqlalchemy : 2.0.40
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.3
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None

tehunter added the Bug and Needs Triage labels on Apr 25, 2025
rhshadrach added the Groupby, Missing-data, and Categorical labels and removed the Needs Triage label on Apr 27, 2025
rhshadrach added this to the 3.0 milestone on Apr 27, 2025
@rhshadrach
Member

Thanks for the report. Confirmed on main. PR to fix is up.
