PERF: Restore old performances with .isin() on columns typed as np.ui… #61320

Open
wants to merge 2 commits into
base: main
pbrochart

…nt64

Only if dtypes are equal (e.g. uint64 vs. uint64, uint32 vs. uint32, ...)

%timeit data["uints"].isin([np.uint64(1), np.uint64(2)]) # 17ms (!)
The last line, with older numpy==1.26.4 (last version <2.0), is even worse: ~200ms.

@pbrochart
Author

pre-commit.ci autofix

@pbrochart
Author

Implicit conversion to float64 happens only with uint64/int64.
I reverted PR #46693 to provide an example based on the initial issue #46485:

import pandas as pd
import numpy as np
test_df = pd.DataFrame([{'a': 1378774140726870442}], dtype=np.uint64)

print(1378774140726870442 == 1378774140726870528) 
#False

print(test_df['a'].isin([1378774140726870528])[0])
#True

print(test_df['a'].isin([1])[0])
#False

The second test should print False. PR #46693 handled this case,
because the wrong result comes from an implicit conversion to float64.
But if we change it to:

print(test_df['a'].isin([np.uint64(1378774140726870528)])[0])
#False

The result is correct because in this case there is no implicit conversion, so it's not necessary to use object dtype.
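As a minimal sketch of why the conversion matters (plain NumPy only, independent of the pandas internals touched by this PR; the variable names are illustrative): the two integers from the example are distinct as uint64 but round to the same float64, and NumPy's common dtype for uint64 and int64 is float64, whereas uint64 with uint64 stays uint64.

```python
import numpy as np

small = 1378774140726870442
large = 1378774140726870528

# As uint64 the two values stay distinct.
as_uint = np.array([small, large], dtype=np.uint64)
distinct_as_uint64 = bool(as_uint[0] != as_uint[1])

# float64 has a 53-bit significand; near 2**60 representable values
# are 256 apart, so both integers round to the same float.
as_float = as_uint.astype(np.float64)
equal_as_float64 = bool(as_float[0] == as_float[1])

# Mixing uint64 with int64 promotes to float64; equal dtypes do not promote.
mixed_dtype = np.result_type(np.uint64, np.int64)
same_dtype = np.result_type(np.uint64, np.uint64)

print(distinct_as_uint64, equal_as_float64, mixed_dtype, same_dtype)
```

A plain Python integer such as 1378774140726870528 fits in int64, which is presumably how the mixed uint64/int64 case arises when it is passed in a list.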
Regarding performance, this partially resolves issue #60098:

Before:

import pandas as pd, numpy as np
data = pd.DataFrame({
    "uints": np.random.randint(10000, size=300000, dtype=np.uint64),
    "ints": np.random.randint(10000, size=300000, dtype=np.int64),
})

%timeit data["uints"].isin([np.uint64(1), np.uint64(2)]) # 239ms

After:

import pandas as pd, numpy as np
data = pd.DataFrame({
    "uints": np.random.randint(10000, size=300000, dtype=np.uint64),
    "ints": np.random.randint(10000, size=300000, dtype=np.int64),
})

%timeit data["uints"].isin([np.uint64(1), np.uint64(2)]) # 4ms
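For anyone reproducing the numbers outside IPython, here is a rough sketch using the standard-library `timeit` module instead of the `%timeit` magic. The run count and the correctness check against an explicit comparison are additions for illustration, not part of the PR, and absolute timings will vary by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "uints": np.random.randint(10000, size=300000, dtype=np.uint64),
})
needles = [np.uint64(1), np.uint64(2)]

# Sanity check: isin must agree with an explicit element-wise comparison.
mask = data["uints"].isin(needles)
expected = (data["uints"] == 1) | (data["uints"] == 2)

# Average wall-clock time per call over a small number of runs.
runs = 10
seconds = timeit.timeit(lambda: data["uints"].isin(needles), number=runs)
print(f"isin: {seconds / runs * 1e3:.2f} ms per call")
```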

Development

Successfully merging this pull request may close these issues.

PERF: Slowdowns with .isin() on columns typed as np.uint64