Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix year-weighted sampling #1778

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Mar 21, 2025

Description of proposed changes

Ensure numeric metadata values are stored as string instead of integer.

Related issue(s)

Closes #1776

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@victorlin victorlin self-assigned this Mar 21, 2025
Storing it as integer has no additional benefits, since subsampling
treats all groups (including date parts) as categorical.
Data types should match the metadata DataFrame, which reads all values
as string.

Interestingly, the default weight query needs to be updated - the
previous equality check against 'default' worked with np.nan but no
longer works with pd.NA. An explicit notna() condition is now necessary.
@victorlin victorlin force-pushed the victorlin/fix-year-weighted-sampling branch from d64bea4 to d0cd2fd Compare March 21, 2025 21:30
Copy link

codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.21%. Comparing base (bbadca3) to head (762324a).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1778   +/-   ##
=======================================
  Coverage   73.21%   73.21%           
=======================================
  Files          79       79           
  Lines        8395     8395           
  Branches     1713     1713           
=======================================
  Hits         6146     6146           
  Misses       1960     1960           
  Partials      289      289           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Set should be used under the hood, but the order of sets is not
guaranteed and it is best to have a guaranteed order for
reproducibility. A simple alphabetical sort is sufficient.

This became prominent when switching the internal representation of
numeric columns such as 'year' from integer to string. The order of
integers seems to be more consistent with sets compared to strings,
and this was inadvertently relied upon in some functional tests.
@victorlin victorlin force-pushed the victorlin/fix-year-weighted-sampling branch from d0cd2fd to 762324a Compare March 21, 2025 22:06
@victorlin victorlin marked this pull request as ready for review March 21, 2025 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas error when weights file has year column and that column contains "default"
1 participant