[SPARK-51847][PYTHON] Extend PySpark testing framework util functions with basic data tests #50644
Conversation
cc @asl3 fyi
remove
Hi @HyukjinKwon, @zhengruifeng, @asl3 — just following up to see if you might have a chance to review the PR when time allows. Appreciate your time and input!
thank you for the contributions! these will be helpful test utils
@asl3 thanks for your review! I've implemented your suggestions. Would be great if you could check it out when you have the time!
Hi @HyukjinKwon, @zhengruifeng, @asl3 — just checking in on this PR. I’ve made the changes based on the earlier feedback, so let me know if there’s anything else you’d like to see. Would be great to get this moving if/when you have a moment. Thanks again!
LGTM, thanks!!
What changes were proposed in this pull request?
This PR extends the PySpark testing framework with four new utility functions for data quality and integrity testing:
- assertColumnUnique: verifies that the specified column(s) contain only unique values
- assertColumnNonNull: checks that the specified column(s) do not contain null values
- assertColumnValuesInSet: ensures that all values in the specified column(s) are within a given set of accepted values
- assertReferentialIntegrity: validates that all non-null values in a source column exist in a target column (similar to a foreign key constraint)

Why are the changes needed?
The PySpark testing framework previously lacked built-in utilities for common data quality checks, so each test suite had to hand-roll them. These new utility functions close that gap by providing standardized, well-tested implementations of the most common checks. They reduce boilerplate code, improve test readability, and enable testing patterns similar to those in popular data testing frameworks like dbt. A sketch of the kind of hand-rolled check they replace follows.
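For context, here is a minimal sketch of the kind of hand-rolled check these utilities replace (the helper name check_unique is illustrative, not part of the PR):

```python
from pyspark.sql import functions as F

# Without the new utilities, a uniqueness check is typically hand-rolled:
def check_unique(df, column):
    duplicates = (
        df.groupBy(column)
        .agg(F.count("*").alias("cnt"))
        .filter(F.col("cnt") > 1)
    )
    assert duplicates.count() == 0, f"Column {column} contains duplicate values"

# With this PR, the same check collapses to a single call:
# assertColumnUnique(df, "id")
```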
Does this PR introduce any user-facing change?
Yes, this PR introduces new public utility functions in the pyspark.testing module. These are additive changes that don't modify existing functionality.

Example usage:
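The following is a minimal sketch of how the new assertions might be called; the sample data, argument order, and parameter names are illustrative assumptions rather than the final API:

```python
from pyspark.sql import SparkSession
from pyspark.testing import (
    assertColumnUnique,
    assertColumnNonNull,
    assertColumnValuesInSet,
    assertReferentialIntegrity,
)

spark = SparkSession.builder.getOrCreate()

users = spark.createDataFrame(
    [(1, "alice", "active"), (2, "bob", "inactive")],
    ["id", "name", "status"],
)
orders = spark.createDataFrame([(100, 1), (101, 2)], ["order_id", "user_id"])

# Each check is expected to raise on failure and return None otherwise.
assertColumnUnique(users, "id")
assertColumnNonNull(users, "name")
assertColumnValuesInSet(users, "status", {"active", "inactive"})

# All non-null orders.user_id values must exist in users.id,
# similar to a foreign key constraint.
assertReferentialIntegrity(orders, "user_id", users, "id")
```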
How was this patch tested?
Comprehensive tests were added for all new functions in python/pyspark/sql/tests/test_utils.py. Each function has multiple test methods covering both positive and negative cases. For example, assertReferentialIntegrity has tests for a valid relationship, an invalid relationship with a single missing value, multiple missing values, and proper handling of null values. All tests pass on the current master branch.
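As an illustration of the positive/negative test pattern described above, a test case might look like the following hypothetical sketch (the actual tests live in python/pyspark/sql/tests/test_utils.py, and the exception type raised on failure is an assumption):

```python
import unittest
from pyspark.sql import SparkSession
from pyspark.testing import assertColumnUnique  # added by this PR

class ColumnUniqueTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_unique_column_passes(self):
        df = self.spark.createDataFrame([(1,), (2,), (3,)], ["id"])
        assertColumnUnique(df, "id")  # should not raise

    def test_duplicate_column_fails(self):
        df = self.spark.createDataFrame([(1,), (1,)], ["id"])
        # The exact exception class is an assumption; the util is
        # expected to raise when duplicate values are present.
        with self.assertRaises(Exception):
            assertColumnUnique(df, "id")

if __name__ == "__main__":
    unittest.main()
```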
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude 3.7 Sonnet