[SPARK-51847][PYTHON] Extend PySpark testing framework util functions with basic data tests #50644

Status: Open. 22 commits to be merged into base branch master.

stanlocht

What changes were proposed in this pull request?

This PR extends the PySpark testing framework with four new utility functions for data quality and integrity testing:

  1. assertColumnUnique: Verifies that specified column(s) contain only unique values
  2. assertColumnNonNull: Checks that specified column(s) do not contain null values
  3. assertColumnValuesInSet: Ensures all values in specified column(s) are within a given set of accepted values
  4. assertReferentialIntegrity: Validates that all non-null values in a source column exist in a target column (similar to foreign key constraints)
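The semantics of the four checks can be illustrated without Spark. The following is a minimal pure-Python sketch over rows represented as dictionaries. The helper names (`duplicate_values`, `null_count`, `values_outside`, `missing_references`) are hypothetical and are not the functions added by this PR, whose signatures, null handling, and error messages may differ.

```python
from collections import Counter

# Rows are plain dicts here; None stands in for a SQL NULL.
orders = [
    {"id": 1, "customer_id": 10, "status": "open"},
    {"id": 2, "customer_id": 11, "status": "shipped"},
    {"id": 2, "customer_id": None, "status": "lost"},
    {"id": 3, "customer_id": 99, "status": "open"},
]
customers = [{"id": 10}, {"id": 11}]


def duplicate_values(rows, column):
    """Uniqueness check: values that occur more than once in `column`."""
    counts = Counter(row[column] for row in rows)
    return [value for value, n in counts.items() if n > 1]


def null_count(rows, column):
    """Non-null check: number of rows where `column` is None."""
    return sum(1 for row in rows if row[column] is None)


def values_outside(rows, column, accepted):
    """Accepted-values check: distinct values not in `accepted`.

    Assumption: nulls are skipped here; the PR may treat them differently.
    """
    return {row[column] for row in rows if row[column] is not None} - set(accepted)


def missing_references(source, source_col, target, target_col):
    """Referential integrity: non-null source values absent from the target column."""
    target_values = {row[target_col] for row in target}
    source_values = {row[source_col] for row in source if row[source_col] is not None}
    return source_values - target_values


print(duplicate_values(orders, "id"))                             # [2]
print(null_count(orders, "customer_id"))                          # 1
print(values_outside(orders, "status", {"open", "shipped"}))      # {'lost'}
print(missing_references(orders, "customer_id", customers, "id")) # {99}
```

In each case an empty result (or a count of zero) corresponds to a passing assertion; anything else is what an assertion failure would report.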

Why are the changes needed?

The PySpark testing framework currently lacks built-in helpers for basic data quality checks. These new utility functions fill that gap with standardized, well-tested implementations of the most common checks. They reduce boilerplate code, improve test readability, and enable testing patterns similar to those in popular data testing frameworks like dbt.

Does this PR introduce any user-facing change?

Yes, this PR introduces new public utility functions in the pyspark.testing module. These are additive changes that don't modify existing functionality.

Example usage:

from pyspark.testing import assertColumnUnique, assertReferentialIntegrity

# Check that 'id' column contains only unique values
assertColumnUnique(df, "id")

# Check that all customer_ids in orders exist in customers.id
assertReferentialIntegrity(orders, "customer_id", customers, "id")

How was this patch tested?

Comprehensive tests were added for all new functions in python/pyspark/sql/tests/test_utils.py. The tests cover:

  • Basic functionality with valid inputs
  • Error cases with invalid inputs
  • Edge cases (e.g., null values, empty DataFrames)
  • Different DataFrame types (Spark, pandas, pandas-on-Spark)
  • Detailed validation of error messages

Each function has multiple test methods that verify both positive and negative test cases. For example, assertReferentialIntegrity has tests for valid relationships, invalid relationships with a single missing value, multiple missing values, and proper handling of null values.
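The four referential-integrity cases named above could be laid out roughly as follows. This sketch uses a hypothetical pure-Python stand-in, `assert_references_exist`, rather than the PR's `assertReferentialIntegrity`, so that it runs without a Spark session. The error-message format is likewise an assumption, not the message the PR produces.

```python
import unittest


def assert_references_exist(source_values, target_values):
    """Hypothetical stand-in for the PR's assertion.

    Raises AssertionError listing the source values absent from the target.
    None (NULL) source values are ignored, as with foreign-key constraints.
    """
    target = set(target_values)
    missing = sorted({v for v in source_values if v is not None} - target)
    if missing:
        raise AssertionError(f"Missing values in target: {missing}")


class ReferentialIntegritySketchTest(unittest.TestCase):
    customers = [1, 2, 3]

    def test_valid_relationship(self):
        # Every source value exists in the target; repeats are fine.
        assert_references_exist([1, 2, 2, 3], self.customers)

    def test_single_missing_value(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_references_exist([1, 4], self.customers)
        self.assertIn("[4]", str(ctx.exception))

    def test_multiple_missing_values(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_references_exist([4, 1, 5], self.customers)
        self.assertIn("[4, 5]", str(ctx.exception))

    def test_nulls_are_ignored(self):
        # A NULL source value does not count as a broken reference.
        assert_references_exist([1, None, 3], self.customers)
```

Saved as a module, the cases can be run with `python -m unittest`. The negative cases check the message text as well as the exception type, which matches the "detailed validation of error messages" item in the list of covered scenarios.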

All tests pass on the current master branch.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 3.7 Sonnet

@HyukjinKwon
Member

cc @asl3 fyi

@zhengruifeng changed the title from "[SPARK-51847][PYTHON][TESTS] Extend PySpark testing framework util functions with basic data tests" to "[SPARK-51847][PYTHON] Extend PySpark testing framework util functions with basic data tests" on Apr 21, 2025
@zhengruifeng
Contributor

Removed [TESTS] from the title, since this appears to be a new user-facing feature.

@stanlocht
Author

Hi @HyukjinKwon, @zhengruifeng, @asl3 — just following up to see if you might have a chance to review the PR when time allows. Appreciate your time and input!

@asl3 left a comment
Contributor

thank you for the contributions! these will be helpful test utils

@stanlocht
Author

@asl3 thanks for your review! I've implemented your suggestions. It would be great if you could check it out when you have the time!

@stanlocht
Author

Hi @HyukjinKwon, @zhengruifeng, @asl3 — just checking in on this PR. I’ve made the changes based on the earlier feedback, so let me know if there’s anything else you’d like to see. Would be great to get this moving if/when you have a moment. Thanks again!

@asl3 left a comment
Contributor

LGTM, thanks!!
