[SPARK-51847][PYTHON] Extend PySpark testing framework util functions with basic data tests #50644
Conversation
cc @asl3 fyi
remove
Hi @HyukjinKwon, @zhengruifeng, @asl3 — just following up to see if you might have a chance to review the PR when time allows. Appreciate your time and input!
thank you for the contributions! these will be helpful test utils
@asl3 thanks for your review! I've implemented your suggestions. Would be great if you could check it out when you have the time!
Hi @HyukjinKwon, @zhengruifeng, @asl3 — just checking in on this PR. I’ve made the changes based on the earlier feedback, so let me know if there’s anything else you’d like to see. Would be great to get this moving if/when you have a moment. Thanks again!
LGTM, thanks!!
What changes were proposed in this pull request?
This PR extends the PySpark testing framework with four new utility functions for data quality and integrity testing:
- assertColumnUnique: verifies that the specified column(s) contain only unique values
- assertColumnNonNull: checks that the specified column(s) do not contain null values
- assertColumnValuesInSet: ensures that all values in the specified column(s) are within a given set of accepted values
- assertReferentialIntegrity: validates that all non-null values in a source column exist in a target column (similar to a foreign key constraint)

Why are the changes needed?
The PySpark testing framework previously lacked built-in utilities for common data quality checks, so each test suite had to hand-roll them. These new utility functions close that gap by providing standardized, well-tested implementations of the most common checks. They reduce boilerplate code, improve test readability, and enable testing patterns similar to those in popular data testing frameworks like dbt. A sketch of the kind of hand-rolled check they replace follows.
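For context, here is a minimal sketch of the kind of hand-rolled check these utilities replace (the helper name check_unique is illustrative, not part of the PR):

```python
from pyspark.sql import functions as F

# Without the new utilities, a uniqueness check is typically hand-rolled:
def check_unique(df, column):
    duplicates = (
        df.groupBy(column)
        .agg(F.count("*").alias("cnt"))
        .filter(F.col("cnt") > 1)
    )
    assert duplicates.count() == 0, f"Column {column} contains duplicate values"

# With this PR, the same check collapses to a single call:
# assertColumnUnique(df, "id")
```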
Does this PR introduce any user-facing change?
Yes, this PR introduces new public utility functions in the pyspark.testing module. These are additive changes that don't modify existing functionality.

Example usage:
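The following is a minimal sketch of how the new assertions might be called; the sample data, argument order, and parameter names are illustrative assumptions rather than the final API:

```python
from pyspark.sql import SparkSession
from pyspark.testing import (
    assertColumnUnique,
    assertColumnNonNull,
    assertColumnValuesInSet,
    assertReferentialIntegrity,
)

spark = SparkSession.builder.getOrCreate()

users = spark.createDataFrame(
    [(1, "alice", "active"), (2, "bob", "inactive")],
    ["id", "name", "status"],
)
orders = spark.createDataFrame([(100, 1), (101, 2)], ["order_id", "user_id"])

# Each check is expected to raise on failure and return None otherwise.
assertColumnUnique(users, "id")
assertColumnNonNull(users, "name")
assertColumnValuesInSet(users, "status", {"active", "inactive"})

# All non-null orders.user_id values must exist in users.id,
# similar to a foreign key constraint.
assertReferentialIntegrity(orders, "user_id", users, "id")
```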
How was this patch tested?
Comprehensive tests were added for all new functions in python/pyspark/sql/tests/test_utils.py. Each function has multiple test methods covering both positive and negative cases. For example, assertReferentialIntegrity has tests for a valid relationship, an invalid relationship with a single missing value, multiple missing values, and proper handling of null values. All tests pass on the current master branch.
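As an illustration of the positive/negative test pattern described above, a test case might look like the following hypothetical sketch (the actual tests live in python/pyspark/sql/tests/test_utils.py, and the exception type raised on failure is an assumption):

```python
import unittest
from pyspark.sql import SparkSession
from pyspark.testing import assertColumnUnique  # added by this PR

class ColumnUniqueTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_unique_column_passes(self):
        df = self.spark.createDataFrame([(1,), (2,), (3,)], ["id"])
        assertColumnUnique(df, "id")  # should not raise

    def test_duplicate_column_fails(self):
        df = self.spark.createDataFrame([(1,), (1,)], ["id"])
        # The exact exception class is an assumption; the util is
        # expected to raise when duplicate values are present.
        with self.assertRaises(Exception):
            assertColumnUnique(df, "id")

if __name__ == "__main__":
    unittest.main()
```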
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude 3.7 Sonnet