
Create Spark DataFrame comparison using SparkSQLCompare.py #330

Closed
wants to merge 1 commit

Conversation

divithraju
Description:

This PR introduces functionality to compare two Spark DataFrames using the SparkSQLCompare class from DataComPy. The new feature supports efficient comparison of large datasets directly in PySpark, which should provide better performance for Spark users. This change aligns with the existing API for DataFrame comparisons and expands Spark support by leveraging the native Spark SQL comparison logic.

Key Changes:

  • Added a compare_dataframes function that compares PySpark DataFrames using SparkSQLCompare.
  • Modularized code into distinct functions for Spark session initialization, DataFrame creation, and comparison.
  • Implemented logging for better traceability during execution.
  • Added error handling around Spark session setup and DataFrame operations for robustness.

Testing:

  • Basic functionality tested with sample data.
  • No breaking changes introduced. Existing features remain unaffected.

Motivation:

This contribution adds a missing feature for users working with PySpark, providing a more efficient comparison method using native Spark SQL logic. It also follows the DataComPy roadmap for better Spark integration and performance improvements.

Next Steps:

  1. Extend unit tests to cover edge cases for Spark DataFrame comparison.
  2. Optimize for large datasets with different join conditions.

Please review and provide feedback. I look forward to contributing further!

@CLAassistant

CLAassistant commented Sep 11, 2024

CLA assistant check
All committers have signed the CLA.

@divithraju divithraju reopened this Sep 14, 2024
@fdosani
Member

fdosani commented Sep 14, 2024

@divithraju I'm not sure what this PR is for or what it is trying to accomplish. On the surface, it appears you are just creating a simple implementation of how to use datacompy?

I'm not sure there is a need for something like this. Most of it is captured in the documentation: setting up a Spark session and calling the compare / report methods. You're just wrapping existing public methods.

@fdosani fdosani closed this Sep 14, 2024
@divithraju
Author

Hi @fdosani,

Thank you for the feedback! I now understand that this PR may not add much new functionality, since it mostly wraps existing DataComPy methods. My initial intention was to provide an easy-to-follow implementation for Spark users.

Based on your feedback, I can:

  1. Turn this PR into a documentation improvement with better examples and explanations.
  2. Expand the code to include new features, such as performance improvements for large datasets or custom join conditions.

Please let me know which direction you prefer, and I’ll update the PR accordingly. Thanks again for your guidance!

@fdosani
Member

fdosani commented Sep 20, 2024

@divithraju you can take a look through the issues, and if something catches your eye feel free to submit based on that.
