
Create Spark DataFrame comparison using SparkSQLCompare.py #330

Closed
wants to merge 1 commit

Conversation

divithraju
Description:

This PR introduces functionality to compare two Spark DataFrames using the SparkSQLCompare class from DataComPy. The new feature supports efficient comparison of large datasets directly in PySpark, which should provide better performance for Spark users. This change aligns with the existing API for DataFrame comparisons and expands Spark support by leveraging the native Spark SQL comparison logic.

Key Changes:

  • Added a compare_dataframes function that compares PySpark DataFrames using SparkSQLCompare.
  • Modularized code into distinct functions for Spark session initialization, DataFrame creation, and comparison.
  • Implemented logging for better traceability during execution.
  • Added error handling around Spark session setup and DataFrame operations for robustness.

Testing:

  • Basic functionality tested with sample data.
  • No breaking changes introduced. Existing features remain unaffected.

Motivation:

This contribution adds a missing feature for users working with PySpark, providing a more efficient comparison method using native Spark SQL logic. It also follows the DataComPy roadmap for better Spark integration and performance improvements.

Next Steps:

  1. Extend unit tests to cover edge cases for Spark DataFrame comparison.
  2. Optimize for large datasets with different join conditions.

Please review and provide feedback. I look forward to contributing further!

@CLAassistant

CLAassistant commented Sep 11, 2024

CLA assistant check
All committers have signed the CLA.

@divithraju divithraju reopened this Sep 14, 2024
@fdosani
Member

fdosani commented Sep 14, 2024

@divithraju I'm not sure what this PR is for or what it is trying to accomplish. On the surface, it appears you are just creating a simple implementation of how to use datacompy?

I'm not sure there is a need for something like this. Most of it is captured in the documentation: setting up a Spark session and calling the compare / report methods. You're just wrapping existing public methods.

@fdosani fdosani closed this Sep 14, 2024
@divithraju
Author

Hi @fdosani,

Thank you for the feedback! I now understand that this PR may not add much new functionality, since it mostly wraps existing DataComPy methods. My initial intention was to provide an easy-to-follow implementation for Spark users.

Based on your feedback, I can:

  1. Turn this PR into a documentation improvement with better examples and explanations.
  2. Expand the code to include new features, such as performance improvements for large datasets or custom join conditions.

Please let me know which direction you prefer, and I’ll update the PR accordingly. Thanks again for your guidance!

@fdosani
Member

fdosani commented Sep 20, 2024

@divithraju you can take a look through the issues, and if something catches your eye feel free to submit based on that.
