Skip to content

Add save_results_in_table function #319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

karthik-ballullaya-db
Copy link
Member

Changes

Provide functionality to write output and quarantine data to tables as defined in the config.

Linked issues

Resolves #167

Tests

  • manually tested
  • added unit tests
  • added integration tests

def save_results_in_table(
self,
quarantine_df: DataFrame | None = None,
output_df: DataFrame | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add overwrite: bool = True option so that new data can also be appended.

assume_user: bool = True,
):
"""
Save quarantine and output data to the `quarantine_table` and `output_table` mentioned in the installation config file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Save quarantine and output data to the `quarantine_table` and `output_table` mentioned in the installation config file.
Save quarantine and output data to the `quarantine_table` and `output_table` defined in the installation config file.

@@ -618,6 +618,38 @@ def save_checks_in_installation(
)
installation.upload(run_config.checks_file, yaml.safe_dump(checks).encode('utf-8'))

def save_results_in_table(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update documentation and tool demo accordingly.

@@ -0,0 +1,76 @@
import pytest
Copy link
Contributor

@mwojtyczka mwojtyczka May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to unit tests (tests/unit folder). I would generally avoid extensive mocking as it makes refactoring later difficult and tests hard to understand. Also please don't relay on private methods for testing.

For integration testing please setup a proper dataframe and make sure the data is actually saved in the tables as defined in the config. We have plenty of test examples where we setup and use a config.

mock_run_config = MagicMock()
mock_run_config.quarantine_table = "quarantine_table"
mock_run_config.output_table = None
mock_engine._load_run_config.return_value = mock_run_config
Copy link
Contributor

@mwojtyczka mwojtyczka May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use private methods in tests. It breaks encapsulation.

@mwojtyczka mwojtyczka requested a review from Copilot May 20, 2025 15:49
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a new save_results_in_table method to DQEngine that writes quarantine and output DataFrames to Delta tables based on the run configuration, and includes integration tests to verify its behavior.

  • Introduces save_results_in_table in engine.py with parameters for quarantine and output DataFrames, run config name, product, and install assumption.
  • Writes DataFrames to Delta tables in overwrite or append mode, guarded by presence of corresponding table names in the config.
  • Adds integration tests covering scenarios: quarantine only, output only, both, and neither.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/databricks/labs/dqx/engine.py Added save_results_in_table method with Delta writes
tests/integration/test_save_results_in_table.py Added integration tests for the new method
Comments suppressed due to low confidence (1)

tests/integration/test_save_results_in_table.py:67

  • This test checks config loading but does not verify that no write operations occur on DataFrames. Consider adding assertions that neither quarantine_df.write nor output_df.write is called when both are None.
def test_save_results_in_table_no_data(mock_engine):

@@ -0,0 +1,76 @@
import pytest
from unittest.mock import MagicMock, patch
Copy link
Preview

Copilot AI May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The patch import is never used in this file. Consider removing it to clean up unused imports.

Suggested change
from unittest.mock import MagicMock, patch
from unittest.mock import MagicMock

Copilot uses AI. Check for mistakes.

Comment on lines +641 to +651
if quarantine_df is not None and run_config.quarantine_table:
logger.info(
f"Saving quarantine data to {run_config.quarantine_table} table"
)
quarantine_df.write.format("delta").mode("overwrite").saveAsTable(run_config.quarantine_table)

if output_df is not None and run_config.output_table:
logger.info(
f"Saving output data to {run_config.output_table} table"
)
output_df.write.format("delta").mode("append").saveAsTable(run_config.output_table)
Copy link
Preview

Copilot AI May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The write logic for quarantine and output DataFrames is nearly identical except for mode and table name. Consider extracting a helper function to reduce duplication and improve readability.

Suggested change
if quarantine_df is not None and run_config.quarantine_table:
logger.info(
f"Saving quarantine data to {run_config.quarantine_table} table"
)
quarantine_df.write.format("delta").mode("overwrite").saveAsTable(run_config.quarantine_table)
if output_df is not None and run_config.output_table:
logger.info(
f"Saving output data to {run_config.output_table} table"
)
output_df.write.format("delta").mode("append").saveAsTable(run_config.output_table)
def _save_dataframe(self, df: DataFrame, table_name: str, mode: str, description: str):
"""
Helper method to save a DataFrame to a Delta table.
:param df: The DataFrame to save
:param table_name: The name of the Delta table
:param mode: The write mode (e.g., "overwrite", "append")
:param description: Description of the data being saved (for logging purposes)
"""
logger.info(f"Saving {description} to {table_name} table")
df.write.format("delta").mode(mode).saveAsTable(table_name)
if quarantine_df is not None and run_config.quarantine_table:
self._save_dataframe(
df=quarantine_df,
table_name=run_config.quarantine_table,
mode="overwrite",
description="quarantine data"
)
if output_df is not None and run_config.output_table:
self._save_dataframe(
df=output_df,
table_name=run_config.output_table,
mode="append",
description="output data"
)

Copilot uses AI. Check for mistakes.

@mwojtyczka mwojtyczka mentioned this pull request May 20, 2025
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE]: Provide option to write output and quarantine data to tables as defined in the config
2 participants