Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB) with centralized rule configuration.
# Using pip
pip install sumeh
# Or with conda-forge
conda install -c conda-forge sumeh
Prerequisites:
- Python 3.10+
- One or more of: pyspark, dask[dataframe], polars, duckdb, cuallee
- report(df, rules, name="Quality Check"): Apply your validation rules over any DataFrame (Pandas, Spark, Dask, Polars, or DuckDB).
- validate(df, rules) (per-engine): Returns a DataFrame with a dq_status column listing violations.
- summarize(qc_df, rules, total_rows) (per-engine): Consolidates violations into a summary report.
Each engine implements the validate() + summarize() pair:
Engine | Module | Status |
---|---|---|
PySpark | sumeh.engine.pyspark_engine | ✅ Fully implemented |
Dask | sumeh.engine.dask_engine | ✅ Fully implemented |
Polars | sumeh.engine.polars_engine | ✅ Fully implemented |
DuckDB | sumeh.engine.duckdb_engine | ✅ Fully implemented |
Pandas | sumeh.engine.pandas_engine | 🔧 Stub implementation |
BigQuery (SQL) | sumeh.engine.bigquery_engine | 🔧 Stub implementation |
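To make the two-step flow concrete, here is a minimal sketch in plain Python of what a validate() + summarize() pair does conceptually. It is not Sumeh's engine code: the helper names `validate_rows` and `summarize_rows`, the `field:check_type` format of the status string, and the PASS/FAIL labels are all assumptions made for illustration, and only the `is_complete` check is handled.

```python
# Conceptual sketch only: plain Python, not Sumeh's engine implementation.
# The "field:check_type" status format and the PASS/FAIL labels are assumptions.


def validate_rows(rows, rules):
    """Tag each row with a dq_status string naming the rules it violates."""
    tagged = []
    for row in rows:
        violations = [
            f"{rule['field']}:{rule['check_type']}"
            for rule in rules
            if rule["check_type"] == "is_complete" and row.get(rule["field"]) is None
        ]
        tagged.append({**row, "dq_status": ";".join(violations)})
    return tagged


def summarize_rows(qc_rows, rules, total_rows):
    """Consolidate per-row violations into one summary entry per rule."""
    summary = []
    for rule in rules:
        key = f"{rule['field']}:{rule['check_type']}"
        failed = sum(1 for row in qc_rows if key in row["dq_status"].split(";"))
        pass_rate = (total_rows - failed) / total_rows
        summary.append(
            {
                "rule": key,
                "violations": failed,
                "pass_rate": pass_rate,
                "status": "PASS" if pass_rate >= rule["threshold"] else "FAIL",
            }
        )
    return summary


rows = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}, {"customer_id": 4}]
rules = [{"field": "customer_id", "check_type": "is_complete", "threshold": 0.99}]

qc_rows = validate_rows(rows, rules)
summary = summarize_rows(qc_rows, rules, total_rows=len(rows))
print(summary)
```

Splitting the work this way keeps row-level detail (which rows failed which rule) separate from the aggregate verdict, which is presumably why each engine exposes both functions.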
Load rules from CSV, S3, MySQL, Postgres, BigQuery table, or AWS Glue:
from sumeh.services.config import (
    get_config_from_csv,
    get_config_from_s3,
    get_config_from_mysql,
    get_config_from_postgresql,
    get_config_from_bigquery,
    get_config_from_glue_data_catalog,
)
rules = get_config_from_csv("rules.csv", delimiter=";")
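For orientation, the sketch below shows what a semicolon-delimited rules file could look like and how it maps to rule dictionaries. It uses only the standard library rather than Sumeh's loader. The column names are assumed to mirror the rule fields (field, check_type, threshold, value, execute), and `parse_rules` is a hypothetical helper, so check the actual file layout expected by get_config_from_csv before relying on it.

```python
import csv
import io

# Hypothetical rules file; the column names are an assumption, not Sumeh's documented schema.
RULES_CSV = """field;check_type;threshold;value;execute
customer_id;is_complete;0.99;;true
age;is_greater_than;1.0;18;false
"""


def parse_rules(text, delimiter=";"):
    """Parse delimited rule rows into dictionaries with typed fields."""
    parsed = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        parsed.append(
            {
                "field": record["field"],
                "check_type": record["check_type"],
                "threshold": float(record["threshold"]),
                # An empty cell is treated as "no value" for checks that need none.
                "value": record["value"] or None,
                "execute": record["execute"].strip().lower() == "true",
            }
        )
    return parsed


parsed_rules = parse_rules(RULES_CSV)
for rule in parsed_rules:
    print(rule)
```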
from sumeh import report
from sumeh.engine.polars_engine import validate, summarize
import polars as pl
# 1) Load data
df = pl.read_csv("data.csv")
# 2) Run validation
qc_df = validate(df, rules)
# 3) Generate summary
total = df.height
summary = summarize(qc_df, rules, total)  # avoid shadowing the imported report()
print(summary)
Or simply:
from sumeh import report
result = report(df, rules, name="My Check")  # keep the name report bound to the function
{
"field": "customer_id",
"check_type": "is_complete",
"threshold": 0.99,
"value": null,
"execute": true
}
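As a rough illustration of how such a rule might be consumed, the snippet below loads rule definitions from JSON, keeps only those flagged for execution, and compares a pass rate against the threshold. Reading execute as an on/off switch and threshold as a minimum pass rate is an interpretation of the field names, not confirmed behavior. The second rule and the `meets_threshold` helper are hypothetical.

```python
import json

# The first rule is the one shown above; the second is a made-up example.
RULES_JSON = """
[
  {"field": "customer_id", "check_type": "is_complete",
   "threshold": 0.99, "value": null, "execute": true},
  {"field": "email", "check_type": "has_pattern",
   "threshold": 1.0, "value": "^[^@]+@[^@]+$", "execute": false}
]
"""

all_rules = json.loads(RULES_JSON)

# Assumption: rules with "execute": false are skipped.
active_rules = [rule for rule in all_rules if rule.get("execute", True)]


def meets_threshold(passing_rows, total_rows, rule):
    """Assumption: a rule passes when the share of valid rows reaches its threshold."""
    return passing_rows / total_rows >= rule["threshold"]


print([rule["field"] for rule in active_rules])
print(meets_threshold(995, 1000, active_rules[0]))
```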
Sumeh supports a wide variety of validation checks including:
- Completeness checks (is_complete, are_complete)
- Uniqueness checks (is_unique, are_unique, is_primary_key, is_composite_key)
- Value comparisons (is_greater_than, is_less_than, is_equal, is_between)
- Set operations (is_contained_in, not_contained_in)
- Pattern matching (has_pattern)
- Statistical checks (has_min, has_max, has_mean, has_std, has_sum)
- Date validations (is_today, is_yesterday, is_on_weekday, etc.)
- Custom expressions (satisfies)
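The intent behind a few of these check names can be sketched as plain Python predicates applied to a single cell. This is an illustration of the semantics the names suggest, not Sumeh's code: the `CHECKS` registry and `passes` helper are hypothetical, and details such as inclusive bounds for is_between or full-string matching for has_pattern are assumptions.

```python
import re

# Hypothetical registry mapping check names to per-cell predicates.
# Each predicate takes (cell, value), where value is the rule's "value" field.
CHECKS = {
    "is_complete": lambda cell, _: cell is not None,
    "is_greater_than": lambda cell, limit: cell is not None and cell > limit,
    # Assumption: both bounds are inclusive.
    "is_between": lambda cell, bounds: cell is not None and bounds[0] <= cell <= bounds[1],
    "is_contained_in": lambda cell, allowed: cell in allowed,
    "not_contained_in": lambda cell, banned: cell not in banned,
    # Assumption: the pattern must match the whole string.
    "has_pattern": lambda cell, pattern: cell is not None
    and re.fullmatch(pattern, cell) is not None,
}


def passes(check_type, cell, value=None):
    """Return True when the cell satisfies the named check."""
    return CHECKS[check_type](cell, value)


print(passes("is_between", 42, (18, 120)))
print(passes("has_pattern", "abc-123", r"[a-z]+-\d+"))
print(passes("is_contained_in", "BR", ["US", "CA"]))
```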
sumeh/
├── poetry.lock
├── pyproject.toml
├── README.md
├── sumeh
│   ├── __init__.py
│   ├── cli.py
│   ├── core.py
│   ├── engine
│   │   ├── __init__.py
│   │   ├── bigquery_engine.py
│   │   ├── dask_engine.py
│   │   ├── duckdb_engine.py
│   │   ├── polars_engine.py
│   │   └── pyspark_engine.py
│   └── services
│       ├── __init__.py
│       ├── config.py
│       ├── index.html
│       └── utils.py
└── tests
    ├── __init__.py
    ├── mock
    │   ├── config.csv
    │   └── data.csv
    ├── test_dask_engine.py
    ├── test_duckdb_engine.py
    ├── test_polars_engine.py
    ├── test_pyspark_engine.py
    └── test_sumeh.py
- Complete BigQuery engine implementation
- Complete Pandas engine implementation
- Enhanced documentation
- More validation rule types
- Performance optimizations
- Fork & create a feature branch
- Implement new checks or engines, following existing signatures
- Add tests under tests/
- Open a PR and ensure CI passes
Licensed under the Apache License 2.0.