This project demonstrates how to build a Data Lakehouse using modern open-source technologies.
It integrates Kafka, Spark Streaming, Apache Iceberg, Apache Nessie, Trino, and DBT into an end-to-end data pipeline, from ingestion to analytics.
🔹 Streaming Layer: Spark Streaming / Flink processes real-time Kafka events.
🔹 Batch Layer: Apache Spark and Airbyte handle batch ingestion into an S3 Minio data lake.
🔹 Metadata Layer: Apache Nessie is used as the Iceberg catalog, backed by PostgreSQL.
🔹 Transformation Layer: Trino and DBT transform and store data in Vertica for analytics.
🔹 Orchestration: Apache Airflow schedules and manages all ETL/ELT workflows.
🔹 Synthetic Data: Generated with the `sdv` Python library from the SalesDB_v1 dataset (see the sketch after this list).
🔹 Visualization: Dashboards built in Tableau, Superset, or Power BI.
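
The synthetic-data step above takes only a few lines with `sdv`. The sketch below is a minimal example assuming SDV 1.x and a hypothetical CSV export of one SalesDB_v1 table; the file paths, table, and choice of synthesizer are illustrative, not the repo's exact script.

```python
# Minimal sketch of synthetic data generation with SDV 1.x.
# Paths and the "orders" table are hypothetical examples.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load a sample of the real source table (path is illustrative).
real_data = pd.read_csv("data/salesdb_v1/orders.csv")

# Infer column types from the dataframe.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a synthesizer on the real rows, then sample synthetic ones.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=10_000)
synthetic_data.to_csv("data/synthetic/orders.csv", index=False)
```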
| Category | Technology |
|---|---|
| Streaming | Kafka, Spark Streaming, Flink |
| Storage | S3 Minio (Apache Iceberg) |
| Metadata | Apache Nessie, PostgreSQL |
| ETL/ELT | Airflow, DBT, Apache Spark |
| Query Engine | Trino, Vertica |
| Orchestration | Apache Airflow |
| Dashboarding | Tableau, Superset, Power BI |
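
As a quick illustration of the query layer, the lakehouse can also be queried programmatically through Trino. The sketch below uses the `trino` Python client (`pip install trino`); the catalog and schema names (`iceberg`, `ods`) and the `orders` table are assumptions, so adjust them to the catalogs defined under `trino/`.

```python
# Minimal sketch of querying the lakehouse through Trino.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8083,          # Trino port exposed by docker-compose (see below)
    user="admin",
    catalog="iceberg",  # assumed name of the Nessie-backed Iceberg catalog
    schema="ods",
)
cur = conn.cursor()
cur.execute("SELECT order_id, amount FROM orders LIMIT 10")  # hypothetical table
for row in cur.fetchall():
    print(row)
```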
- Streaming events flow from Kafka into Spark Streaming / Flink (a minimal job is sketched after this list).
- Batch data is ingested via Airbyte into S3 Minio (Iceberg format).
- Raw (`raw`) and Operational (`ods`) data are stored in S3 Minio (Apache Iceberg).
- Apache Nessie acts as the metadata catalog, tracking schema versions and changes.
- DBT + Trino handle transformations.
- Processed marts (`marts`) are stored in Vertica for BI consumption.
- Apache Airflow schedules all ETL/ELT jobs.
- Dashboards are created using Tableau, Superset, or Power BI.
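
The streaming path in the first step above boils down to a Structured Streaming job that reads Kafka and appends to an Iceberg table registered in Nessie. Below is a minimal sketch, assuming in-network service names (`kafka:9092`, `nessie:19120`), a hypothetical `sales_events` topic, and a `warehouse` bucket in Minio; the real jobs live under `spark/`.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> Iceberg (Nessie catalog).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("kafka-to-iceberg")
    # Register a Nessie-backed Iceberg catalog named "nessie".
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")  # assumed service URL
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")      # assumed Minio bucket
    .getOrCreate()
)

# Read raw events from Kafka and cast the payload to strings.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
    .option("subscribe", "sales_events")              # hypothetical topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

# Append the stream into a raw-layer Iceberg table.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/sales_events")
    .toTable("nessie.raw.sales_events")  # hypothetical raw-layer table
)
query.awaitTermination()
```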
🔹 Kafka UI (Monitor real-time events)
🔹 S3 Minio Browser (View stored Iceberg tables)
🔹 Apache Nessie UI (Track schema changes)
🔹 Airflow DAGs (Monitor ETL workflows)
🔹 DBT Lineage Graph (View transformation dependencies)
🔹 BI Dashboards (Analytics & insights)
This project is fully containerized using Docker Compose.
- Docker & Docker Compose
- Python (for data generation)
```bash
git clone https://github.com/Turakulov/datalakehouse.git
cd datalakehouse
docker-compose build
docker-compose up -d
```
- Minio Console: http://localhost:9001
- Trino UI: http://localhost:8083
- Kafka UI: http://localhost:9999
- Airflow UI: http://localhost:8090
- Vertica (SQL Access): `jdbc:vertica://localhost:5433/db`
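
For programmatic access to the marts, the same Vertica endpoint can be reached from Python. Below is a minimal sketch with the `vertica-python` driver (`pip install vertica-python`), assuming default `dbadmin` credentials and a hypothetical `marts.sales_summary` table; match both to your compose setup.

```python
# Minimal sketch of reading a mart from Vertica.
import vertica_python

conn_info = {
    "host": "localhost",
    "port": 5433,
    "user": "dbadmin",  # assumed default user
    "password": "",     # assumed empty default password
    "database": "db",
}
with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("SELECT * FROM marts.sales_summary LIMIT 10")  # hypothetical mart
    for row in cur.fetchall():
        print(row)
```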
```
📂 datalakehouse
├── 📂 airflow/            # Apache Airflow DAGs
├── 📂 spark/              # Spark Streaming jobs
├── 📂 kafka/              # Kafka configurations
├── 📂 trino/              # Trino catalogs
├── 📂 minio/              # Minio storage setup
├── 📂 nessie/             # Apache Nessie metadata
├── 📂 postgres/           # Postgres storage setup for Apache Iceberg metadata
├── 📂 vertica/            # Vertica storage setup for datamarts and OLAP queries
├── 📜 docker-compose.yaml # Docker environment
└── 📜 README.md           # Project documentation
```
🔹 Fork the repo & create a feature branch.
🔹 Submit a pull request with a clear description of your changes.