🏗️ Data Lakehouse with Modern Technologies

(Architecture diagram)

📌 Overview

This project demonstrates how to build a Data Lakehouse using modern open-source technologies.
It integrates Kafka, Spark Streaming, Apache Iceberg, Apache Nessie, Trino, and DBT into an end-to-end data pipeline.

🔹 Streaming Layer: Spark Streaming / Flink processes real-time Kafka events.
🔹 Batch Layer: Apache Spark and Airbyte handle batch ingestion into a MinIO (S3-compatible) data lake.
🔹 Metadata Layer: Apache Nessie is used as the Iceberg catalog, backed by PostgreSQL.
🔹 Transformation Layer: Trino and DBT transform and store data in Vertica for analytics.
🔹 Orchestration: Apache Airflow schedules and manages all ETL/ELT workflows.
🔹 Synthetic Data: Generated with the sdv Python library from the SalesDB_v1 dataset (see the sketch after this list).
🔹 Visualization: Dashboards built in Tableau, Superset, or Power BI.

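As an illustration of the synthetic-data step, the sketch below uses the SDV 1.x single-table API to fit a synthesizer on one SalesDB_v1 table and sample new rows. The file paths and row count are placeholders, not the project's actual generation script.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# File paths and the row count below are placeholders
real = pd.read_csv("data/SalesDB_v1/orders.csv")

# Infer column types from the real table
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

# Fit a synthesizer and sample new rows with the same statistical shape
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)
synthetic = synth.sample(num_rows=100_000)
synthetic.to_csv("data/generated/orders.csv", index=False)
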
🚀 Tech Stack

Category       | Technology
Streaming      | Kafka, Spark Streaming, Flink
Storage        | MinIO (S3) with Apache Iceberg
Metadata       | Apache Nessie, PostgreSQL
ETL/ELT        | Airflow, DBT, Apache Spark
Query Engine   | Trino, Vertica
Orchestration  | Apache Airflow
Dashboarding   | Tableau, Superset, Power BI

🎯 Architecture

1️⃣ Data Ingestion

  • Streaming events flow from Kafka into Spark Streaming / Flink (see the ingestion sketch below).
  • Batch data is ingested via Airbyte into MinIO as Iceberg tables.

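The snippet below is a minimal PySpark Structured Streaming sketch of this flow: it reads a Kafka topic and appends the raw payload to an Iceberg table registered in the Nessie catalog. Service hostnames, the topic name, and the warehouse path are placeholders, and the Iceberg and Nessie Spark packages are assumed to be on the classpath.

from pyspark.sql import SparkSession

# Catalog name, hostnames, topic, and warehouse path are illustrative
spark = (
    SparkSession.builder
    .appName("kafka-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# Read raw events from a Kafka topic as an unbounded stream
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "sales_events")
    .load()
)

# Append the raw payload to an Iceberg table tracked by the Nessie catalog
# (the nessie.raw namespace is assumed to exist already)
(
    events.selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value",
                      "timestamp")
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/raw_sales_events")
    .toTable("nessie.raw.sales_events")
    .awaitTermination()
)

Because the stream writes through the Nessie catalog, every commit to the table is versioned by Nessie.
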
2️⃣ Storage & Metadata Management

  • Raw (raw) and operational (ods) layers are stored in MinIO as Apache Iceberg tables.
  • Apache Nessie acts as the metadata catalog, tracking table versions and schema changes (see the branching sketch below).

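As a sketch of what Nessie adds on top of Iceberg, the commands below create an isolated branch, evolve the catalog on it, and merge it back. They assume the SparkSession from the ingestion sketch above (catalog named nessie, Nessie SQL extensions enabled); branch and table names are illustrative.

# Work on an isolated branch so changes do not touch main
spark.sql("CREATE BRANCH IF NOT EXISTS etl_dev IN nessie")
spark.sql("USE REFERENCE etl_dev IN nessie")

# Evolve the catalog on the branch
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.ods")
spark.sql("""
    CREATE TABLE IF NOT EXISTS nessie.ods.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP,
        amount      DECIMAL(12, 2)
    ) USING iceberg
""")

# Publish the changes by merging the branch back into main
spark.sql("MERGE BRANCH etl_dev INTO main IN nessie")
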
3️⃣ Transformation & Querying

  • DBT models run on Trino to handle transformations (see the query sketch below).
  • Processed data marts (marts) are stored in Vertica for BI consumption.

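For illustration, the snippet below runs against Trino with the trino Python client and computes the kind of aggregate a DBT model would materialize into a mart. Host, port, user, catalog, schema, and table name are assumptions, not the project's actual configuration.

from trino.dbapi import connect

# Connection details below are illustrative
conn = connect(host="trino", port=8080, user="analytics", catalog="nessie", schema="ods")
cur = conn.cursor()

# The kind of aggregation a DBT model would materialize into a mart table
cur.execute("""
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id
""")

for customer_id, orders, revenue in cur.fetchall():
    print(customer_id, orders, revenue)
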
4️⃣ Orchestration & Visualization

  • Apache Airflow schedules all ETL/ELT jobs (a minimal DAG sketch follows this list).
  • Dashboards are created using Tableau, Superset, or Power BI.

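The DAG below is a minimal Airflow sketch of this orchestration; the task names, paths, and commands are illustrative rather than the project's actual DAG code (the real DAGs live in airflow/).

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Task names, paths, and commands are placeholders
with DAG(
    dag_id="lakehouse_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="spark_batch_ingest",
        bash_command="spark-submit /opt/spark/jobs/batch_ingest.py",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --profiles-dir /opt/dbt --project-dir /opt/dbt",
    )

    # Run ingestion before transformations
    ingest >> transform
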
📸 Screenshots

🔹 Kafka UI (Monitor real-time events)

🔹 S3 Minio Browser (View stored Iceberg tables)
🔹 Apache Nessie UI (Track schema changes)
🔹 Airflow DAGs (Monitor ETL workflows)
🔹 DBT Lineage Graph (View transformation dependencies)
🔹 BI Dashboards (Analytics & insights)


🏗️ Setup & Deployment

This project is fully containerized using Docker Compose.

🔧 Prerequisites

  • Docker & Docker Compose
  • Python (for data generation)

📥 Clone Repository

git clone https://github.com/Turakulov/datalakehouse.git
cd datalakehouse

▶️ Start the Environment

docker-compose build

docker-compose up -d

🔎 Verify Services

Run docker-compose ps to confirm that all containers are healthy, then open the web UIs (Kafka UI, the MinIO browser, Nessie, and Airflow) on the ports defined in docker-compose.yaml.

🛠️ Project Structure

📂 datalakehouse
├── 📂 airflow/        # Apache Airflow DAGs
├── 📂 spark/          # Spark Streaming jobs
├── 📂 kafka/          # Kafka configurations
├── 📂 trino/          # Trino catalogs
├── 📂 minio/          # Minio storage setup
├── 📂 nessie/         # Apache Nessie metadata
├── 📂 postgres/       # Postgres storage setup for Apache Iceberg metadata
├── 📂 vertica/        # Vertica storage setup for datamarts and OLAP queries
├── 📜 docker-compose.yaml  # Docker environment
└── 📜 README.md       # Project documentation

📌 Contributing

🔹 Fork the repo & create a feature branch.
🔹 Submit a pull request with a detailed description of your changes.