Data Migration Project

Overview

This application consists of a simple pipeline that performs a data migration. It takes CSV files with specific schemas as input and inserts their content into Hive tables stored in Avro format. PySpark and Airflow are used throughout the process.
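Below is a minimal sketch of the core step, assuming hypothetical file, table, and column names (the real schemas live in this repository's scripts):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = (
    SparkSession.builder
    .appName("data-migration")
    .enableHiveSupport()  # required to write into Hive tables
    .getOrCreate()
)

# Read a CSV file with an explicit schema, since the input files are
# expected to follow a specific one (columns here are placeholders).
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("hire_date", StringType(), True),
])
df = spark.read.csv("input/employees.csv", header=True, schema=schema)

# Append the rows into an existing Hive table; Hive handles the Avro
# serialization based on the table definition.
df.write.insertInto("migration.employees")
```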

Platforms and tools

The data is stored in Hive on Hadoop. The three tables used for the migration can be created with the scripts in the create_table_scripts folder.
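As an illustration, the DDL below shows the kind of statement such a script might contain; the database, table, and column names are placeholders, not the actual definitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-tables").enableHiveSupport().getOrCreate()

# Illustrative Hive DDL: an Avro-backed table of the kind the migration
# writes into (the real definitions are in create_table_scripts).
spark.sql("CREATE DATABASE IF NOT EXISTS migration")
spark.sql("""
    CREATE TABLE IF NOT EXISTS migration.employees (
        id        INT,
        name      STRING,
        hire_date STRING
    )
    STORED AS AVRO
""")
```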

The pipeline runs on Airflow. A Spark operator reads the data from the CSV files and performs a data type check; once compatibility is confirmed, the data is inserted into the corresponding tables.
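A sketch of how this wiring could look is shown below, assuming the migration logic lives in a hypothetical migrate.py submitted through the Apache Spark provider's SparkSubmitOperator; the DAG id, task id, and paths are illustrative, not taken from this repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="data_migration",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 12 * * *",  # every day at 12 noon
    catchup=False,
) as dag:
    # The submitted job reads the CSVs, checks the data types, and
    # inserts compatible data into the Hive tables.
    migrate = SparkSubmitOperator(
        task_id="migrate_csv_files",
        application="migrate.py",
    )
```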

Scheduling and file handling

The pipeline runs every day at 12 noon and checks for new files in the input/ folder. After a file is read, if its data types were compatible and the data was inserted, the file is moved to the processed/ folder; if the data types were incorrect, the file is moved to the skipped/ folder.
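The sketch below illustrates one way this per-file routing could be implemented, reusing the hypothetical schema and table names from above; the actual logic in this repository may differ:

```python
import shutil
from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("data-migration").enableHiveSupport().getOrCreate()

# Expected schema for one file type (hypothetical columns).
EXPECTED = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("hire_date", StringType(), True),
])

for path in Path("input").glob("*.csv"):
    df = spark.read.csv(str(path), header=True, inferSchema=True)
    if df.schema == EXPECTED:
        # Compatible types: insert the data, then archive the file.
        df.write.insertInto("migration.employees")
        shutil.move(str(path), str(Path("processed") / path.name))
    else:
        # Incompatible types: set the file aside without inserting.
        shutil.move(str(path), str(Path("skipped") / path.name))
```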

Stack / Technologies

  • PySpark
  • SQL
  • Hadoop
  • Airflow
  • CSV
  • Avro
