You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
EMR + Hadoop to Redshift ELT workflow using spark steps API and orchestrated by Apache-Airflow, which ingests disparate datasets focused around 7Gb of I94 arrivals information to produce a simple star schema in Redshift
A Cloud based Reddit stock sentiment analyzer that analyzes overall sentiment from a configurable selection of stock subreddits for each stock. The architecture utilizes AWS MSK (Kafka), AWS EMR (PySpark) and AWS Lambda (Python 3) for maximum scalability and the OpenAI API for sentiment analysis through prompt engineering.
Data Pipeline Analytics Platform is an end-to-end generic Big Data pipeline. Involves following tech stack: AWS S3, AWS Redshift, AWS EMR Cluster, Apache Spark, Apache Airflow.
ETL Pipeline extracts JSON files from AWS S3 bucket and transforms these using an AWS EMR Spark Cluster and stores the data into an AWS S3 bucket in parquet file format.
Built a data model, data warehouse and pipeline for extracting transforming and loading data into a star schema-based data model in a redshift database