Project: PySpark Data Processing Toolkit

This repository contains a collection of PySpark scripts designed to handle various data processing and analysis tasks. These scripts cover topics like sequential numbering, first purchase date determination, data reshaping, rolling averages, missing value detection, handling mixed delimiters, and missing data imputation. Each script is a standalone solution, demonstrating how PySpark can be used to solve practical data engineering problems.

Detailed Descriptions

1.Sequential Numbering by Group: Uses the row_number() function within a window specification to generate sequential numbers for each row in a group, ordered by date. This is useful for scenarios like ranking or assigning unique IDs within grouped data.

2.First Purchase Date Calculation: Aggregates user purchase data to find the earliest purchase date for each user, demonstrating how to group and summarize data.

3.Pivoting Sales Data: Transforms sales data from wide format (multiple columns for each month) to a long format, making it easier to analyze monthly trends.

4.7-Day Rolling Average Calculation: Demonstrates how to calculate rolling averages over a window of time for each product, which is essential for time-series analysis.

5.Finding Missing Numbers in a Sequence: Identifies gaps in a numeric sequence, helping to detect missing records in a dataset.

6.Handling Mixed Delimiters: Shows how to split and process data with multiple delimiters in the same row, which is common in messy or inconsistent datasets.

7.Imputing Missing Sales Amounts: Demonstrates how to handle missing values by calculating the average and using it to replace None values, ensuring data completeness.

How to Use Each Script

1.Sequential Numbering by Group (sequentail.py): Run the script to generate sequential numbers for rows within grouped data.

2.First Purchase Date Calculation (purchase.py): Run this script to calculate the first purchase date for each user in your dataset. Pivoting Sales Data (pivot.py): Use this script to convert wide sales data into a long format, creating one row per product-month combination.

3.7-Day Rolling Average Calculation (moving-avg.py): Calculate a rolling average for sales quantities over a 7-day window.

4.Finding Missing Numbers (missing.py): Detect missing numbers in a sequence, helping to identify gaps in numeric data.

5.Handling Mixed Delimiters (delimiter.py): Split and process rows with mixed delimiters using custom logic.

6.Imputing Missing Sales Amounts (avg-sales-amount.py): Replace missing values in sales data with the average amount, ensuring data consistency.

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Avg-sales-amount.py		Avg-sales-amount.py
Delimiter.py		Delimiter.py
Missing.py		Missing.py
Moving-avg.py		Moving-avg.py
Pivot.py		Pivot.py
Purchase.py		Purchase.py
README.md		README.md
sequential.py		sequential.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project: PySpark Data Processing Toolkit

Contents

Detailed Descriptions

How to Use Each Script

License

About

Releases

Packages

Languages

divithraju/divith-raju-Pyspark-work

Folders and files

Latest commit

History

Repository files navigation

Project: PySpark Data Processing Toolkit

Contents

Detailed Descriptions

How to Use Each Script

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages