# pro-analytics-apache-starter

This project provides an isolated development environment for Spark, Kafka, and PySpark using a locally installed JDK and a Python virtual environment.
It works across macOS, Linux, and Windows (via WSL).


## Getting Started

### Windows Users Only: Set Up WSL

Install Windows Subsystem for Linux (WSL) by following the official installation instructions.

Open a PowerShell terminal and launch WSL:

```shell
wsl
```

Important: Run all remaining commands from within the WSL environment. When working in WSL, we use the same commands as macOS/Linux users.

### All Platforms: Clone This Repo

1. Copy the template repo into your GitHub account. You can change the name as desired.
2. Open a terminal in your "Projects" folder or wherever you keep your coding projects.
3. Avoid "Documents" or any folder that syncs automatically to OneDrive or other cloud services.
4. Clone this repository into that folder (Windows users: clone into your default WSL directory).

In the commands below, use your own GitHub account name and repository name if you changed it. For example:

```shell
git clone https://github.com/denisecase/pro-analytics-apache-starter
```

Then cd into your new folder (again, use the new name if you changed it):

```shell
cd pro-analytics-apache-starter
```



### All Platforms: Adjust Requirements (Packages Needed)  
Review requirements.txt and comment or uncomment packages so that only the ones your project needs are installed.
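For instance, a trimmed-down requirements.txt might look like the excerpt below. The package names are illustrative only; keep whatever the actual file already lists and comment out the rest.

```
# Illustrative excerpt - keep only what your project needs
kafka-python        # uncommented: this project streams through Kafka
# pandas            # commented out: not needed here
# matplotlib        # commented out: not needed here
```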

---

## Create Virtual Environment

```shell
python3 -m venv .venv
source .venv/bin/activate
```

Important Reminder: Always run `source .venv/bin/activate` before working on the project.

## Install Requirements

```shell
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install --upgrade -r requirements.txt
```

## Install JDK

Verify compatible versions (see the instructions in the setup script), then install the necessary OpenJDK locally:

```shell
./01-setup/download-jdk.sh
```

## Install Apache Tools (As Needed)

Use the commands below to install only the tools your project requires:

```shell
./01-setup/install-kafka.sh
./01-setup/install-pyspark.sh
```

## Example: Using Apache Kafka

Start the Kafka service (keep this terminal running):

```shell
./02-scripts/run-kafka.sh
```

In a second terminal, create a Kafka topic:

```shell
./kafka/bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
```

In that second terminal, list Kafka topics:

```shell
./kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```
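To confirm that messages actually flow through the topic, you can run a quick producer/consumer round trip from that second terminal. The sketch below is a minimal example (the file name is just a suggestion), assuming the kafka-python package is installed in your .venv and the broker is still running on localhost:9092:

```python
# kafka_smoke_test.py - minimal round trip against the local broker
# Assumes the kafka-python package is installed in the active .venv
from kafka import KafkaProducer, KafkaConsumer

# Send one message to the topic created above
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from kafka-python")
producer.flush()

# Read the topic from the beginning; stop after 5 seconds of silence
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value.decode("utf-8"))
```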

When you are done working with Kafka, stop the service from that second terminal. Use whichever command works:

```shell
./kafka/bin/kafka-server-stop.sh
```

or

```shell
pkill -f kafka
```

## Example: Using PySpark

Start PySpark (leave this terminal running):

```shell
./02-scripts/run-pyspark.sh
```

In a second terminal, test Spark:

```shell
python 02-scripts/test-spark.py
```
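If you want to see roughly what such a smoke test does, the sketch below is a minimal, self-contained check; the actual contents of 02-scripts/test-spark.py may differ:

```python
# Minimal Spark smoke test (illustrative; not necessarily the repo's test-spark.py)
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("spark-smoke-test").getOrCreate()

# Build a tiny DataFrame and show it to prove the session works
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()

print("Spark version:", spark.version)
spark.stop()
```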

Use that second terminal to stop the service when done:

```shell
pkill -f pyspark
```
