This project provides an isolated development environment for Spark, Kafka, and PySpark using a local JDK and a Python virtual environment.
It works on macOS, Linux, and Windows (via WSL).
Windows users: install Windows Subsystem for Linux (WSL) by following the instructions.
Then open WSL by opening a PowerShell terminal and running `wsl`:

```shell
wsl
```

Important: Run all remaining commands from within the WSL environment. Inside WSL, Windows users run the same commands as macOS and Linux users.
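Because the remaining commands assume a Linux shell, Windows users may want to confirm they are actually inside WSL before continuing. The following is a minimal sketch and not part of this repository: it assumes the WSL kernel identifies itself with the word "microsoft" in `/proc/version`, and the function name `in_wsl` is hypothetical.

```shell
# Hypothetical helper: succeed if the given kernel-version file mentions Microsoft.
# Defaults to /proc/version, which WSL kernels are assumed to populate with a
# string containing "microsoft" (the letter case varies between WSL versions).
in_wsl() {
  local version_file="${1:-/proc/version}"
  grep -qi 'microsoft' "$version_file" 2>/dev/null
}

if in_wsl; then
  echo "Running inside WSL"
else
  echo "Not running inside WSL"
fi
```

On macOS and native Linux the check simply reports that WSL is not in use, so it is harmless to run there.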
- Copy the template repo into your GitHub account. You can change the name as desired.
- Open a terminal in your "Projects" folder or wherever you keep your coding projects.
- Avoid using "Documents" or any folder that syncs automatically to OneDrive or other cloud services.
- Clone this repository into that folder. Windows users: clone into your default WSL directory.
If you changed the repository name, use that name in the command below.
For example, clone with a command like this one, substituting your GitHub account name and repo name:

```shell
git clone https://github.com/denisecase/pro-analytics-apache-starter
```

Then cd into your new folder (if you changed the name, use that name):

```shell
cd pro-analytics-apache-starter
```
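Since both the account name and the repository name may differ from the example, it can help to keep them in variables and derive the commands from them. This is only a sketch: `GITHUB_ACCOUNT`, `REPO_NAME`, and the `clone_url` helper are hypothetical names, and `your-account` is a placeholder. The block prints the commands rather than running them.

```shell
# Placeholders: replace with your own GitHub account and repository name.
GITHUB_ACCOUNT="your-account"
REPO_NAME="pro-analytics-apache-starter"

# Hypothetical helper: build the HTTPS clone URL from an account and a repo name.
clone_url() {
  printf 'https://github.com/%s/%s\n' "$1" "$2"
}

# Print the two commands so they can be reviewed before running them.
echo "git clone $(clone_url "$GITHUB_ACCOUNT" "$REPO_NAME")"
echo "cd $REPO_NAME"
```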
### All Platforms: Adjust Requirements (Packages Needed)
Review `requirements.txt` and comment or uncomment the specific packages needed for your project.
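After editing, it can be useful to see which packages are still active, that is, neither blank nor commented out. The sketch below assumes the usual `requirements.txt` convention that a line starting with `#` is a comment; the helper name `active_requirements` is hypothetical.

```shell
# Hypothetical helper: print requirement lines that are neither blank
# nor commented out. Defaults to requirements.txt in the current folder.
active_requirements() {
  grep -Ev '^[[:space:]]*(#|$)' "${1:-requirements.txt}"
}

# Only list packages if the file is present in the current folder.
if [ -f requirements.txt ]; then
  active_requirements
fi
```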
---
## Create Virtual Environment
```shell
python3 -m venv .venv
source .venv/bin/activate
```

Important reminder: Always run `source .venv/bin/activate` before working on the project.

Then upgrade the packaging tools and install the project requirements:

```shell
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install --upgrade -r requirements.txt
```
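A quick way to tell whether the virtual environment is active in the current terminal is to look at the `VIRTUAL_ENV` variable, which the venv `activate` script exports. The following is a minimal sketch; the helper name `venv_active` is hypothetical.

```shell
# Hypothetical helper: succeed if a virtual environment is active in this shell.
# Relies on the VIRTUAL_ENV variable exported by the venv activate script.
venv_active() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

if venv_active; then
  echo "Virtual environment active: $VIRTUAL_ENV"
else
  echo "No virtual environment active; run: source .venv/bin/activate"
fi
```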
Verify compatible versions. See instructions in the file. Then install the necessary OpenJDK locally:

```shell
./01-setup/download-jdk.sh
```
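When checking version compatibility, the number that usually matters is the Java major version. The sketch below extracts it from a version string of the kind printed by `java -version` (for example `17.0.9`, or the legacy form `1.8.0_392`). The helper name `java_major` is hypothetical, and the check inspects whichever `java` is first on your PATH, which may or may not be the locally downloaded JDK. Which versions count as compatible is not restated here; follow the project's own instructions for that.

```shell
# Hypothetical helper: print the major version for a Java version string.
# Handles modern strings (17.0.9 -> 17) and legacy ones (1.8.0_392 -> 8).
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
  esac
  # Keep everything before the first '.', '_' or '-'.
  printf '%s\n' "${v%%[._-]*}"
}

# Report the major version of the java found on PATH, if there is one.
if command -v java >/dev/null 2>&1; then
  raw="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  echo "Java major version: $(java_major "$raw")"
fi
```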
Use the commands below to install only the tools your project requires:

```shell
./01-setup/install-kafka.sh
./01-setup/install-pyspark.sh
```
Start the Kafka service (keep this terminal running):

```shell
./02-scripts/run-kafka.sh
```
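The broker can take a few seconds to start accepting connections, so commands issued from a second terminal may fail if they are run too early. The sketch below polls the port until it responds. It is bash-specific (it uses the `/dev/tcp` redirection and C-style `for` loops), the helper names `retry` and `port_open` are hypothetical, and `localhost:9092` is taken from the commands in this README.

```shell
# Hypothetical helper: run a command up to N times, pausing between attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep only if another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device inside a subshell so the socket is
# closed again as soon as the check finishes.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example (not run here): wait up to about 30 seconds for the broker.
# retry 30 1 port_open localhost 9092 && echo "Kafka is accepting connections"
```

An open port only shows that something is listening; it does not prove the broker is fully ready, so treat this as a convenience check.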
In a second terminal, create a Kafka topic:

```shell
./kafka/bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
```

In that second terminal, list Kafka topics:

```shell
./kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```
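If you choose a topic name other than `test-topic`, it may be worth checking the name before passing it to `kafka-topics.sh`. The sketch below assumes Kafka's usual naming rules as I understand them: ASCII letters, digits, `.`, `_`, and `-` only, at most 249 characters, and not `.` or `..`. The helper name `valid_topic_name` is hypothetical, and the tool itself remains the authority on what it accepts.

```shell
# Hypothetical helper: succeed if the name looks like a legal Kafka topic name.
# Assumed rules: 1 to 249 characters from [A-Za-z0-9._-], and not "." or "..".
valid_topic_name() {
  local name="$1"
  [ "$name" != "." ] && [ "$name" != ".." ] || return 1
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 249 ] || return 1
  case "$name" in
    *[!A-Za-z0-9._-]*) return 1 ;;   # contains a character outside the allowed set
  esac
  return 0
}

for candidate in "test-topic" "bad topic"; do
  if valid_topic_name "$candidate"; then
    echo "ok: $candidate"
  else
    echo "invalid: $candidate"
  fi
done
```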
In that second terminal, stop the Kafka service when you are done working with Kafka. Use whichever command works:

```shell
./kafka/bin/kafka-server-stop.sh
pkill -f kafka
```
Start PySpark (leave this terminal running):

```shell
./02-scripts/run-pyspark.sh
```

In a second terminal, test Spark:

```shell
python 02-scripts/test-spark.py
```

Use that second terminal to stop the service when done:

```shell
pkill -f pyspark
```