Nearly every streaming analytics system stores processed data somewhere for further analysis, historical reference, or integration with BI tools.
In this example project, we incorporate a relational data store. We use SQLite, but the example could be altered to work with another relational database such as MySQL or PostgreSQL, or even a NoSQL store such as MongoDB.
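The storage pattern is small enough to sketch. Below is a minimal, illustrative pair of functions for initializing a SQLite table and inserting one processed message; the file path, table name, and column names are assumptions for this sketch, not the exact ones used in this project's sqlite script.

```python
import sqlite3
from pathlib import Path

# Illustrative path and schema -- check the project's sqlite script for the real names.
DB_PATH = Path("data") / "buzz.sqlite"


def init_db(db_path: Path = DB_PATH) -> None:
    """Create the messages table if it does not already exist."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS streamed_messages (
                   id        INTEGER PRIMARY KEY AUTOINCREMENT,
                   message   TEXT,
                   author    TEXT,
                   timestamp TEXT
               )"""
        )
        conn.commit()
    finally:
        conn.close()


def insert_message(message: dict, db_path: Path = DB_PATH) -> None:
    """Insert one processed message (a dict) into the messages table."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "INSERT INTO streamed_messages (message, author, timestamp) VALUES (?, ?, ?)",
            (message.get("message"), message.get("author"), message.get("timestamp")),
        )
        conn.commit()
    finally:
        conn.close()
```

Swapping in a different store mostly means replacing these two functions, which is one reason the project keeps them in a separate sqlite script rather than inside the consumers.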
Recommended VS Code extensions:

- Black Formatter by Microsoft
- Markdown All in One by Yu Zhang
- PowerShell by Microsoft (on Windows Machines)
- Pylance by Microsoft
- Python by Microsoft
- Python Debugger by Microsoft
- Ruff by Astral Software (Linter)
- SQLite Viewer by Florian Klampfer
- WSL by Microsoft (on Windows Machines)
Before starting, ensure you have completed the setup tasks in https://github.com/denisecase/buzzline-01-case and https://github.com/denisecase/buzzline-02-case.
Versions matter. Python 3.11 is required. See the instructions for the required Java JDK and more.
Once the tools are installed, fork or copy this project into your GitHub account to create your own version that you can run and experiment with. Follow the instructions in FORK-THIS-REPO.md.
OR: For more practice, add these example scripts or features to your earlier project. You'll want to check requirements.txt, .env, and the consumers, producers, and utils folders. Use your README.md to record your workflow and commands.
Follow the instructions in MANAGE-VENV.md to:
- Create your .venv
- Activate .venv
- Install the required dependencies using requirements.txt.
If Zookeeper and Kafka are not already running, you'll need to restart them. See the instructions in SETUP-KAFKA.md to start the Zookeeper service and then the Kafka service, each in its own terminal.
This will take two more terminals:
- One to run the producer which writes messages.
- Another to run the consumer which reads messages, processes them, and writes them to a data store.
Start the producer to generate the messages. The existing producer writes messages to a live data file in the data folder. If Zookeeper and Kafka services are running, it will try to write them to a Kafka topic as well. For configuration details, see the .env file.
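Conceptually, the dual-write behavior looks like the sketch below. It assumes the kafka-python package from requirements.txt; the file path, topic name, and server address are placeholders, since the real values come from the .env file and the code in producers/producer_case.py.

```python
import json
from pathlib import Path

from kafka import KafkaProducer
from kafka.errors import KafkaError

# Placeholder settings -- the real values are read from .env by the project code.
LIVE_DATA_FILE = Path("data") / "project_live.json"
KAFKA_TOPIC = "buzzline"
KAFKA_SERVER = "localhost:9092"


def get_kafka_producer() -> KafkaProducer | None:
    """Return a Kafka producer, or None if the broker is unreachable."""
    try:
        return KafkaProducer(
            bootstrap_servers=KAFKA_SERVER,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
    except KafkaError:
        return None


def emit_message(message: dict, producer: KafkaProducer | None) -> None:
    """Always append to the live data file; also send to Kafka when it is available."""
    LIVE_DATA_FILE.parent.mkdir(exist_ok=True)
    with LIVE_DATA_FILE.open("a") as f:
        f.write(json.dumps(message) + "\n")
    if producer is not None:
        producer.send(KAFKA_TOPIC, value=message)
```

This is why the producer keeps working when Kafka is down: the file write happens regardless, and the Kafka send is simply skipped.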
In VS Code, open a NEW terminal. Use the commands below to activate .venv and start the producer.
Windows:

```shell
.venv\Scripts\activate
py -m producers.producer_case
```

Mac/Linux:

```shell
source .venv/bin/activate
python3 -m producers.producer_case
```
The producer will still work if Kafka is not available.
Start an associated consumer. You have two options.
- Start the consumer that reads from the live data file.
- OR Start the consumer that reads from the Kafka topic.
In VS Code, open a NEW terminal in your root project folder. Use the commands below to activate .venv and start the consumer.
Windows:

```shell
.venv\Scripts\activate
py -m consumers.kafka_consumer_case
```

OR

```shell
py -m consumers.file_consumer_case
```

Mac/Linux:

```shell
source .venv/bin/activate
python3 -m consumers.kafka_consumer_case
```

OR

```shell
python3 -m consumers.file_consumer_case
```
Review the requirements.txt file.
- What new requirements, if any, do we need for this project?
- Note that requirements.txt now lists both kafka-python and six.
- What are some common dependencies as we incorporate data stores into our streaming pipelines?
Review the .env file, which holds the environment variables (a loading sketch follows this list).
- Why is it helpful to put some settings in a text file?
- As we add database access and passwords, we start to keep two versions:
  - .env
  - .env.example
- Read the notes in those files. Which one is typically NOT added to source control?
- How do we ignore a file so it doesn't get published to GitHub? (Hint: .gitignore)
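To make the .env questions concrete, here is a minimal sketch of how a producer or consumer can load those settings at startup, assuming the python-dotenv package is listed in requirements.txt. The variable names are placeholders; check .env.example for the keys this project actually uses.

```python
import os

from dotenv import load_dotenv

# Read key=value pairs from the .env file into the process environment.
load_dotenv()

# Placeholder keys -- see .env.example for the real ones.
KAFKA_TOPIC = os.getenv("KAFKA_TOPIC", "buzzline")
KAFKA_SERVER = os.getenv("KAFKA_SERVER", "localhost:9092")
SQLITE_PATH = os.getenv("SQLITE_PATH", "data/buzz.sqlite")
```

Because the code reads only the key names, real values (including any passwords) stay in the untracked .env file, while .env.example documents the expected keys without secrets.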
Review the .gitignore file.
- What new entry has been added?
Review the code for the producer and the two consumers.
- Understand how the information is generated by the producer.
- Understand how the different consumers read, process, and store information in a data store.
Compare the consumer that reads from a live data file and the consumer that reads from a Kafka topic (a side-by-side sketch follows this list).
- Which functions are the same for both?
- Which parts are different?
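In broad strokes, only the message-acquisition loop differs between the two consumers; the processing and storage step can be shared. Here is a hedged sketch of that split, with illustrative function names rather than the project's actual ones:

```python
import json
import time
from pathlib import Path

from kafka import KafkaConsumer


def process_message(message: dict) -> None:
    """Shared step: transform the message and store it (e.g., via the SQLite functions sketched earlier)."""
    print(f"Storing: {message}")


def consume_from_file(live_file: Path) -> None:
    """Tail the live data file, processing each new JSON line as it appears."""
    with live_file.open("r") as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line.strip():
                process_message(json.loads(line))
            else:
                time.sleep(0.5)  # wait for the producer to write more


def consume_from_kafka(topic: str, server: str) -> None:
    """Read messages from a Kafka topic and process each one."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=server,
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for record in consumer:
        process_message(record.value)
```

The shared processing function is also the answer to the refactoring question further down: duplicated logic can be written once (for example in a utility module) and imported by both consumers.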
What files are in the utils folder?
- Why bother breaking functions out into utility modules? (A small example follows this list.)
- Would similar streaming projects be likely to take advantage of any of these files?
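As a small, hypothetical example of why utility modules pay off, a shared logger setup can be written once and imported by every producer and consumer (the module and function names here are illustrative, not necessarily the ones in this project's utils folder):

```python
# utils/utils_logger.py -- hypothetical module name, for illustration only.
import logging


def get_logger(name: str) -> logging.Logger:
    """Return a logger configured once and reusable by producers and consumers."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Any similar streaming project could reuse a module like this unchanged, which is the point of breaking such functions out.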
What files are in the producers folder?
- How do these compare to earlier projects?
- What has been changed?
- What has stayed the same?
What files are in the consumers folder?
- This is where the processing and storage take place.
- Why did we make a separate file for reading from the live data file vs. reading from the Kafka topic?
- What functions are in each?
- Are any of the functions duplicated?
- Can you refactor the project so we could write a duplicated function just once and reuse it?
- What functions are in the sqlite script?
- What functions might be needed to initialize a different kind of data store?
- What functions might be needed to insert a message into a different kind of data store? (A hypothetical PostgreSQL sketch follows this list.)
- Did you run the Kafka consumer or the live file consumer? Why?
- Can you use the examples to add a database to your own streaming applications?
- What parts are most interesting to you?
- What parts are most challenging?
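To make the "different kind of data store" questions concrete, here is a hypothetical PostgreSQL version of the initialize and insert functions, using the psycopg2 package (which is not in this project's requirements.txt); the connection settings, table, and columns are placeholders:

```python
import psycopg2  # hypothetical dependency -- not part of this project's requirements.txt

# Placeholder connection settings; in a real project these would come from .env.
PG_SETTINGS = dict(host="localhost", dbname="buzz", user="buzz_user", password="change-me")


def init_db() -> None:
    """Create the messages table in PostgreSQL if it does not already exist."""
    conn = psycopg2.connect(**PG_SETTINGS)
    try:
        with conn.cursor() as cur:
            cur.execute(
                """CREATE TABLE IF NOT EXISTS streamed_messages (
                       id        SERIAL PRIMARY KEY,
                       message   TEXT,
                       author    TEXT,
                       timestamp TEXT
                   )"""
            )
        conn.commit()
    finally:
        conn.close()


def insert_message(message: dict) -> None:
    """Insert one processed message into the PostgreSQL table."""
    conn = psycopg2.connect(**PG_SETTINGS)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO streamed_messages (message, author, timestamp) VALUES (%s, %s, %s)",
                (message.get("message"), message.get("author"), message.get("timestamp")),
            )
        conn.commit()
    finally:
        conn.close()
```

Note that only the connection handling and SQL placeholders change; the surrounding consumer logic can stay the same.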
When resuming work on this project:
- Open the folder in VS Code.
- Open a terminal and start the Zookeeper service. On Windows, remember to start WSL first.
- Open a terminal and start the Kafka service. On Windows, remember to start WSL first.
- Open a terminal to start the producer. Remember to activate your local project virtual environment (.venv).
- Open a terminal to start the consumer. Remember to activate your local project virtual environment (.venv).
To save disk space, you can delete the .venv folder when not actively working on this project. You can always recreate it, activate it, and reinstall the necessary packages later. Managing Python virtual environments is a valuable skill.
This project is licensed under the MIT License as an example project. You are encouraged to fork, copy, explore, and modify the code as you like. See the LICENSE file for more.