This project focuses on text classification and topic modeling for financial review data. The goal is to preprocess, analyze, and cluster financial reviews using Natural Language Processing (NLP) techniques such as TF-IDF, embeddings, clustering, and LDA.
FINAM_TOPIC_MODELLING/
│── .venv/ # Virtual environment
│── data/
│ ├── interim/ # Intermediate processed data
│ ├── processed/ # Final processed data
│ ├── raw/ # Raw datasets
│── models/ # Trained models
│── notebooks/ # Jupyter Notebooks for analysis
│── reports/ # Reports and visualizations
│── .gitignore # Git ignore file
│── pyproject.toml # PDM project configuration
│── pdm.lock # Pinned dependencies
│── requirements.txt # Dependencies for the project
│── README.md # Project documentation
│── solution.md # Project findings and insights
This project uses PDM for dependency management. To set up the environment:
pdm install
If you prefer using pip, install dependencies from requirements.txt:
pip install -r requirements.txt
- Raw Data: Located in data/raw/
- Processed Data: Cleaned and transformed data in data/processed/
- Intermediate Data: Stored in data/interim/ during preprocessing
The notebooks/ directory contains step-by-step analyses:
- 00_first_small_eda.ipynb - Initial exploratory data analysis
- 01_cleaning_lemmatization.ipynb - Text preprocessing pipeline
- 02_process_embedding.ipynb - Generating embeddings
- 03_clustering_analyses.ipynb - Clustering financial reviews
- 04_topic_modelling_with_lda.ipynb - LDA topic modeling
Trained models are stored in models/. The workflow includes:
- Text preprocessing (tokenization, stopword removal, lemmatization)
- Feature extraction (TF-IDF, Word2Vec, BERT embeddings)
- Clustering techniques (K-Means, HDBSCAN)
- Topic modeling using LDA (Latent Dirichlet Allocation)
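The first two workflow steps can be illustrated with a small, dependency-free sketch. It is not the code from the notebooks: the sample reviews, the tiny stopword list, and the helper names `tokenize`, `tfidf`, and `cosine` are all assumptions made for illustration, and the weighting is the plain `tf * log(N / df)` form rather than the smoothed variant that scikit-learn's `TfidfVectorizer` applies. Lemmatization is omitted because it needs a library such as spaCy or nltk.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use
# the stopword lists shipped with nltk or spaCy.
STOPWORDS = {"the", "a", "an", "is", "was", "are", "and", "to", "of", "in", "very"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical reviews, not taken from the project's dataset.
reviews = [
    "The broker fees are very high",
    "High fees and slow withdrawal",
    "Support was friendly and fast",
]
vectors = tfidf(reviews)

# Reviews that share vocabulary ("fees", "high") end up closer together,
# which is the signal a clustering step such as K-Means relies on.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

In this sketch the first two reviews have a positive similarity because they share terms, while the first and third share none. A clustering algorithm would use the same kind of vector geometry, typically over much larger TF-IDF or embedding matrices.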
To execute preprocessing and modeling:
pdm run python scripts/preprocess.py
pdm run python scripts/train_model.py
- reports/html/ contains interactive visualizations
- reports/ipynb/ contains Jupyter-based reports
- Python 3.x
- pandas, numpy, scikit-learn
- nltk, spacy, gensim
- pyLDAvis, seaborn, matplotlib
- transformers, torch (for embeddings)
- Clone the repository:
git clone https://github.com/your-repo/FINAM_TOPIC_MODELLING.git
- Create a new branch:
git checkout -b feature-branch
- Commit changes and push:
git commit -m "Added new feature"
git push origin feature-branch
- Submit a pull request.
MIT License