FINAM_HOTEL_REVIEWS

Project Overview

This project applies text classification and topic modeling to hotel review text. The goal is to preprocess, analyze, and cluster the reviews using Natural Language Processing (NLP) techniques such as TF-IDF, embeddings, clustering, and LDA.

Directory Structure

FINAM_TOPIC_MODELLING/
│── .venv/                   # Virtual environment
│── data/
│   ├── interim/             # Intermediate processed data
│   ├── processed/           # Final processed data
│   ├── raw/                 # Raw datasets
│── models/                  # Trained models
│── notebooks/               # Jupyter Notebooks for analysis
│── reports/                 # Reports and visualizations
│── .gitignore               # Git ignore file
│── pyproject.toml           # PDM project configuration
│── pdm.lock                 # Pinned dependencies
│── requirements.txt         # Dependencies for the project
│── README.md                # Project documentation
│── solution.md              # Project findings and insights

Installation

This project uses PDM for dependency management. To set up the environment:

pdm install

If you prefer pip, install the dependencies from requirements.txt:

pip install -r requirements.txt

Data Sources

Raw Data: Located in data/raw/
Processed Data: Cleaned and transformed data in data/processed/
Intermediate Data: Stored in data/interim/ during preprocessing
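
As a small illustration of the layout above, the sketch below builds paths for the three data stages with pathlib. The directory names come from this README; the `stage_path` helper and the `reviews.csv` file name are hypothetical and only serve the example.

```python
from pathlib import Path

# Directory names are taken from the README; everything else is illustrative.
DATA_DIR = Path("data")
STAGES = ("raw", "interim", "processed")


def stage_path(stage, filename):
    """Build a path such as data/interim/<filename> for a known stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return DATA_DIR / stage / filename


# "reviews.csv" is a made-up file name, not a file known to exist in the repo.
print(stage_path("raw", "reviews.csv").as_posix())
```

Keeping the stage names in one tuple means a typo in a stage name fails early instead of silently writing to a new directory.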

Notebooks

The notebooks/ directory contains step-by-step analyses:

00_first_small_eda.ipynb - Initial exploratory data analysis
01_cleaning_lemmatization.ipynb - Text preprocessing pipeline
02_process_embedding.ipynb - Generating embeddings
03_clustering_analyses.ipynb - Clustering the reviews
04_topic_modelling_with_lda.ipynb - LDA topic modeling
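
To give a rough idea of what a clustering step like the one in 03_clustering_analyses.ipynb involves, here is a pure-Python K-Means sketch (Lloyd's algorithm) on toy 2-D vectors standing in for review embeddings. The notebook's actual code is not reproduced here, so the `kmeans` function, the fixed initialization, and the sample points are all assumptions made for illustration; a real run would more likely use scikit-learn's `KMeans`.

```python
import math


def kmeans(points, k, n_iter=10):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy 2-D "embeddings": two obvious groups near (0, 0) and (5, 5).
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

The deterministic initialization keeps the example reproducible; library implementations typically use randomized schemes such as k-means++ instead.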

Model Training

Trained models are stored in models/. The workflow includes:

Text preprocessing (tokenization, stopword removal, lemmatization)
Feature extraction (TF-IDF, Word2Vec, BERT embeddings)
Clustering techniques (K-Means, HDBSCAN)
Topic modeling using LDA (Latent Dirichlet Allocation)
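
The first two workflow steps can be sketched in a few lines of standard-library Python. This is a minimal illustration of tokenization, stopword removal, and textbook TF-IDF, not the project's implementation: the stopword list, function names, and sample reviews are invented for the example, and lemmatization is left out. In practice a library class such as scikit-learn's `TfidfVectorizer`, which applies smoothing and normalization, is the more likely choice.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK or spaCy lists are larger).
STOPWORDS = {"the", "a", "an", "was", "is", "and", "but", "very", "to", "of", "in"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up sample reviews, used only to exercise the functions.
reviews = [
    "The room was clean and the staff was friendly",
    "Friendly staff and a very clean room",
    "The breakfast was cold and the wifi was slow",
]
vectors = tfidf(reviews)
# The first two reviews share their vocabulary; the third shares none of it.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Vectors like these are what the clustering and topic-modeling steps consume; with this idf variant, a term that appears in every document gets weight zero.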

Running the Project

To execute preprocessing and modeling:

pdm run python scripts/preprocess.py
pdm run python scripts/train_model.py

Results & Reports

reports/html/ contains interactive visualizations
reports/ipynb/ contains Jupyter-based reports

Dependencies

Python 3.x
pandas, numpy, scikit-learn
nltk, spacy, gensim
pyLDAvis, seaborn, matplotlib
transformers, torch (for embeddings)

Contribution

Clone the repository:

git clone https://github.com/your-repo/FINAM_TOPIC_MODELLING.git

Create a new branch:
```
git checkout -b feature-branch
```

Stage and commit your changes, then push:

git add .
git commit -m "Added new feature"
git push origin feature-branch

Submit a pull request.

License

MIT License
