This repository contains the project I did as a part of the coursework for ECS784P - Data Analytics.
The assignment required me to address a data-related problem in a chosen field of interest by applying at least two data analytics techniques from a given list of machine learning algorithms.
- Linear, non-linear and logistic regression
- Support vector classification or regression
- Decision trees, with or without random forest
- KNN
- k-means
- GMMs
"How accurately can machine learning models predict startup profitability based on funding, market share, and other business factors?"
Given the high failure rates of startups and the increasing adoption of ML in financial prediction, this study focuses on predicting startup profitability—a key financial outcome that directly influences long-term survival. Unlike previous studies that broadly assess startup success, this research aims to provide a more precise financial evaluation by examining the ability of startups to generate profit based on factors such as market share, funding, revenue, and operational scale. By identifying the most influential predictors of profitability, this study seeks to enhance data-driven decision-making for investors, entrepreneurs, and market analysts.
The dataset used was Startup Growth & Funding Trends, which included features such as funding rounds, funding amount, valuation, revenue, employees, market share, industry, and exit status. The target variable was profitability (binary: 0 = not profitable, 1 = profitable).
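A minimal sketch of how such a dataset can be prepared for modelling. The column names below are assumptions based on the feature list above, not the actual headers of `startup_data.csv`, and the rows are invented for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the assumed schema of startup_data.csv;
# the real column names and values may differ.
df = pd.DataFrame({
    "Funding Rounds": [3, 1, 5, 2],
    "Funding Amount (M USD)": [12.5, 1.2, 80.0, 5.5],
    "Revenue (M USD)": [8.0, 0.5, 55.0, 2.0],
    "Employees": [45, 10, 300, 25],
    "Market Share (%)": [2.1, 0.3, 7.8, 1.0],
    "Industry": ["FinTech", "EdTech", "AI", "FinTech"],
    "Profitable": [1, 0, 1, 0],   # binary target: 1 = profitable
})

# One-hot encode the categorical Industry column; numeric features stay as-is
X = pd.get_dummies(df.drop(columns=["Profitable"]), columns=["Industry"])
y = df["Profitable"]
```
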
- **Logistic Regression**
  - **Interpretable and efficient** – a simple linear classification model that provides clear insight into feature importance.
  - **Baseline model** – serves as a strong benchmark to compare against more complex algorithms.
  - **Works well for binary classification** – since profitability is a binary outcome (Profitable = 1, Not Profitable = 0), Logistic Regression is well suited to the task.
  - **Handles small datasets well** – given the dataset size (500 startups), Logistic Regression trains efficiently with little risk of overfitting.
  - **Compared to k-means and GMMs (unsupervised models)** – these are clustering algorithms and are not suitable for direct classification tasks.
  - **Compared to single Decision Trees** – Logistic Regression is less prone to overfitting, making it the better baseline model.
- **Random Forest**
  - **Captures non-linear relationships** – unlike Logistic Regression, Random Forest can model complex interactions between the financial factors affecting profitability.
  - **Less prone to overfitting than single decision trees** – averaging many trees makes it more robust than an individual decision tree.
  - **Handles mixed data well** – works with both continuous and categorical variables, which suits the dataset's financial and market-related features.
  - **Feature importance ranking** – reveals which financial indicators are most influential, helping refine feature selection.
  - **Compared to Support Vector Machines (SVM)** – SVMs require careful parameter tuning and scale poorly with larger feature sets, making Random Forest the more practical choice.
  - **Compared to k-Nearest Neighbours (KNN)** – KNN struggles with high-dimensional data and large datasets, whereas Random Forest performs well in such cases.
  - **Compared to linear and non-linear regression** – regression models predict continuous values, whereas this task calls for classification.
The models were trained using an 80:20 train-test split, and performance was evaluated using accuracy, AUC score, precision, recall, and F1-score.
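The training and evaluation protocol above can be sketched as follows. This is an illustrative reproduction of the pipeline, using synthetic data in place of the real startup features:

```python
# Sketch of the 80:20 split and multi-metric evaluation described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 500-startup dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # 80:20 split

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]   # scores for AUC
    print(type(model).__name__,
          f"acc={accuracy_score(y_test, pred):.3f}",
          f"auc={roc_auc_score(y_test, proba):.3f}",
          f"prec={precision_score(y_test, pred):.3f}",
          f"rec={recall_score(y_test, pred):.3f}",
          f"f1={f1_score(y_test, pred):.3f}")
```
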
The project aimed to evaluate whether machine learning models can effectively classify startups as profitable or unprofitable using key financial and operational metrics.
- Identify the key variables influencing startup profitability predictions.
- Develop and assess two machine learning models to classify startups as profitable or unprofitable.
- Conduct Exploratory Data Analysis (EDA) to uncover patterns, correlations, and distributions.
- Utilise Python libraries such as Pandas, Seaborn, and Matplotlib to visualise trends and present findings effectively.
- Evaluate model performance using AUC as the primary metric, supplemented by accuracy, recall, and F1-score for a comprehensive analysis.
- Analyse the practical implications for investors and entrepreneurs, proposing refinements to enhance predictive accuracy.
```
.
├── .gitattributes
├── Startup_Profitability_Prediction_Report_v3.pdf
├── Startup_Profitability_Prediction_v3.ipynb
├── startup_data.csv
└── README.md
```
Logistic Regression (Before vs. After hyperparameter tuning)
| Metric | Before Tuning | After Tuning | Change |
|---|---|---|---|
| Training Accuracy | 0.5925 | 0.5925 | No change |
| Test Accuracy | 0.5400 | 0.5400 | No change |
| Training AUC | 0.5927 | 0.5927 | No change |
| Test AUC | 0.5051 | 0.5051 | No change |
| Mean CV Accuracy | 0.5600 | 0.5600 | No change |
| Mean CV AUC | 0.5490 | 0.5490 | No change |
| False Negative Rate | 74.4% | 74.4% | No change |
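The mean cross-validation rows in the table are typically produced with `cross_val_score`; a sketch of that step, again on synthetic data rather than the project's features:

```python
# How the "Mean CV Accuracy" and "Mean CV AUC" figures are usually computed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
clf = LogisticRegression(max_iter=1000)

cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
cv_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Mean CV accuracy: {cv_acc.mean():.4f}")
print(f"Mean CV AUC:      {cv_auc.mean():.4f}")
```
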
- **Hyperparameter tuning did not improve performance** for Logistic Regression.
- **The model remains underfitted**, failing to capture useful patterns.
- **High false negative rate (74.4%)** – many profitable startups were misclassified as unprofitable.
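For reference, the false negative rate quoted above is derived from the confusion matrix; a small worked example with invented labels (not the project's actual predictions):

```python
# FNR = FN / (FN + TP): share of profitable startups classified unprofitable.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 1]   # 1 = profitable (5 positives)
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]   # model misses 3 of the 5 positives

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fnr = fn / (fn + tp)
print(f"False negative rate: {fnr:.1%}")   # → False negative rate: 60.0%
```
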
Random Forest (Before vs. After hyperparameter tuning)
| Metric | Before Tuning | After Tuning | Change |
|---|---|---|---|
| Training Accuracy | 1.0000 | 0.5775 | Reduced (overfitting mitigated) |
| Test Accuracy | 0.5200 | 0.5700 | Improved |
| Training AUC | 1.0000 | 0.6183 | Reduced (overfitting mitigated) |
| Test AUC | 0.5204 | 0.5333 | Slightly improved |
| Mean CV Accuracy | 0.5875 | 0.5425 | Slight decline |
| Mean CV AUC | 0.5789 | 0.5589 | Slight decline |
| False Negative Rate | 62.8% | 53.5% | Improved (fewer profitable startups misclassified) |
- Hyperparameter tuning reduced overfitting in Random Forest.
- Test accuracy improved from 52% → 57%, meaning better generalisation.
- False negative rate improved, making the model better at correctly identifying profitable startups.
- However, AUC score remained low (≈ 0.53), indicating limited predictive power.
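The tuning step behind these results can be sketched with a grid search. The parameter grid below is an assumption for illustration; the project's actual search space is not specified here:

```python
# Minimal grid-search sketch for Random Forest hyperparameter tuning.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],     # shallower trees curb overfitting
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Constraining `max_depth` and `min_samples_leaf` is what pulls training accuracy down from a memorised 1.0000 toward an honest figure, as seen in the table above.
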
- **Market Share (%)** emerged as the strongest predictor of startup profitability in both models.
  - Indicates that a higher market share significantly increases the likelihood of profitability.
  - However, its dominance suggests potential dataset bias toward a single feature.
- **Funding Rounds** was a strong predictor in Logistic Regression but weak in Random Forest.
  - Implies a linear effect with limited interactions.
- **Employees** was more influential in Random Forest.
  - Suggests a non-linear relationship where specific workforce sizes may enhance profitability.
- **Revenue and Funding Amount** proved to be weak predictors.
  - Indicates that capital alone does not determine profitability.
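Rankings like those above can be extracted from fitted models: absolute coefficients for Logistic Regression, impurity-based importances for Random Forest. A sketch on synthetic data, with the feature names assumed from the dataset description:

```python
# Comparing linear coefficients against tree-based feature importances.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

feature_names = ["Market Share (%)", "Funding Rounds", "Employees",
                 "Revenue", "Funding Amount"]   # assumed names
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
rf = RandomForestClassifier(random_state=0).fit(X, y)

ranking = pd.DataFrame({
    "feature": feature_names,
    "lr_coef_abs": abs(lr.coef_[0]),           # linear effect size
    "rf_importance": rf.feature_importances_,  # non-linear importance
}).sort_values("rf_importance", ascending=False)
print(ranking)
```

A feature that ranks high on one column but low on the other (as Funding Rounds did in the project) points to an effect that is largely linear, or largely interaction-driven, respectively.
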
- **Feature Engineering Enhancements**
  - Incorporate historical financial trends, cost structures, and customer-related metrics for improved predictions and a more holistic assessment of startup profitability.
  - Integrate external economic and industry-trend data.
- **Advanced Modelling Approaches**
  - Explore gradient boosting (XGBoost, LightGBM) to capture non-linear relationships.
  - Reformulate the problem as a regression task to predict profit margins instead of binary profitability.
- **Data Balancing & Augmentation**
  - Address class imbalance using oversampling or synthetic data techniques (SMOTE).
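A sketch of the simplest balancing option, random oversampling, using only scikit-learn; SMOTE itself, which synthesises new minority samples rather than duplicating them, lives in the separate imbalanced-learn package (`imblearn.over_sampling.SMOTE`):

```python
# Random oversampling of the minority class with sklearn.utils.resample.
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)      # toy feature matrix
y = np.array([0] * 8 + [1] * 2)       # minority class 1 is under-represented

X_maj, X_min = X[y == 0], X[y == 1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
print(np.bincount(y_bal))             # classes now balanced: [8 8]
```

Oversampling should only ever be applied to the training split, never before the train-test split, or the test set leaks duplicated minority rows.
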
This project provided valuable insights into the challenges of predicting startup profitability using machine learning. Through the implementation of Logistic Regression and Random Forest, I gained a deeper understanding of model performance evaluation, feature importance analysis, and the impact of data quality on predictive accuracy.
I improved my skills in data preprocessing, feature selection, and hyperparameter tuning, particularly in understanding how different models interpret financial data. Additionally, working with AUC, precision, recall, and F1-score deepened my knowledge of model evaluation beyond simple accuracy metrics.
Moving forward, I aim to further develop my expertise in advanced machine learning techniques, particularly ensemble methods like XGBoost and LightGBM, which may better capture non-linear relationships in financial prediction. Additionally, I want to explore Explainable AI (e.g., SHAP, LIME) to enhance model interpretability, ensuring that machine learning can provide more actionable insights for investors and entrepreneurs.
This project highlighted the complexity of predicting business success, and I am keen to further study financial data science, feature engineering techniques, and real-world applications of predictive modelling to enhance my analytical capabilities.