This project focuses on analyzing the Titanic dataset, which includes information about passengers aboard the RMS Titanic. The goal is to explore the data and build a machine learning model to predict passenger survival based on features such as age, class, gender, and ticket information.
Dataset: https://www.kaggle.com/competitions/titanic
The project involves the following steps:
- Data Exploration: The Titanic dataset is explored to understand the features and the relationships between them. Basic data cleaning and preprocessing are done at this stage.
- Data Preprocessing: The dataset is cleaned by handling missing values, encoding categorical variables, and scaling features to prepare it for machine learning.
- Model Building: A machine learning model (e.g., Logistic Regression, Decision Trees, Random Forest) is built to predict the survival of passengers.
- Model Evaluation: The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1 score. Cross-validation and hyperparameter tuning are also performed to optimize the model's performance.
- Visualization: Various visualizations are created using libraries like matplotlib and seaborn to better understand the dataset and the relationships between features.
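The preprocessing, model building, and evaluation steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not necessarily what the notebook does: the choice of feature columns, the median and most-frequent imputation strategies, the use of Logistic Regression, and the helper names `build_pipeline` and `cross_validated_scores` are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature split; the notebook may select different columns.
NUMERIC = ["Age", "SibSp", "Parch", "Fare"]
CATEGORICAL = ["Pclass", "Sex", "Embarked"]
METRICS = ["accuracy", "precision", "recall", "f1"]


def build_pipeline() -> Pipeline:
    """Impute, scale, and encode the features, then fit a logistic regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing numbers
        ("scale", StandardScaler()),                   # zero mean, unit variance
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # fill missing labels
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # one column per category
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


def cross_validated_scores(df: pd.DataFrame, folds: int = 5) -> dict:
    """Return the mean of each metric over stratified k-fold cross-validation."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["Survived"]
    results = cross_validate(build_pipeline(), X, y, cv=folds, scoring=METRICS)
    return {metric: float(results[f"test_{metric}"].mean()) for metric in METRICS}
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are refit on each training fold, so statistics from the held-out fold do not leak into preprocessing. Calling `cross_validated_scores(df)` on the loaded training data returns a dictionary with one averaged value per metric.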
The dataset used in this project is the Titanic dataset from Kaggle, which contains the following columns:
- PassengerId: Unique ID for each passenger.
- Pclass: Passenger class (1st, 2nd, or 3rd).
- Name: Name of the passenger.
- Sex: Gender of the passenger.
- Age: Age of the passenger.
- SibSp: Number of siblings or spouses aboard the Titanic.
- Parch: Number of parents or children aboard the Titanic.
- Ticket: Ticket number.
- Fare: Fare paid by the passenger.
- Cabin: Cabin number.
- Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
- Survived: Survival status (0 = No, 1 = Yes).
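A first exploration pass over these columns might look like the sketch below. It assumes the Kaggle training file has been downloaded and loaded with pandas (for example `df = pd.read_csv("train.csv")`, where the path is an assumption about your local setup); the helper names `absent_columns`, `missing_value_summary`, and `survival_rate_by` are hypothetical and exist only for this example.

```python
import pandas as pd

# Column names as listed above.
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]


def absent_columns(df: pd.DataFrame) -> list:
    """List the expected columns that the DataFrame does not contain."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def missing_value_summary(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of Survived (the share of survivors) within each group of `column`."""
    return df.groupby(column)["Survived"].mean()
```

For instance, `missing_value_summary(df)` shows which columns need imputation before modeling, and `survival_rate_by(df, "Sex")` or `survival_rate_by(df, "Pclass")` gives a quick view of how survival varies across groups.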
- pandas: For data manipulation and analysis.
- numpy: For numerical operations.
- matplotlib and seaborn: For data visualization.
- scikit-learn: For building and evaluating machine learning models.
- xgboost (optional): For boosting models and improving prediction accuracy.
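Before opening the notebook, it can help to confirm that these libraries are importable in the current environment. The sketch below uses only the standard library; the `REQUIRED` mapping and the `missing_libraries` helper are hypothetical names for this example. Note that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec

# Package name (as installed) mapped to its top-level import name.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_libraries(required: dict) -> list:
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]
```

Running `print(missing_libraries(REQUIRED))` prints an empty list when everything is installed; otherwise it lists the packages still to install.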
To get started with this project, follow these steps:
- Clone or download the repository:

      git clone https://github.com/elfgk/Titanic-Data-Analysis.git

- Install the required Python libraries.
- Open the titanic_data_analysis.ipynb Jupyter notebook and follow the steps for data exploration, preprocessing, model building, and evaluation.