Creech Capstone Project

Analysis of Lives Saved When Wearing a Proper Helmet

Author: Julie L Creech

Date: April 6, 2023

Data Sets

https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/

Uses the data sets Fars2021NationalCSV, Fars2018NationalCSV, Fars2015NationalCSV, Fars2012NationalCSV, and Fars2009NationalCSV. Within each data set, two files are used: person.csv and accident.csv. The two files are linked together by a field called ST_CASE.
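For reference, the ST_CASE join can be reproduced in pandas as below. This is a minimal sketch: the file names follow this README, but the coordinate column names are an assumption (FARS spells them LONGITUD/LATITUDE in some years), so check the CSV header first.

```python
import pandas as pd

# File names follow the README; adjust paths as needed.
person = pd.read_csv("person2021.csv", low_memory=False)
accident = pd.read_csv("accident2021.csv", low_memory=False)

# ST_CASE identifies a crash; person.csv has one row per occupant,
# so a left merge attaches the crash-level coordinates to each person.
# Column names are assumed -- some FARS years use LONGITUD/LATITUDE.
merged = person.merge(
    accident[["ST_CASE", "LONGITUD", "LATITUDE"]],
    on="ST_CASE",
    how="left",
)
merged.to_csv("person_with_coords_2021.csv", index=False)
```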

Report

Link to OverLeaf report: https://www.overleaf.com/read/mzdkvqkytrmg

File Descriptions

StatesReqHelmetList.Xlsx: Data scraped from https://thebradleylawfirm.com/personal-injury-resources/helmet-laws-in-missouri/#Missouri_and_Helmet_Laws/ and formatted into columnar data.

FinalData.csv contains all the data in one data set. An "Archive for Git" folder holds each year's final data, as referenced below. Combining the person and accident files is described below:

Fars2021NationalCSV: Original data set = #records --> After cleanup = #records (person2021.csv, accident2021.csv). Used a VLOOKUP to populate the longitude and latitude data into the person file.

Fars2018NationalCSV: Original data set = #records --> After cleanup = #records (person2018.csv, accident2018.csv). Used a VLOOKUP to populate the longitude and latitude data into the person file.

Fars2015NationalCSV: Original data set = #records --> After cleanup = #records (person2015.csv, accident2015.csv). Used a VLOOKUP to populate the longitude and latitude data into the person file.

Fars2012NationalCSV: Original data set = #records --> After cleanup = #records (person2012.csv, accident2012.csv). Used a VLOOKUP to populate the longitude and latitude data into the person file.

Fars2020NationalCSV: Original data set = #records --> After cleanup = #records (person2020.csv, accident2020.csv). Used a VLOOKUP to populate the longitude and latitude data into the person file.

Used Excel to clean the data, with a macro so the manual cleanup only had to be done once. Removed columns that will not be used and removed rows that were not motorcycle data, using VPICBODYCLASSNAME as the filter column to identify motorcycles only. Deleted the rest of the data that would never be a resource. Some files ended up unusable because their formatting was different; column titles varied between years, so much more cleaning was needed than originally thought. The remaining fields are listed below, followed by a sketch of the filtering step.

From person.csv: STATE, STATENAME, *ST_CASE, MAKENAME, MAK_MOD, BODY_TYPNAME, AGE, SEX, SEXNAME, INJ_SEVNAME, DOA, DOANAME, HELM_USE, HELM_USENAME, HELM_MIS, HELM_MISNAME, VPICBODYCLASSNAME

From accident.csv: *ST_CASE, Longitude, Latitude
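A minimal pandas sketch of that filtering step, assuming the column names above; the exact motorcycle label inside VPICBODYCLASSNAME varies by year, so the string match is an assumption to verify against the real data:

```python
import pandas as pd

# Columns to keep, taken from the field list above.
KEEP_PERSON_COLS = [
    "STATE", "STATENAME", "ST_CASE", "MAKENAME", "MAK_MOD",
    "BODY_TYPNAME", "AGE", "SEX", "SEXNAME", "INJ_SEVNAME",
    "DOA", "DOANAME", "HELM_USE", "HELM_USENAME",
    "HELM_MIS", "HELM_MISNAME", "VPICBODYCLASSNAME",
]

person = pd.read_csv("person2021.csv", low_memory=False)

# Keep only motorcycle records; the label is an assumption --
# inspect person["VPICBODYCLASSNAME"].unique() to confirm it.
motorcycles = person[
    person["VPICBODYCLASSNAME"].str.contains("Motorcycle", case=False, na=False)
]

# Drop every column that will not be used downstream.
motorcycles = motorcycles[KEEP_PERSON_COLS]
```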


Used Excel's VLOOKUP function to pull the longitude and latitude data into one table. Moved the cleaned data to one file within each folder, named FARS[year]National.csv. Pulled all the data together into either one Tableau file or into PostgreSQL.
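Loading the combined CSV into PostgreSQL can be done with pandas and SQLAlchemy; the connection string and table name below are placeholders, not values from this project:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- substitute real credentials
# (requires a PostgreSQL driver such as psycopg2 installed).
engine = create_engine("postgresql://user:password@localhost:5432/capstone")

final = pd.read_csv("FinalData.csv", low_memory=False)

# Write the combined data set to a table that Tableau or SQL queries
# can read; if_exists="replace" keeps reruns idempotent.
final.to_sql("fars_motorcycle", engine, if_exists="replace", index=False)
```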

Files used for cleaning and manipulations in Jupyter Notebooks:

https://github.com/jcreech72/CreechCapstone/blob/main/data_manipulation.py
https://github.com/jcreech72/CreechCapstone/blob/main/data_cleaning.ipynb
https://github.com/jcreech72/CreechCapstone/blob/main/Data_Manipulation.ipynb

Tally data in Excel

The file Tally Data.xlsx provides a tab called Fields where the fields were analyzed for each year. The column titles were not all the same, and some data was not included in every year. By pasting the columns into that tab, the column titles could be cleaned, the naming convention unified, and the missing data identified. A minimum-viable-product approach was taken, meaning that data missing from any year went unused across all data sets. While validating the data, a tally table was built that provided some additional stats. From that, a chart was created showing that in 2015, 2018, 2020, and 2021 there is evidence of fewer rider fatalities among DOT-helmet users than among riders who wore no helmet or a non-DOT-approved helmet. Comparing the two groups directly, riders who wore no helmet account for 62% of the fatalities versus riders who wore a DOT-approved helmet.
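The same tallies can be reproduced in pandas with a crosstab of year against helmet use. Column names are assumed from the field list above, and YEAR is a hypothetical column assumed to have been added when the per-year files were combined:

```python
import pandas as pd

final = pd.read_csv("FinalData.csv", low_memory=False)

# Fatalities per year broken out by helmet choice.
# HELM_USENAME comes from the field list; YEAR is an assumption.
tally = pd.crosstab(final["YEAR"], final["HELM_USENAME"])
print(tally)

# Share of each helmet choice within each year.
print(tally.div(tally.sum(axis=1), axis=0).round(3))
```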

Count of Deaths Based on Helmet Choice in 2020

Pivot data provided some initial data points. The FinalDataMapping.XLSX file shows how the data was pivoted to provide an annual YoY view of helmet choice and fatality results. The 2020 fatality count was used as a sample, although that dataset will not be used for the analysis. The sample showed a total of 5,776 deaths across the United States, of which 202 were uncertain records that do not pertain to the analysis:

- Motorcyclists wearing no helmet at all: 2,275 --> 39% of the fatalities
- Motorcyclists wearing a non-DOT-approved or novelty helmet: 1,836 --> 32% of the fatalities
- Motorcyclists wearing a DOT-approved helmet: 1,463 --> 25% of the fatalities

71% of the fatalities involved the decision to wear a non-approved helmet or no helmet, 25% wore an approved helmet, and the remaining 3% of the data was unusable.
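A quick check that the stated counts and percentages are consistent, using only the numbers above:

```python
total = 5_776           # 2020 motorcycle fatalities in the sample
no_helmet = 2_275       # no helmet at all
non_dot = 1_836         # non-DOT-approved or novelty helmet
dot = 1_463             # DOT-approved helmet
uncertain = 202         # uncertain / unusable helmet data

# The four groups account for every record in the sample.
assert no_helmet + non_dot + dot + uncertain == total

for label, n in [("no helmet", no_helmet), ("non-DOT", non_dot),
                 ("DOT", dot), ("uncertain", uncertain)]:
    print(f"{label}: {n / total:.0%}")
# no helmet: 39%, non-DOT: 32%, DOT: 25%, uncertain: 3%
# 39% + 32% = 71% wore no helmet or a non-approved one.
```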

Cleaning Data

Data cleaning was done in a Jupyter Notebook and can be found in the data_cleaning.ipynb file. The data came from multiple years of CSV files that did not follow a formatting standard. Cleaning specifics are detailed in the LaTeX report: https://www.overleaf.com/read/mzdkvqkytrmg. All data was pulled into one file, and a Python script was created to clean it (see data_cleaning.ipynb). Next, null values were removed; the dropped nulls totaled only 22. The data types were checked and the data was checked for outliers. As expected, SEX showed more males than females, the years were split between 2015 and 2021, and AGE had a median of 43 with a top of the range at 96. The minimum age of 1 was surprising, but it was not assumed to be an anomaly or outlier because there were other low-range ages as well.
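A condensed sketch of those checks (the notebook itself is the authoritative version; column names are assumed from the field list):

```python
import pandas as pd

df = pd.read_csv("FinalData.csv", low_memory=False)

# Drop null rows -- the README reports only 22 were removed.
before = len(df)
df = df.dropna()
print(f"dropped {before - len(df)} null rows")

# Check the data types and look for outliers in the numeric columns.
print(df.dtypes)
print(df["AGE"].describe())      # median ~43, max 96, min 1 per the README

# Distribution check noted in the README: more males than females.
print(df["SEXNAME"].value_counts())
```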

The Data_Manipulation.ipynb notebook provides the linear regression and chi-square test; a sketch of both follows.
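A minimal sketch of the two tests, assuming the helmet-use and injury-severity columns from the field list; the notebook's exact variables and model may differ:

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.linear_model import LinearRegression

df = pd.read_csv("FinalData.csv", low_memory=False)

# Chi-square test: is helmet choice independent of injury severity?
table = pd.crosstab(df["HELM_USENAME"], df["INJ_SEVNAME"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.4f}, dof={dof}")

# Linear regression on the coded helmet-use field vs. AGE -- purely
# illustrative; the notebook may regress different variables.
model = LinearRegression().fit(df[["HELM_USE"]], df["AGE"])
print(model.coef_, model.intercept_)
```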

Python Notebook

- Review data & statistics (loaded the cleaned CSV file into the Python notebook)
- Visualizations & finding outliers
- Remove outliers & look at distributions without outliers (see the sketch below)
- Visualizations & correlations
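One common way to do the outlier step is the 1.5x-IQR rule; this is a sketch of that approach, not necessarily the exact method used in the notebook:

```python
import pandas as pd

df = pd.read_csv("FinalData.csv", low_memory=False)

# 1.5 * IQR fence on AGE; rows outside the fence are treated as outliers.
q1, q3 = df["AGE"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["AGE"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = df[mask]

print(f"removed {len(df) - len(trimmed)} outlier rows")
print(trimmed["AGE"].describe())

# Correlations on the remaining numeric columns.
print(trimmed.corr(numeric_only=True))
```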
