This dataset is used to predict complications of Myocardial Infarction (MI) based on the information about the patient. The target value 0 is no complication and 1 means complication within the first three days of hospitalization.
MI is one of the most challenging problems of modern medicine. Acute myocardial infarction is associated with high mortality in the first year after it. The incidence of MI remains high in all countries. This is especially true for the urban population of highly developed countries, which is exposed to chronic stress factors, irregular and not always balanced nutrition. In the United States, for example, more than a million people suffer from MI every year, and 200-300 thousand of them die from acute MI before arriving at the hospital. In this regard, predicting complications of myocardial infarction in order to timely carry out the necessary preventive measures is an important task.
- Age
- Gender
- Myocardial: Quantity of myocardial infarctions in the anamnesis – Ordinal
- Exertional angina: Exertional angina pectoris in the anamnesis
- FC: Functional class (FC) of angina pectoris in the last year – Ordinal
- Heart Disease: Coronary heart disease (CHD) in recent weeks, days before admission to hospital
- Heredity: Heredity on CHD
- Hypertension: Presence of an essential hypertension
- Symptomatic hypertension
- Duration: Duration of arterial hypertension
- Arrhythmia: Observing of arrhythmia in the anamnesis
- Systolic_emergency: Systolic blood pressure according to Emergency Cardiology Team
- Diastolic_emergency: Diastolic blood pressure according to Emergency Cardiology Team
- Systolic_intensive_care: Systolic blood pressure according to intensive care unit
- Diastolic_intensive_care: Diastolic blood pressure according to intensive care unit
- Potassium: Serum potassium content
- Sodium: Serum sodium content
- AlAT: Serum AlAT content
- AsTK: Serum AsTK content
- WBC: White Blood Cell Count
- ESR: Erythrocyte sedimentation rate
- Time: Time elapsed from the beginning of the attack of CHD to the hospital
- Outcome: target column
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1275 non-null object
1 Gender 1275 non-null object
2 myocardial 1275 non-null object
3 Exertional angina 1275 non-null object
4 FC 1275 non-null object
5 Heart Disease 1275 non-null object
6 Heredity 1275 non-null object
7 Hypertension 1275 non-null object
8 Symptomatic hypertension 1275 non-null object
9 Duration 1275 non-null object
10 Arrhythmia 1275 non-null object
11 Systolic_emergency 1275 non-null object
12 Diastolic_emergency 1275 non-null object
13 Systolic_intensive_care 1275 non-null object
14 Diastolic_intensive_care 1275 non-null object
15 Potassium 1275 non-null object
16 Sodium 1275 non-null object
17 AlAT 1275 non-null object
18 AsAT 1275 non-null object
19 WBC 1275 non-null object
20 ESR 1275 non-null object
21 Time 1275 non-null object
22 Outcome 1275 non-null int64
dtypes: int64(1), object(22)
memory usage: 229.2+ KB
We can see that all the columns are of type object and hence we don't see any missing values here.
train.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | Gender | myocardial | Exertional angina | FC | Heart Disease | Heredity | Hypertension | Symptomatic hypertension | Duration | ... | Systolic_intensive_care | Diastolic_intensive_care | Potassium | Sodium | AlAT | AsAT | WBC | ESR | Time | Outcome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75 | Female | 0 | Two years ago | II FC | Exertional angina | ? | Stage 2 | No | ? | ... | 140 | 90 | ? | ? | 0.3 | 0.18 | 7.8 | 16 | 7 | 0 |
1 | 50 | Male | 1 | Two years ago | II FC | Unstable angina | ? | Stage 2 | No | One year | ... | ? | ? | 3.9 | 132 | 0.23 | 0.52 | 6.2 | 20 | 7 | 0 |
2 | 54 | Male | 0 | Never | No angina | No angina | ? | No | No | No hypertension | ... | 140 | 100 | ? | ? | ? | ? | 6.9 | 6 | ? | 0 |
3 | 51 | Male | ? | ? | ? | Unstable angina | ? | ? | ? | ? | ... | 0 | 0 | ? | ? | ? | ? | ? | ? | 2 | 1 |
4 | 76 | Female | 3 | Never | No angina | Unstable angina | ? | Stage 2 | No | More than 10 years | ... | 110 | 70 | ? | ? | 0.15 | 0.26 | 4 | 5 | 7 | 0 |
5 rows Ă— 23 columns
From observing the top 5 rows, we see that a lot of columns have '?' in it indicating missing values.
Replacing '?' across the columns as NaN
for col in train.columns:
train[col] = train[col].replace('?',np.nan)
Converting numerical columns to float
train['Age'] = train['Age'].astype(float)
train['myocardial'] = train['myocardial'].astype(float)
train[train.columns[11:22]] = train[train.columns[11:22]].astype(float)
train.describe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | myocardial | Systolic_emergency | Diastolic_emergency | Systolic_intensive_care | Diastolic_intensive_care | Potassium | Sodium | AlAT | AsAT | WBC | ESR | Time | Outcome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1274.000000 | 1274.00000 | 474.000000 | 474.000000 | 1066.000000 | 1066.000000 | 993.000000 | 992.000000 | 1067.000000 | 1066.000000 | 1188.000000 | 1132.000000 | 1178.000000 | 1275.000000 |
mean | 64.154631 | 0.56044 | 137.700422 | 82.004219 | 134.812383 | 83.076923 | 4.194361 | 136.607863 | 0.472671 | 0.262336 | 8.843519 | 13.475265 | 4.702037 | 0.160784 |
std | 46.793076 | 0.83419 | 34.681988 | 19.997145 | 31.734114 | 18.631784 | 0.770440 | 6.598662 | 0.386188 | 0.206220 | 3.449176 | 10.796416 | 2.858370 | 0.367476 |
min | 26.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.300000 | 117.000000 | 0.030000 | 0.040000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 54.000000 | 0.00000 | 120.000000 | 70.000000 | 120.000000 | 80.000000 | 3.700000 | 133.000000 | 0.230000 | 0.150000 | 6.400000 | 5.000000 | 2.000000 | 0.000000 |
50% | 63.000000 | 0.00000 | 140.000000 | 80.000000 | 130.000000 | 80.000000 | 4.100000 | 136.000000 | 0.380000 | 0.220000 | 8.100000 | 10.000000 | 4.000000 | 0.000000 |
75% | 70.000000 | 1.00000 | 160.000000 | 90.000000 | 150.000000 | 90.000000 | 4.600000 | 140.000000 | 0.610000 | 0.300000 | 10.500000 | 19.000000 | 7.000000 | 0.000000 |
max | 999.000000 | 3.00000 | 260.000000 | 190.000000 | 260.000000 | 190.000000 | 8.000000 | 169.000000 | 3.000000 | 2.150000 | 27.900000 | 68.000000 | 9.000000 | 1.000000 |
train['Age'] = train['Age'].replace(999,np.nan)
train.describe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | myocardial | Systolic_emergency | Diastolic_emergency | Systolic_intensive_care | Diastolic_intensive_care | Potassium | Sodium | AlAT | AsAT | WBC | ESR | Time | Outcome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1271.000000 | 1274.00000 | 474.000000 | 474.000000 | 1066.000000 | 1066.000000 | 993.000000 | 992.000000 | 1067.000000 | 1066.000000 | 1188.000000 | 1132.000000 | 1178.000000 | 1275.000000 |
mean | 61.948072 | 0.56044 | 137.700422 | 82.004219 | 134.812383 | 83.076923 | 4.194361 | 136.607863 | 0.472671 | 0.262336 | 8.843519 | 13.475265 | 4.702037 | 0.160784 |
std | 11.201609 | 0.83419 | 34.681988 | 19.997145 | 31.734114 | 18.631784 | 0.770440 | 6.598662 | 0.386188 | 0.206220 | 3.449176 | 10.796416 | 2.858370 | 0.367476 |
min | 26.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.300000 | 117.000000 | 0.030000 | 0.040000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 54.000000 | 0.00000 | 120.000000 | 70.000000 | 120.000000 | 80.000000 | 3.700000 | 133.000000 | 0.230000 | 0.150000 | 6.400000 | 5.000000 | 2.000000 | 0.000000 |
50% | 63.000000 | 0.00000 | 140.000000 | 80.000000 | 130.000000 | 80.000000 | 4.100000 | 136.000000 | 0.380000 | 0.220000 | 8.100000 | 10.000000 | 4.000000 | 0.000000 |
75% | 70.000000 | 1.00000 | 160.000000 | 90.000000 | 150.000000 | 90.000000 | 4.600000 | 140.000000 | 0.610000 | 0.300000 | 10.500000 | 19.000000 | 7.000000 | 0.000000 |
max | 92.000000 | 3.00000 | 260.000000 | 190.000000 | 260.000000 | 190.000000 | 8.000000 | 169.000000 | 3.000000 | 2.150000 | 27.900000 | 68.000000 | 9.000000 | 1.000000 |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1271 non-null float64
1 Gender 1275 non-null object
2 myocardial 1274 non-null float64
3 Exertional angina 1202 non-null object
4 FC 1225 non-null object
5 Heart Disease 1235 non-null object
6 Heredity 57 non-null object
7 Hypertension 1271 non-null object
8 Symptomatic hypertension 1272 non-null object
9 Duration 1085 non-null object
10 Arrhythmia 1261 non-null object
11 Systolic_emergency 474 non-null float64
12 Diastolic_emergency 474 non-null float64
13 Systolic_intensive_care 1066 non-null float64
14 Diastolic_intensive_care 1066 non-null float64
15 Potassium 993 non-null float64
16 Sodium 992 non-null float64
17 AlAT 1067 non-null float64
18 AsAT 1066 non-null float64
19 WBC 1188 non-null float64
20 ESR 1132 non-null float64
21 Time 1178 non-null float64
22 Outcome 1275 non-null int64
dtypes: float64(13), int64(1), object(9)
memory usage: 229.2+ KB
for col in test.columns:
test[col] = test[col].replace('?',np.nan)
#Converting numerical columns to float
test['Age'] = test['Age'].astype(float)
test['myocardial'] = test['myocardial'].astype(float)
test[test.columns[11:22]] = test[test.columns[11:22]].astype(float)
test['Age'] = test['Age'].replace(999,np.nan)
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 421 non-null float64
1 Gender 425 non-null object
2 myocardial 422 non-null float64
3 Exertional angina 392 non-null object
4 FC 402 non-null object
5 Heart Disease 414 non-null object
6 Heredity 15 non-null object
7 Hypertension 420 non-null object
8 Symptomatic hypertension 420 non-null object
9 Duration 367 non-null object
10 Arrhythmia 418 non-null object
11 Systolic_emergency 150 non-null float64
12 Diastolic_emergency 150 non-null float64
13 Systolic_intensive_care 367 non-null float64
14 Diastolic_intensive_care 367 non-null float64
15 Potassium 336 non-null float64
16 Sodium 333 non-null float64
17 AlAT 349 non-null float64
18 AsAT 349 non-null float64
19 WBC 387 non-null float64
20 ESR 365 non-null float64
21 Time 396 non-null float64
dtypes: float64(13), object(9)
memory usage: 73.2+ KB
Observing the Missing values
train.isnull().any(axis= 'columns').sum()
1267
train.isnull().sum()
Age 4
Gender 0
myocardial 1
Exertional angina 73
FC 50
Heart Disease 40
Heredity 1218
Hypertension 4
Symptomatic hypertension 3
Duration 190
Arrhythmia 14
Systolic_emergency 801
Diastolic_emergency 801
Systolic_intensive_care 209
Diastolic_intensive_care 209
Potassium 282
Sodium 283
AlAT 208
AsAT 209
WBC 87
ESR 143
Time 97
Outcome 0
dtype: int64
train.Age.hist()
plt.title('Age histogram')
Text(0.5, 1.0, 'Age histogram')
The Age histogram is approximately normally distributed with Mean ~ 62 and Median ~ 63 and will hence use Median imputation
train[['Age','Outcome']].groupby('Outcome').median()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | |
---|---|
Outcome | |
0 | 62.0 |
1 | 67.0 |
We see that mean Age for people tested to have MI complication is greater than people who have no MI complication.
train[train['Age'].isnull()]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | Gender | myocardial | Exertional angina | FC | Heart Disease | Heredity | Hypertension | Symptomatic hypertension | Duration | ... | Systolic_intensive_care | Diastolic_intensive_care | Potassium | Sodium | AlAT | AsAT | WBC | ESR | Time | Outcome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
248 | NaN | Male | 0.0 | Two years ago | II FC | Unstable angina | NaN | Stage 2 | No | More than 10 years | ... | NaN | NaN | 4.7 | 142.0 | 0.30 | 0.07 | 10.3 | 17.0 | 6.0 | 0 |
295 | NaN | Male | 0.0 | Two years ago | II FC | Unstable angina | NaN | No | No | No hypertension | ... | NaN | NaN | 3.9 | 131.0 | 0.45 | 0.30 | 12.7 | 9.0 | 3.0 | 0 |
547 | NaN | Female | 2.0 | More than five years ago | II FC | Exertional angina | NaN | Stage 2 | No | Six to ten years | ... | 100.0 | 60.0 | 4.6 | 132.0 | 0.75 | 0.22 | 5.6 | 14.0 | 3.0 | 0 |
729 | NaN | Male | 0.0 | Never | No angina | No angina | NaN | No | No | No hypertension | ... | 140.0 | 90.0 | NaN | NaN | NaN | NaN | 15.5 | 10.0 | 8.0 | 0 |
4 rows Ă— 23 columns
train[['Age','Gender']].groupby('Gender').median()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Age | |
---|---|
Gender | |
Female | 68.0 |
Male | 59.0 |
Median age in the dataset for Female is 68 and Male is 59.
Age imputation will be done maintaining this distribution
train['Age'] = train['Age'].fillna(train.groupby('Gender')['Age'].transform('median'))
test['Age'] = test['Age'].fillna(train.groupby('Gender')['Age'].transform('median'))
train.Gender.value_counts().plot(kind = 'bar', title = 'Gender counts')
<AxesSubplot:title={'center':'Gender counts'}>
We can see that there are more Male than Female in the dataset. Mapping Male to 1 and Female to 0
train['Gender']=train['Gender'].map({'Female':0,'Male':1}).astype(int)
test['Gender']=test['Gender'].map({'Female':0,'Male':1}).astype(int)
train['myocardial'].value_counts().plot(kind = 'bar', title = 'Myocardial counts')
<AxesSubplot:title={'center':'Myocardial counts'}>
We see that 0 is the dominant value
Examining Heart Disease and Myocardial value to check if it has a variation.
train[['Heart Disease','myocardial']].groupby('Heart Disease').agg(pd.Series.mode)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
myocardial | |
---|---|
Heart Disease | |
Exertional angina | 0.0 |
No angina | 0.0 |
Unstable angina | 0.0 |
We don't see a difference and is hence not useful.
Imputing mode = 0 to the missing value
train['myocardial'] = train['myocardial'].fillna(0).astype(int)
test['myocardial'] = test['myocardial'] .fillna(0).astype(int)
train['Exertional angina'].value_counts()
Never 504
More than five years ago 254
During the last year 105
One year ago 103
Two years ago 92
Four to five years ago 91
Three years ago 53
Name: Exertional angina, dtype: int64
train['Exertional angina'].value_counts().plot(kind = 'bar', title = 'Exertional Angina counts')
<AxesSubplot:title={'center':'Exertional Angina counts'}>
#mode imputation and one-hot vector encoding - train data
train['Exertional angina'].replace(np.NaN, 'Never', inplace = True)
cols = pd.get_dummies(train['Exertional angina'], prefix = 'Exertionalangina')
train[cols.columns] = cols
train.drop('Exertional angina', axis = 1, inplace = True)
#mode imputation and one-hot vector encoding - test data
test['Exertional angina'].replace(np.NaN, 'Never', inplace = True)
cols = pd.get_dummies(test['Exertional angina'],prefix = 'Exertionalangina')
test[cols.columns] = cols
test.drop('Exertional angina', axis = 1, inplace = True)
train['FC'].value_counts()
II FC 639
No angina 503
III FC 38
I FC 37
IV FC 8
Name: FC, dtype: int64
train['FC'].value_counts().plot(kind = 'bar', title = 'Functional class count')
<AxesSubplot:title={'center':'Functional class count'}>
Replacing NaN with most frequent value and mapping the categorical value to numerical,thereby making it ordinal
train['FC'].replace(np.NaN, 'II FC', inplace = True)
train['FC'] = train['FC'].map({'No angina':0,
'I FC':1,
'II FC':2,
'III FC':3,
'IV FC':4 }).astype(int)
test['FC'].replace(np.NaN, 'II FC', inplace = True)
test['FC'] = test['FC'].map({'No angina':0,
'I FC':1,
'II FC':2,
'III FC':3,
'IV FC':4 }).astype(int)
train['Heart Disease'].value_counts().plot(kind = 'bar', title = 'Heart Disease counts' )
<AxesSubplot:title={'center':'Heart Disease counts'}>
Replacing NaN with most common category and one-hot encoding the column
train['Heart Disease'].fillna('Unstable angina', inplace = True)
cols = pd.get_dummies(train['Heart Disease'], prefix = 'HeartDisease')
train[cols.columns] = cols
train.drop('Heart Disease', axis = 1, inplace = True)
test['Heart Disease'].fillna('Unstable angina', inplace = True)
cols = pd.get_dummies(test['Heart Disease'], prefix = 'HeartDisease')
test[cols.columns] = cols
test.drop('Heart Disease', axis = 1, inplace = True)
train['Hypertension'].unique()
array(['Stage 2', 'No', nan, 'Stage 3', 'Stage 1'], dtype=object)
train['Hypertension'].value_counts()
Stage 2 670
No 445
Stage 3 148
Stage 1 8
Name: Hypertension, dtype: int64
train['Hypertension'].fillna('Stage 2', inplace = True)
cols = pd.get_dummies(train['Hypertension'], prefix = 'Hypertension')
train[cols.columns] = cols
train.drop('Hypertension', axis = 1, inplace = True)
test['Hypertension'].fillna('Stage 2', inplace = True)
cols = pd.get_dummies(test['Hypertension'],prefix = 'Hypertension')
test[cols.columns] = cols
test.drop('Hypertension', axis = 1, inplace = True)
train['Symptomatic hypertension'].unique()
array(['No', nan, 'Yes'], dtype=object)
train['Symptomatic hypertension'].value_counts().plot(kind = 'bar', title = 'Symptomatic Hypertension count')
<AxesSubplot:title={'center':'Symptomatic Hypertension count'}>
Most of the values are No. Replacing NA's with 'No' and encoding most frequent value 'No' as 1 and 'Yes' as 0
train['Symptomatic hypertension'].fillna('No', inplace = True)
train['Symptomatic hypertension'] = train['Symptomatic hypertension'].map({'No':1,'Yes':0})
test['Symptomatic hypertension'].fillna('No', inplace = True)
test['Symptomatic hypertension'] = test['Symptomatic hypertension'].map({'No':1,'Yes':0})
test['Symptomatic hypertension'].value_counts()
1 417
0 8
Name: Symptomatic hypertension, dtype: int64
train['Duration'].unique()
array([nan, 'One year', 'No hypertension', 'More than 10 years',
'Six to ten years', 'Three years', 'Five years', 'Four years',
'Two years'], dtype=object)
train['Duration'].value_counts().plot(kind = 'bar', title = 'Duration counts')
<AxesSubplot:title={'center':'Duration counts'}>
Replacing NAs with 'No hypertension' and one-hot encoding the column
train['Duration'].fillna('No hypertension', inplace = True)
cols = pd.get_dummies(train['Duration'],prefix = 'Duration')
train[cols.columns] = cols
train.drop('Duration', axis = 1, inplace = True)
test['Duration'].fillna('No hypertension', inplace = True)
cols = pd.get_dummies(test['Duration'],prefix = 'Duration')
test[cols.columns] = cols
test.drop('Duration', axis = 1, inplace = True)
train['Arrhythmia'].unique()
array(['No', nan, 'Yes'], dtype=object)
train['Arrhythmia'].value_counts().plot(kind = 'bar', title = 'Arrhythmia counts')
<AxesSubplot:title={'center':'Arrhythmia counts'}>
Replacing NAs with 'No' and mapping 'No' to 1 and 'Yes' to 0
train['Arrhythmia'].fillna('No', inplace = True)
train['Arrhythmia']= train['Arrhythmia'].map({'No':1,'Yes':0})
test['Arrhythmia'].fillna('No', inplace = True)
test['Arrhythmia']= test['Arrhythmia'].map({'No':1,'Yes':0})
train['Systolic_intensive_care'].describe()
count 1066.000000
mean 134.812383
std 31.734114
min 0.000000
25% 120.000000
50% 130.000000
75% 150.000000
max 260.000000
Name: Systolic_intensive_care, dtype: float64
train['Systolic_intensive_care'].hist()
<AxesSubplot:>
Replacing NAs with Median
train['Systolic_intensive_care'].fillna(train['Systolic_intensive_care'].median(), inplace = True)
test['Systolic_intensive_care'].fillna(train['Systolic_intensive_care'].median(), inplace = True)
train['Diastolic_intensive_care'].describe()
count 1066.000000
mean 83.076923
std 18.631784
min 0.000000
25% 80.000000
50% 80.000000
75% 90.000000
max 190.000000
Name: Diastolic_intensive_care, dtype: float64
train['Diastolic_intensive_care'].hist()
<AxesSubplot:>
Replacing NAs with median
train['Diastolic_intensive_care'].fillna(train['Diastolic_intensive_care'].median(), inplace = True)
test['Diastolic_intensive_care'].fillna(train['Diastolic_intensive_care'].median(), inplace = True)
train['Potassium'].describe()
count 993.000000
mean 4.194361
std 0.770440
min 2.300000
25% 3.700000
50% 4.100000
75% 4.600000
max 8.000000
Name: Potassium, dtype: float64
train['Potassium'].hist()
<AxesSubplot:>
train['Potassium'].fillna(train['Potassium'].median(), inplace = True)
test['Potassium'].fillna(train['Potassium'].median(), inplace = True)
train['Sodium'].describe()
count 992.000000
mean 136.607863
std 6.598662
min 117.000000
25% 133.000000
50% 136.000000
75% 140.000000
max 169.000000
Name: Sodium, dtype: float64
train['Sodium'].hist()
<AxesSubplot:>
train['Sodium'].fillna(train['Sodium'].median(), inplace = True)
test['Sodium'].fillna(train['Sodium'].median(), inplace = True)
train['AlAT'].describe()
count 1067.000000
mean 0.472671
std 0.386188
min 0.030000
25% 0.230000
50% 0.380000
75% 0.610000
max 3.000000
Name: AlAT, dtype: float64
train['AlAT'].hist()
<AxesSubplot:>
train['AlAT'].fillna(train['AlAT'].median(), inplace = True)
test['AlAT'].fillna(train['AlAT'].median(), inplace = True)
train['AsAT'].describe()
count 1066.000000
mean 0.262336
std 0.206220
min 0.040000
25% 0.150000
50% 0.220000
75% 0.300000
max 2.150000
Name: AsAT, dtype: float64
train['AsAT'].hist()
<AxesSubplot:>
train['AsAT'].fillna(train['AsAT'].median(), inplace = True)
test['AsAT'].fillna(train['AsAT'].median(), inplace = True)
train['WBC'].describe()
count 1188.000000
mean 8.843519
std 3.449176
min 2.000000
25% 6.400000
50% 8.100000
75% 10.500000
max 27.900000
Name: WBC, dtype: float64
train['WBC'].hist()
<AxesSubplot:>
train['WBC'].fillna(train['WBC'].median(), inplace = True)
test['WBC'].fillna(train['WBC'].median(), inplace = True)
train['ESR'].describe()
count 1132.000000
mean 13.475265
std 10.796416
min 1.000000
25% 5.000000
50% 10.000000
75% 19.000000
max 68.000000
Name: ESR, dtype: float64
train['ESR'].hist()
<AxesSubplot:>
train['ESR'].fillna(train['ESR'].median(), inplace = True)
test['ESR'].fillna(train['ESR'].median(), inplace = True)
train['Time'].describe()
count 1178.000000
mean 4.702037
std 2.858370
min 1.000000
25% 2.000000
50% 4.000000
75% 7.000000
max 9.000000
Name: Time, dtype: float64
train['Time'].hist()
<AxesSubplot:>
train['Time'].fillna(train['Time'].median(), inplace = True)
test['Time'].fillna(train['Time'].median(), inplace = True)
train['Outcome'].value_counts().plot(kind = 'bar', title = 'Outcome Counts')
<AxesSubplot:title={'center':'Outcome Counts'}>
We see that the value 0 is dominant, i.e., there are more 0's than 1's and the data is an unbalanced dataset
train.drop('Heredity', axis = 1, inplace = True)
test.drop('Heredity', axis = 1, inplace = True)
train.drop('Systolic_emergency', axis = 1, inplace = True)
test.drop('Systolic_emergency', axis = 1, inplace = True)
train.drop('Diastolic_emergency', axis = 1, inplace = True)
test.drop('Diastolic_emergency', axis = 1, inplace = True)
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 425 non-null float64
1 Gender 425 non-null int64
2 myocardial 425 non-null int64
3 FC 425 non-null int64
4 Symptomatic hypertension 425 non-null int64
5 Arrhythmia 425 non-null int64
6 Systolic_intensive_care 425 non-null float64
7 Diastolic_intensive_care 425 non-null float64
8 Potassium 425 non-null float64
9 Sodium 425 non-null float64
10 AlAT 425 non-null float64
11 AsAT 425 non-null float64
12 WBC 425 non-null float64
13 ESR 425 non-null float64
14 Time 425 non-null float64
15 Exertionalangina_During the last year 425 non-null uint8
16 Exertionalangina_Four to five years ago 425 non-null uint8
17 Exertionalangina_More than five years ago 425 non-null uint8
18 Exertionalangina_Never 425 non-null uint8
19 Exertionalangina_One year ago 425 non-null uint8
20 Exertionalangina_Three years ago 425 non-null uint8
21 Exertionalangina_Two years ago 425 non-null uint8
22 HeartDisease_Exertional angina 425 non-null uint8
23 HeartDisease_No angina 425 non-null uint8
24 HeartDisease_Unstable angina 425 non-null uint8
25 Hypertension_No 425 non-null uint8
26 Hypertension_Stage 1 425 non-null uint8
27 Hypertension_Stage 2 425 non-null uint8
28 Hypertension_Stage 3 425 non-null uint8
29 Duration_Five years 425 non-null uint8
30 Duration_Four years 425 non-null uint8
31 Duration_More than 10 years 425 non-null uint8
32 Duration_No hypertension 425 non-null uint8
33 Duration_One year 425 non-null uint8
34 Duration_Six to ten years 425 non-null uint8
35 Duration_Three years 425 non-null uint8
36 Duration_Two years 425 non-null uint8
dtypes: float64(10), int64(5), uint8(22)
memory usage: 59.1 KB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1275 non-null float64
1 Gender 1275 non-null int64
2 myocardial 1275 non-null int64
3 FC 1275 non-null int64
4 Symptomatic hypertension 1275 non-null int64
5 Arrhythmia 1275 non-null int64
6 Systolic_intensive_care 1275 non-null float64
7 Diastolic_intensive_care 1275 non-null float64
8 Potassium 1275 non-null float64
9 Sodium 1275 non-null float64
10 AlAT 1275 non-null float64
11 AsAT 1275 non-null float64
12 WBC 1275 non-null float64
13 ESR 1275 non-null float64
14 Time 1275 non-null float64
15 Outcome 1275 non-null int64
16 Exertionalangina_During the last year 1275 non-null uint8
17 Exertionalangina_Four to five years ago 1275 non-null uint8
18 Exertionalangina_More than five years ago 1275 non-null uint8
19 Exertionalangina_Never 1275 non-null uint8
20 Exertionalangina_One year ago 1275 non-null uint8
21 Exertionalangina_Three years ago 1275 non-null uint8
22 Exertionalangina_Two years ago 1275 non-null uint8
23 HeartDisease_Exertional angina 1275 non-null uint8
24 HeartDisease_No angina 1275 non-null uint8
25 HeartDisease_Unstable angina 1275 non-null uint8
26 Hypertension_No 1275 non-null uint8
27 Hypertension_Stage 1 1275 non-null uint8
28 Hypertension_Stage 2 1275 non-null uint8
29 Hypertension_Stage 3 1275 non-null uint8
30 Duration_Five years 1275 non-null uint8
31 Duration_Four years 1275 non-null uint8
32 Duration_More than 10 years 1275 non-null uint8
33 Duration_No hypertension 1275 non-null uint8
34 Duration_One year 1275 non-null uint8
35 Duration_Six to ten years 1275 non-null uint8
36 Duration_Three years 1275 non-null uint8
37 Duration_Two years 1275 non-null uint8
dtypes: float64(10), int64(6), uint8(22)
memory usage: 186.9 KB
#Visualizing few columns
from pandas.plotting import scatter_matrix
X = train.drop('Outcome', axis = 1)
y = train['Outcome']
attributes = X.columns[6:11]
scatter_matrix(X[attributes], figsize = (15,15), c = y, alpha = 0.8, marker = 'O')
array([[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Systolic_intensive_care'>,
<AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Systolic_intensive_care'>,
<AxesSubplot:xlabel='Potassium', ylabel='Systolic_intensive_care'>,
<AxesSubplot:xlabel='Sodium', ylabel='Systolic_intensive_care'>,
<AxesSubplot:xlabel='AlAT', ylabel='Systolic_intensive_care'>],
[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Diastolic_intensive_care'>,
<AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Diastolic_intensive_care'>,
<AxesSubplot:xlabel='Potassium', ylabel='Diastolic_intensive_care'>,
<AxesSubplot:xlabel='Sodium', ylabel='Diastolic_intensive_care'>,
<AxesSubplot:xlabel='AlAT', ylabel='Diastolic_intensive_care'>],
[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Potassium'>,
<AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Potassium'>,
<AxesSubplot:xlabel='Potassium', ylabel='Potassium'>,
<AxesSubplot:xlabel='Sodium', ylabel='Potassium'>,
<AxesSubplot:xlabel='AlAT', ylabel='Potassium'>],
[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Sodium'>,
<AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Sodium'>,
<AxesSubplot:xlabel='Potassium', ylabel='Sodium'>,
<AxesSubplot:xlabel='Sodium', ylabel='Sodium'>,
<AxesSubplot:xlabel='AlAT', ylabel='Sodium'>],
[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='AlAT'>,
<AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='AlAT'>,
<AxesSubplot:xlabel='Potassium', ylabel='AlAT'>,
<AxesSubplot:xlabel='Sodium', ylabel='AlAT'>,
<AxesSubplot:xlabel='AlAT', ylabel='AlAT'>]], dtype=object)
Since the dataset is unbalanced with the Outcome variable having less 1's and more 0's, accuracy will not be a good predictor.
Recall oriented scoring such as F1 or ROC AUC will be needed.
ROC AUC scoring will be used for model evaluation.
X = train.drop('Outcome', axis = 1)
y = train['Outcome']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X_train_org, X_val_org, y_train, y_val = train_test_split(X,y, random_state = 0)
#Scaling the dataset using MinMax Scaler
scaler = MinMaxScaler()
colnames = X.columns
X_train = scaler.fit_transform(X_train_org)
X_val = scaler.transform(X_val_org)
test_data = scaler.transform(test)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = { 'n_neighbors' : range(1,20) }
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv = 5, scoring = 'roc_auc').fit(X_train,y_train)
print('Cross validation score : ',grid_search.best_score_)
print('Best parameters : ',grid_search.best_params_)
Cross validation score : 0.6725050403225807
Best parameters : {'n_neighbors': 14}
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid1 = { 'C' : [0.01,0.1,1,10,100],
'max_iter' : [100,250,500]}
logit = LogisticRegression(solver = 'lbfgs',random_state = 0)
grid_log = GridSearchCV(logit, param_grid1, cv = 5, scoring = 'roc_auc', n_jobs = -1).fit(X_train,y_train)
print('Best parameters',grid_log.best_params_)
print('Best ROC_AUC score', grid_log.best_score_)
Best parameters {'C': 1, 'max_iter': 100}
Best ROC_AUC score 0.7442716733870969
from sklearn.svm import SVC
param_grid3 = {'C': [0.01, 0.1, 1, 10, 100],
'gamma': [0.01, 0.1, 1, 10, 100]}
rbf = SVC(kernel = 'rbf', random_state = 0)
grid_rbf = GridSearchCV(rbf, param_grid3,cv =5, n_jobs = -1, scoring = 'roc_auc').fit(X_train,y_train)
print('Best parameters:', grid_rbf.best_params_)
print('Best AUC ROC score : ', grid_rbf.best_score_)
Best parameters: {'C': 1, 'gamma': 0.1}
Best AUC ROC score : 0.7363193044354839
from sklearn.tree import DecisionTreeClassifier
param_grid4 = {'max_depth': [2,3,4,5,6,7,8,9,10,11] }
tree = DecisionTreeClassifier(random_state = 0)
grid_tree = GridSearchCV(tree, param_grid4,cv =5, n_jobs = -1, scoring = 'roc_auc').fit(X_train,y_train)
print('Best parameters:', grid_tree.best_params_)
print('Best AUC ROC score : ', grid_tree.best_score_)
Best parameters: {'max_depth': 5}
Best AUC ROC score : 0.696522177419355
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
param_grid = {'max_samples':[0.01, 0.05, 0.1,0.5,1],
'max_features':[0.01, 0.05, 0.1,0.5,1],
'bootstrap' : [True, False]}
log = LogisticRegression(C = 1, max_iter = 100,penalty = 'l2', solver = 'lbfgs', random_state = 0)
bg = BaggingClassifier(log, random_state = 0)
grid_bg = GridSearchCV(bg, param_grid = param_grid, cv = 5,scoring='roc_auc', n_jobs = -1).fit(X_train,y_train)
print('Best parameters:', grid_bg.best_params_)
print('Best AUC ROC score : ', grid_bg.best_score_)
Best parameters: {'bootstrap': False, 'max_features': 0.1, 'max_samples': 0.5}
Best AUC ROC score : 0.7519468245967742
from sklearn.ensemble import GradientBoostingClassifier
param_grid2 = {'learning_rate':[0.001, 0.01, 0.1,0.5],
'n_estimators':[100,200, 500, 1000],
'max_depth' : [2,3,4,5]}
gbrt = GradientBoostingClassifier(random_state=0)
grid_gb = GridSearchCV(gbrt, param_grid2 ,cv = 5, scoring = 'roc_auc', n_jobs = -1).fit(X_train,y_train)
print('Best parameters:', grid_gb.best_params_)
print('Best AUC ROC score : ', grid_gb.best_score_)
Best parameters: {'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 500}
Best AUC ROC score : 0.7859929435483871
Out of the above models, Gradient Boosting Classifier produces the best cross validation ROC AUC score of 0.786 and is hence the best model.
gbrt = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2,
n_estimators = 500,random_state=0)
gbrt.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=0.01, max_depth=2, n_estimators=500,
random_state=0)
from sklearn.metrics import roc_auc_score
print('Train ROC AUC score : ', roc_auc_score(y_train, gbrt.predict(X_train)))
print('Validation ROC AUC score : ', roc_auc_score(y_val, gbrt.predict(X_val)))
Train ROC AUC score : 0.6147596153846154
Validation ROC AUC score : 0.6577097505668934
Train data ROC AUC curve
from sklearn import metrics
y_train_pred = gbrt.predict_proba(X_train)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_train, y_train_pred,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label='Gradient Boosting Classification (area = %0.4f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
close_zero = np.argmin(np.abs(threshold))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
label="threshold default", fillstyle="none", c='k', mew=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
Confusion Matrix -Train data
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, gbrt.predict(X_train))
array([[799, 1],
[120, 36]])
Test data ROC AUC curve
from sklearn import metrics
y_val_pred = gbrt.predict_proba(X_val)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_val, y_val_pred,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label='Gradient Boosting Classification (area = %0.4f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
close_zero = np.argmin(np.abs(threshold))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
label="threshold default", fillstyle="none", c='k', mew=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
Confusion Matrix -Validation data
confusion_matrix(y_val, gbrt.predict(X_val))
array([[267, 3],
[ 33, 16]])
From the confusion matrix and ROC AUC curve with the default threshold, we see that the model is still unable to correctly predict the positives well, although the false positives is very low.
Hence, fine tuning to move the threshold is needed to improve the model's performance.
Since we are dealing with predicting patients with Acute myocardial infarction to reduce the mortality rate, it is more important to correctly predict the positive patients than misclassifying the negative patients.
Hence, On observing the graphs and to keep the model more general, the True Positive rate is aimed at approx. around 0.7
y_train_prob = gbrt.predict_proba(X_train)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_train, y_train_prob)
revised_threshold = threshold[np.argmin(np.abs(tpr - 0.7))]
y_train_pred = np.where(y_train_prob < revised_threshold,0,1)
confusion_matrix(y_train, y_train_pred)
array([[705, 95],
[ 47, 109]])
print('Train ROC AUC score :' , roc_auc_score(y_train, y_train_pred))
Train ROC AUC score : 0.7899839743589744
from sklearn.metrics import classification_report
print(classification_report(y_train, y_train_pred, target_names=["0", "1"]))
precision recall f1-score support
0 0.94 0.88 0.91 800
1 0.53 0.70 0.61 156
accuracy 0.85 956
macro avg 0.74 0.79 0.76 956
weighted avg 0.87 0.85 0.86 956
y_val_prob = gbrt.predict_proba(X_val)[:,1]
y_val_pred = np.where(y_val_prob < revised_threshold, 0,1)
confusion_matrix(y_val, y_val_pred)
array([[241, 29],
[ 17, 32]])
print('Validation ROC AUC score :',roc_auc_score(y_val, y_val_pred))
Validation ROC AUC score : 0.7728269085411942
The model has significantly improved predictions and is thus a good model. The train and validation ROC scores are also almost similiar and thus the model looks more generic.
test_pred = gbrt.predict_proba(test_data)[:,1]
final_test_prediction = np.where(test_pred < revised_threshold, 0,1)
np.array(np.unique(final_test_prediction, return_counts=True))
array([[ 0, 1],
[335, 90]])
After verifying the test dataset values, ROC AUC score of the test dataset from the above code is 0.751 This indicates that the model is able to generalize well in the unseen dataset as well and is a good model for predicting patients with Mycardial Infraction.