Preprocessing

Mycardial Infraction prediction on patients

Dataset

This dataset is used to predict complications of Myocardial Infarction (MI) based on the information about the patient. The target value 0 is no complication and 1 means complication within the first three days of hospitalization.

MI is one of the most challenging problems of modern medicine. Acute myocardial infarction is associated with high mortality in the first year after it. The incidence of MI remains high in all countries. This is especially true for the urban population of highly developed countries, which is exposed to chronic stress factors, irregular and not always balanced nutrition. In the United States, for example, more than a million people suffer from MI every year, and 200-300 thousand of them die from acute MI before arriving at the hospital. In this regard, predicting complications of myocardial infarction in order to timely carry out the necessary preventive measures is an important task.

Age
Gender
Myocardial: Quantity of myocardial infarctions in the anamnesis – Ordinal
Exertional angina: Exertional angina pectoris in the anamnesis
FC: Functional class (FC) of angina pectoris in the last year – Ordinal
Heart Disease: Coronary heart disease (CHD) in recent weeks, days before admission to hospital
Heredity: Heredity on CHD
Hypertension: Presence of an essential hypertension
Symptomatic hypertension
Duration: Duration of arterial hypertension
Arrhythmia: Observing of arrhythmia in the anamnesis
Systolic_emergency: Systolic blood pressure according to Emergency Cardiology Team
Diastolic_emergency: Diastolic blood pressure according to Emergency Cardiology Team
Systolic_intensive_care: Systolic blood pressure according to intensive care unit
Diastolic_intensive_care: Diastolic blood pressure according to intensive care unit
Potassium: Serum potassium content
Sodium: Serum sodium content
AlAT: Serum AlAT content
AsTK: Serum AsTK content
WBC: White Blood Cell Count
ESR: Erythrocyte sedimentation rate
Time: Time elapsed from the beginning of the attack of CHD to the hospital
Outcome: target column

Preprocessing

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1275 non-null   object
 1   Gender                    1275 non-null   object
 2   myocardial                1275 non-null   object
 3   Exertional angina         1275 non-null   object
 4   FC                        1275 non-null   object
 5   Heart Disease             1275 non-null   object
 6   Heredity                  1275 non-null   object
 7   Hypertension              1275 non-null   object
 8   Symptomatic hypertension  1275 non-null   object
 9   Duration                  1275 non-null   object
 10  Arrhythmia                1275 non-null   object
 11  Systolic_emergency        1275 non-null   object
 12  Diastolic_emergency       1275 non-null   object
 13  Systolic_intensive_care   1275 non-null   object
 14  Diastolic_intensive_care  1275 non-null   object
 15  Potassium                 1275 non-null   object
 16  Sodium                    1275 non-null   object
 17  AlAT                      1275 non-null   object
 18  AsAT                      1275 non-null   object
 19  WBC                       1275 non-null   object
 20  ESR                       1275 non-null   object
 21  Time                      1275 non-null   object
 22  Outcome                   1275 non-null   int64 
dtypes: int64(1), object(22)
memory usage: 229.2+ KB

We can see that all the columns are of type object and hence we don't see any missing values here.

train.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age	Gender	myocardial	Exertional angina	FC	Heart Disease	Heredity	Hypertension	Symptomatic hypertension	Duration	...	Systolic_intensive_care	Diastolic_intensive_care	Potassium	Sodium	AlAT	AsAT	WBC	ESR	Time	Outcome
0	75	Female	0	Two years ago	II FC	Exertional angina	?	Stage 2	No	?	...	140	90	?	?	0.3	0.18	7.8	16	7	0
1	50	Male	1	Two years ago	II FC	Unstable angina	?	Stage 2	No	One year	...	?	?	3.9	132	0.23	0.52	6.2	20	7	0
2	54	Male	0	Never	No angina	No angina	?	No	No	No hypertension	...	140	100	?	?	?	?	6.9	6	?	0
3	51	Male	?	?	?	Unstable angina	?	?	?	?	...	0	0	?	?	?	?	?	?	2	1
4	76	Female	3	Never	No angina	Unstable angina	?	Stage 2	No	More than 10 years	...	110	70	?	?	0.15	0.26	4	5	7	0

5 rows × 23 columns

From observing the top 5 rows, we see that a lot of columns have '?' in it indicating missing values.

Replacing ? with NaN

Replacing '?' across the columns as NaN

for col in train.columns:
    train[col] = train[col].replace('?',np.nan)

Converting numerical columns to float

train['Age'] = train['Age'].astype(float)
train['myocardial'] = train['myocardial'].astype(float)

train[train.columns[11:22]] = train[train.columns[11:22]].astype(float)

train.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age	myocardial	Systolic_emergency	Diastolic_emergency	Systolic_intensive_care	Diastolic_intensive_care	Potassium	Sodium	AlAT	AsAT	WBC	ESR	Time	Outcome
count	1274.000000	1274.00000	474.000000	474.000000	1066.000000	1066.000000	993.000000	992.000000	1067.000000	1066.000000	1188.000000	1132.000000	1178.000000	1275.000000
mean	64.154631	0.56044	137.700422	82.004219	134.812383	83.076923	4.194361	136.607863	0.472671	0.262336	8.843519	13.475265	4.702037	0.160784
std	46.793076	0.83419	34.681988	19.997145	31.734114	18.631784	0.770440	6.598662	0.386188	0.206220	3.449176	10.796416	2.858370	0.367476
min	26.000000	0.00000	0.000000	0.000000	0.000000	0.000000	2.300000	117.000000	0.030000	0.040000	2.000000	1.000000	1.000000	0.000000
25%	54.000000	0.00000	120.000000	70.000000	120.000000	80.000000	3.700000	133.000000	0.230000	0.150000	6.400000	5.000000	2.000000	0.000000
50%	63.000000	0.00000	140.000000	80.000000	130.000000	80.000000	4.100000	136.000000	0.380000	0.220000	8.100000	10.000000	4.000000	0.000000
75%	70.000000	1.00000	160.000000	90.000000	150.000000	90.000000	4.600000	140.000000	0.610000	0.300000	10.500000	19.000000	7.000000	0.000000
max	999.000000	3.00000	260.000000	190.000000	260.000000	190.000000	8.000000	169.000000	3.000000	2.150000	27.900000	68.000000	9.000000	1.000000

Replacing 999 in the Age column as NaN

train['Age'] = train['Age'].replace(999,np.nan)

train.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age	myocardial	Systolic_emergency	Diastolic_emergency	Systolic_intensive_care	Diastolic_intensive_care	Potassium	Sodium	AlAT	AsAT	WBC	ESR	Time	Outcome
count	1271.000000	1274.00000	474.000000	474.000000	1066.000000	1066.000000	993.000000	992.000000	1067.000000	1066.000000	1188.000000	1132.000000	1178.000000	1275.000000
mean	61.948072	0.56044	137.700422	82.004219	134.812383	83.076923	4.194361	136.607863	0.472671	0.262336	8.843519	13.475265	4.702037	0.160784
std	11.201609	0.83419	34.681988	19.997145	31.734114	18.631784	0.770440	6.598662	0.386188	0.206220	3.449176	10.796416	2.858370	0.367476
min	26.000000	0.00000	0.000000	0.000000	0.000000	0.000000	2.300000	117.000000	0.030000	0.040000	2.000000	1.000000	1.000000	0.000000
25%	54.000000	0.00000	120.000000	70.000000	120.000000	80.000000	3.700000	133.000000	0.230000	0.150000	6.400000	5.000000	2.000000	0.000000
50%	63.000000	0.00000	140.000000	80.000000	130.000000	80.000000	4.100000	136.000000	0.380000	0.220000	8.100000	10.000000	4.000000	0.000000
75%	70.000000	1.00000	160.000000	90.000000	150.000000	90.000000	4.600000	140.000000	0.610000	0.300000	10.500000	19.000000	7.000000	0.000000
max	92.000000	3.00000	260.000000	190.000000	260.000000	190.000000	8.000000	169.000000	3.000000	2.150000	27.900000	68.000000	9.000000	1.000000

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       1271 non-null   float64
 1   Gender                    1275 non-null   object 
 2   myocardial                1274 non-null   float64
 3   Exertional angina         1202 non-null   object 
 4   FC                        1225 non-null   object 
 5   Heart Disease             1235 non-null   object 
 6   Heredity                  57 non-null     object 
 7   Hypertension              1271 non-null   object 
 8   Symptomatic hypertension  1272 non-null   object 
 9   Duration                  1085 non-null   object 
 10  Arrhythmia                1261 non-null   object 
 11  Systolic_emergency        474 non-null    float64
 12  Diastolic_emergency       474 non-null    float64
 13  Systolic_intensive_care   1066 non-null   float64
 14  Diastolic_intensive_care  1066 non-null   float64
 15  Potassium                 993 non-null    float64
 16  Sodium                    992 non-null    float64
 17  AlAT                      1067 non-null   float64
 18  AsAT                      1066 non-null   float64
 19  WBC                       1188 non-null   float64
 20  ESR                       1132 non-null   float64
 21  Time                      1178 non-null   float64
 22  Outcome                   1275 non-null   int64  
dtypes: float64(13), int64(1), object(9)
memory usage: 229.2+ KB

Test dataset

for col in test.columns:
    test[col] = test[col].replace('?',np.nan) 

#Converting numerical columns to float 

test['Age'] = test['Age'].astype(float)
test['myocardial'] = test['myocardial'].astype(float)

test[test.columns[11:22]] = test[test.columns[11:22]].astype(float)

test['Age'] = test['Age'].replace(999,np.nan)

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       421 non-null    float64
 1   Gender                    425 non-null    object 
 2   myocardial                422 non-null    float64
 3   Exertional angina         392 non-null    object 
 4   FC                        402 non-null    object 
 5   Heart Disease             414 non-null    object 
 6   Heredity                  15 non-null     object 
 7   Hypertension              420 non-null    object 
 8   Symptomatic hypertension  420 non-null    object 
 9   Duration                  367 non-null    object 
 10  Arrhythmia                418 non-null    object 
 11  Systolic_emergency        150 non-null    float64
 12  Diastolic_emergency       150 non-null    float64
 13  Systolic_intensive_care   367 non-null    float64
 14  Diastolic_intensive_care  367 non-null    float64
 15  Potassium                 336 non-null    float64
 16  Sodium                    333 non-null    float64
 17  AlAT                      349 non-null    float64
 18  AsAT                      349 non-null    float64
 19  WBC                       387 non-null    float64
 20  ESR                       365 non-null    float64
 21  Time                      396 non-null    float64
dtypes: float64(13), object(9)
memory usage: 73.2+ KB

Exploratory analysis

Observing the Missing values

train.isnull().any(axis= 'columns').sum()

train.isnull().sum()

Age                            4
Gender                         0
myocardial                     1
Exertional angina             73
FC                            50
Heart Disease                 40
Heredity                    1218
Hypertension                   4
Symptomatic hypertension       3
Duration                     190
Arrhythmia                    14
Systolic_emergency           801
Diastolic_emergency          801
Systolic_intensive_care      209
Diastolic_intensive_care     209
Potassium                    282
Sodium                       283
AlAT                         208
AsAT                         209
WBC                           87
ESR                          143
Time                          97
Outcome                        0
dtype: int64

Age

train.Age.hist()
plt.title('Age histogram')

Text(0.5, 1.0, 'Age histogram')

The Age histogram is approximately normally distributed with Mean ~ 62 and Median ~ 63 and will hence use Median imputation

train[['Age','Outcome']].groupby('Outcome').median()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age
Outcome
0	62.0
1	67.0

We see that mean Age for people tested to have MI complication is greater than people who have no MI complication.

train[train['Age'].isnull()]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age	Gender	myocardial	Exertional angina	FC	Heart Disease	Heredity	Hypertension	Symptomatic hypertension	Duration	...	Systolic_intensive_care	Diastolic_intensive_care	Potassium	Sodium	AlAT	AsAT	WBC	ESR	Time
248	NaN	Male	0.0	Two years ago	II FC	Unstable angina	NaN	Stage 2	No	More than 10 years	...	NaN	NaN	4.7	142.0	0.30	0.07	10.3	17.0	6.0
295	NaN	Male	0.0	Two years ago	II FC	Unstable angina	NaN	No	No	No hypertension	...	NaN	NaN	3.9	131.0	0.45	0.30	12.7	9.0	3.0
547	NaN	Female	2.0	More than five years ago	II FC	Exertional angina	NaN	Stage 2	No	Six to ten years	...	100.0	60.0	4.6	132.0	0.75	0.22	5.6	14.0	3.0
729	NaN	Male	0.0	Never	No angina	No angina	NaN	No	No	No hypertension	...	140.0	90.0	NaN	NaN	NaN	NaN	15.5	10.0	8.0

4 rows × 23 columns

train[['Age','Gender']].groupby('Gender').median()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Age
Gender
Female	68.0
Male	59.0

Median age in the dataset for Female is 68 and Male is 59.

Age imputation will be done maintaining this distribution

train['Age'] = train['Age'].fillna(train.groupby('Gender')['Age'].transform('median'))

test['Age'] = test['Age'].fillna(train.groupby('Gender')['Age'].transform('median'))

Gender

train.Gender.value_counts().plot(kind = 'bar', title = 'Gender counts')

<AxesSubplot:title={'center':'Gender counts'}>

We can see that there are more Male than Female in the dataset. Mapping Male to 1 and Female to 0

train['Gender']=train['Gender'].map({'Female':0,'Male':1}).astype(int)

test['Gender']=test['Gender'].map({'Female':0,'Male':1}).astype(int)

Myocardial - ordinal

train['myocardial'].value_counts().plot(kind = 'bar', title = 'Myocardial counts')

<AxesSubplot:title={'center':'Myocardial counts'}>

We see that 0 is the dominant value

Examining Heart Disease and Myocardial value to check if it has a variation.

train[['Heart Disease','myocardial']].groupby('Heart Disease').agg(pd.Series.mode)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	myocardial
Heart Disease
Exertional angina	0.0
No angina	0.0
Unstable angina	0.0

We don't see a difference and is hence not useful.

Imputing mode = 0 to the missing value

train['myocardial'] = train['myocardial'].fillna(0).astype(int)

test['myocardial'] = test['myocardial'] .fillna(0).astype(int)

Exertional Angina

train['Exertional angina'].value_counts()

Never                       504
More than five years ago    254
During the last year        105
One year ago                103
Two years ago                92
Four to five years ago       91
Three years ago              53
Name: Exertional angina, dtype: int64

train['Exertional angina'].value_counts().plot(kind = 'bar', title = 'Exertional Angina counts')

<AxesSubplot:title={'center':'Exertional Angina counts'}>

#mode imputation and one-hot vector encoding - train data
train['Exertional angina'].replace(np.NaN, 'Never', inplace = True)

cols = pd.get_dummies(train['Exertional angina'], prefix = 'Exertionalangina')
train[cols.columns] = cols
train.drop('Exertional angina', axis = 1, inplace = True)

#mode imputation and one-hot vector encoding - test data
test['Exertional angina'].replace(np.NaN, 'Never', inplace = True)

cols = pd.get_dummies(test['Exertional angina'],prefix = 'Exertionalangina')
test[cols.columns] = cols
test.drop('Exertional angina', axis = 1, inplace = True)

Functional Class - ordinal

train['FC'].value_counts()

II FC        639
No angina    503
III FC        38
I FC          37
IV FC          8
Name: FC, dtype: int64

train['FC'].value_counts().plot(kind = 'bar', title = 'Functional class count')

<AxesSubplot:title={'center':'Functional class count'}>

Replacing NaN with most frequent value and mapping the categorical value to numerical,thereby making it ordinal

train['FC'].replace(np.NaN, 'II FC', inplace = True)

train['FC'] = train['FC'].map({'No angina':0,
                              'I FC':1,
                              'II FC':2,
                              'III FC':3,
                              'IV FC':4 }).astype(int)

test['FC'].replace(np.NaN, 'II FC', inplace = True)

test['FC'] = test['FC'].map({'No angina':0,
                              'I FC':1,
                              'II FC':2,
                              'III FC':3,
                              'IV FC':4 }).astype(int)

Heart Disease

train['Heart Disease'].value_counts().plot(kind = 'bar', title = 'Heart Disease counts' )

<AxesSubplot:title={'center':'Heart Disease counts'}>

Replacing NaN with most common category and one-hot encoding the column

train['Heart Disease'].fillna('Unstable angina', inplace = True)

cols = pd.get_dummies(train['Heart Disease'], prefix = 'HeartDisease')
train[cols.columns] = cols
train.drop('Heart Disease', axis = 1, inplace = True)

test['Heart Disease'].fillna('Unstable angina', inplace = True)

cols = pd.get_dummies(test['Heart Disease'], prefix = 'HeartDisease')
test[cols.columns] = cols
test.drop('Heart Disease', axis = 1, inplace = True)

Hypertension

train['Hypertension'].unique()

array(['Stage 2', 'No', nan, 'Stage 3', 'Stage 1'], dtype=object)

train['Hypertension'].value_counts()

Stage 2    670
No         445
Stage 3    148
Stage 1      8
Name: Hypertension, dtype: int64

train['Hypertension'].fillna('Stage 2', inplace = True)

cols = pd.get_dummies(train['Hypertension'], prefix = 'Hypertension')
train[cols.columns] = cols
train.drop('Hypertension', axis = 1, inplace = True)

test['Hypertension'].fillna('Stage 2', inplace = True)

cols = pd.get_dummies(test['Hypertension'],prefix = 'Hypertension')
test[cols.columns] = cols
test.drop('Hypertension', axis = 1, inplace = True)

Symptomatic hypertension

train['Symptomatic hypertension'].unique()

array(['No', nan, 'Yes'], dtype=object)

train['Symptomatic hypertension'].value_counts().plot(kind = 'bar', title = 'Symptomatic Hypertension count')

<AxesSubplot:title={'center':'Symptomatic Hypertension count'}>

Most of the values are No. Replacing NA's with 'No' and encoding most frequent value 'No' as 1 and 'Yes' as 0

train['Symptomatic hypertension'].fillna('No', inplace = True)

train['Symptomatic hypertension'] = train['Symptomatic hypertension'].map({'No':1,'Yes':0})

test['Symptomatic hypertension'].fillna('No', inplace = True)

test['Symptomatic hypertension'] = test['Symptomatic hypertension'].map({'No':1,'Yes':0})

test['Symptomatic hypertension'].value_counts()

1    417
0      8
Name: Symptomatic hypertension, dtype: int64

Duration

train['Duration'].unique()

array([nan, 'One year', 'No hypertension', 'More than 10 years',
       'Six to ten years', 'Three years', 'Five years', 'Four years',
       'Two years'], dtype=object)

train['Duration'].value_counts().plot(kind = 'bar', title = 'Duration counts')

<AxesSubplot:title={'center':'Duration counts'}>

Replacing NAs with 'No hypertension' and one-hot encoding the column

train['Duration'].fillna('No hypertension', inplace = True)

cols = pd.get_dummies(train['Duration'],prefix = 'Duration')
train[cols.columns] = cols
train.drop('Duration', axis = 1, inplace = True)

test['Duration'].fillna('No hypertension', inplace = True)

cols = pd.get_dummies(test['Duration'],prefix = 'Duration')
test[cols.columns] = cols
test.drop('Duration', axis = 1, inplace = True)

Arrhythmia

train['Arrhythmia'].unique()

array(['No', nan, 'Yes'], dtype=object)

train['Arrhythmia'].value_counts().plot(kind = 'bar', title = 'Arrhythmia counts')

<AxesSubplot:title={'center':'Arrhythmia counts'}>

Replacing NAs with 'No' and mapping 'No' to 1 and 'Yes' to 0

train['Arrhythmia'].fillna('No', inplace = True)

train['Arrhythmia']= train['Arrhythmia'].map({'No':1,'Yes':0})

test['Arrhythmia'].fillna('No', inplace = True)

test['Arrhythmia']= test['Arrhythmia'].map({'No':1,'Yes':0})

Systolic intensive care

train['Systolic_intensive_care'].describe()

count    1066.000000
mean      134.812383
std        31.734114
min         0.000000
25%       120.000000
50%       130.000000
75%       150.000000
max       260.000000
Name: Systolic_intensive_care, dtype: float64

train['Systolic_intensive_care'].hist()

<AxesSubplot:>

Replacing NAs with Median

train['Systolic_intensive_care'].fillna(train['Systolic_intensive_care'].median(), inplace = True)

test['Systolic_intensive_care'].fillna(train['Systolic_intensive_care'].median(), inplace = True)

Diastolic intensive care

train['Diastolic_intensive_care'].describe()

count    1066.000000
mean       83.076923
std        18.631784
min         0.000000
25%        80.000000
50%        80.000000
75%        90.000000
max       190.000000
Name: Diastolic_intensive_care, dtype: float64

train['Diastolic_intensive_care'].hist()

<AxesSubplot:>

Replacing NAs with median

train['Diastolic_intensive_care'].fillna(train['Diastolic_intensive_care'].median(), inplace = True)

test['Diastolic_intensive_care'].fillna(train['Diastolic_intensive_care'].median(), inplace = True)

Potassium

train['Potassium'].describe()

count    993.000000
mean       4.194361
std        0.770440
min        2.300000
25%        3.700000
50%        4.100000
75%        4.600000
max        8.000000
Name: Potassium, dtype: float64

train['Potassium'].hist()

<AxesSubplot:>

train['Potassium'].fillna(train['Potassium'].median(), inplace = True)

test['Potassium'].fillna(train['Potassium'].median(), inplace = True)

Sodium

train['Sodium'].describe()

count    992.000000
mean     136.607863
std        6.598662
min      117.000000
25%      133.000000
50%      136.000000
75%      140.000000
max      169.000000
Name: Sodium, dtype: float64

train['Sodium'].hist()

<AxesSubplot:>

train['Sodium'].fillna(train['Sodium'].median(), inplace = True)

test['Sodium'].fillna(train['Sodium'].median(), inplace = True)

AlAT

train['AlAT'].describe()

count    1067.000000
mean        0.472671
std         0.386188
min         0.030000
25%         0.230000
50%         0.380000
75%         0.610000
max         3.000000
Name: AlAT, dtype: float64

train['AlAT'].hist()

<AxesSubplot:>

train['AlAT'].fillna(train['AlAT'].median(), inplace = True)

test['AlAT'].fillna(train['AlAT'].median(), inplace = True)

AsTK

train['AsAT'].describe()

count    1066.000000
mean        0.262336
std         0.206220
min         0.040000
25%         0.150000
50%         0.220000
75%         0.300000
max         2.150000
Name: AsAT, dtype: float64

train['AsAT'].hist()

<AxesSubplot:>

train['AsAT'].fillna(train['AsAT'].median(), inplace = True)

test['AsAT'].fillna(train['AsAT'].median(), inplace = True)

WBC

train['WBC'].describe()

count    1188.000000
mean        8.843519
std         3.449176
min         2.000000
25%         6.400000
50%         8.100000
75%        10.500000
max        27.900000
Name: WBC, dtype: float64

train['WBC'].hist()

<AxesSubplot:>

train['WBC'].fillna(train['WBC'].median(), inplace = True)

test['WBC'].fillna(train['WBC'].median(), inplace = True)

ESR

train['ESR'].describe()

count    1132.000000
mean       13.475265
std        10.796416
min         1.000000
25%         5.000000
50%        10.000000
75%        19.000000
max        68.000000
Name: ESR, dtype: float64

train['ESR'].hist()

<AxesSubplot:>

train['ESR'].fillna(train['ESR'].median(), inplace = True)

test['ESR'].fillna(train['ESR'].median(), inplace = True)

Time

train['Time'].describe()

count    1178.000000
mean        4.702037
std         2.858370
min         1.000000
25%         2.000000
50%         4.000000
75%         7.000000
max         9.000000
Name: Time, dtype: float64

train['Time'].hist()

<AxesSubplot:>

train['Time'].fillna(train['Time'].median(), inplace = True)

test['Time'].fillna(train['Time'].median(), inplace = True)

Outcome

train['Outcome'].value_counts().plot(kind = 'bar', title = 'Outcome Counts')

<AxesSubplot:title={'center':'Outcome Counts'}>

We see that the value 0 is dominant, i.e., there are more 0's than 1's and the data is an unbalanced dataset

Dropping other columns having more than 60% missing values

Heredity

train.drop('Heredity', axis = 1, inplace = True)

test.drop('Heredity', axis = 1, inplace = True)

Systolic emergency

train.drop('Systolic_emergency', axis = 1, inplace = True)

test.drop('Systolic_emergency', axis = 1, inplace = True)

Diastolic emergency

train.drop('Diastolic_emergency', axis = 1, inplace = True)

test.drop('Diastolic_emergency', axis = 1, inplace = True)

Final test dataset with no missing values

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 37 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Age                                        425 non-null    float64
 1   Gender                                     425 non-null    int64  
 2   myocardial                                 425 non-null    int64  
 3   FC                                         425 non-null    int64  
 4   Symptomatic hypertension                   425 non-null    int64  
 5   Arrhythmia                                 425 non-null    int64  
 6   Systolic_intensive_care                    425 non-null    float64
 7   Diastolic_intensive_care                   425 non-null    float64
 8   Potassium                                  425 non-null    float64
 9   Sodium                                     425 non-null    float64
 10  AlAT                                       425 non-null    float64
 11  AsAT                                       425 non-null    float64
 12  WBC                                        425 non-null    float64
 13  ESR                                        425 non-null    float64
 14  Time                                       425 non-null    float64
 15  Exertionalangina_During the last year      425 non-null    uint8  
 16  Exertionalangina_Four to five years ago    425 non-null    uint8  
 17  Exertionalangina_More than five years ago  425 non-null    uint8  
 18  Exertionalangina_Never                     425 non-null    uint8  
 19  Exertionalangina_One year ago              425 non-null    uint8  
 20  Exertionalangina_Three years ago           425 non-null    uint8  
 21  Exertionalangina_Two years ago             425 non-null    uint8  
 22  HeartDisease_Exertional angina             425 non-null    uint8  
 23  HeartDisease_No angina                     425 non-null    uint8  
 24  HeartDisease_Unstable angina               425 non-null    uint8  
 25  Hypertension_No                            425 non-null    uint8  
 26  Hypertension_Stage 1                       425 non-null    uint8  
 27  Hypertension_Stage 2                       425 non-null    uint8  
 28  Hypertension_Stage 3                       425 non-null    uint8  
 29  Duration_Five years                        425 non-null    uint8  
 30  Duration_Four years                        425 non-null    uint8  
 31  Duration_More than 10 years                425 non-null    uint8  
 32  Duration_No hypertension                   425 non-null    uint8  
 33  Duration_One year                          425 non-null    uint8  
 34  Duration_Six to ten years                  425 non-null    uint8  
 35  Duration_Three years                       425 non-null    uint8  
 36  Duration_Two years                         425 non-null    uint8  
dtypes: float64(10), int64(5), uint8(22)
memory usage: 59.1 KB

Final train dataset with no missing values

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1275 entries, 0 to 1274
Data columns (total 38 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Age                                        1275 non-null   float64
 1   Gender                                     1275 non-null   int64  
 2   myocardial                                 1275 non-null   int64  
 3   FC                                         1275 non-null   int64  
 4   Symptomatic hypertension                   1275 non-null   int64  
 5   Arrhythmia                                 1275 non-null   int64  
 6   Systolic_intensive_care                    1275 non-null   float64
 7   Diastolic_intensive_care                   1275 non-null   float64
 8   Potassium                                  1275 non-null   float64
 9   Sodium                                     1275 non-null   float64
 10  AlAT                                       1275 non-null   float64
 11  AsAT                                       1275 non-null   float64
 12  WBC                                        1275 non-null   float64
 13  ESR                                        1275 non-null   float64
 14  Time                                       1275 non-null   float64
 15  Outcome                                    1275 non-null   int64  
 16  Exertionalangina_During the last year      1275 non-null   uint8  
 17  Exertionalangina_Four to five years ago    1275 non-null   uint8  
 18  Exertionalangina_More than five years ago  1275 non-null   uint8  
 19  Exertionalangina_Never                     1275 non-null   uint8  
 20  Exertionalangina_One year ago              1275 non-null   uint8  
 21  Exertionalangina_Three years ago           1275 non-null   uint8  
 22  Exertionalangina_Two years ago             1275 non-null   uint8  
 23  HeartDisease_Exertional angina             1275 non-null   uint8  
 24  HeartDisease_No angina                     1275 non-null   uint8  
 25  HeartDisease_Unstable angina               1275 non-null   uint8  
 26  Hypertension_No                            1275 non-null   uint8  
 27  Hypertension_Stage 1                       1275 non-null   uint8  
 28  Hypertension_Stage 2                       1275 non-null   uint8  
 29  Hypertension_Stage 3                       1275 non-null   uint8  
 30  Duration_Five years                        1275 non-null   uint8  
 31  Duration_Four years                        1275 non-null   uint8  
 32  Duration_More than 10 years                1275 non-null   uint8  
 33  Duration_No hypertension                   1275 non-null   uint8  
 34  Duration_One year                          1275 non-null   uint8  
 35  Duration_Six to ten years                  1275 non-null   uint8  
 36  Duration_Three years                       1275 non-null   uint8  
 37  Duration_Two years                         1275 non-null   uint8  
dtypes: float64(10), int64(6), uint8(22)
memory usage: 186.9 KB

#Visualizing few columns

from pandas.plotting import scatter_matrix
X = train.drop('Outcome', axis = 1)
y = train['Outcome']

attributes = X.columns[6:11]
scatter_matrix(X[attributes], figsize = (15,15), c = y, alpha = 0.8, marker = 'O')

array([[<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Systolic_intensive_care'>,
        <AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Systolic_intensive_care'>,
        <AxesSubplot:xlabel='Potassium', ylabel='Systolic_intensive_care'>,
        <AxesSubplot:xlabel='Sodium', ylabel='Systolic_intensive_care'>,
        <AxesSubplot:xlabel='AlAT', ylabel='Systolic_intensive_care'>],
       [<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Diastolic_intensive_care'>,
        <AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Diastolic_intensive_care'>,
        <AxesSubplot:xlabel='Potassium', ylabel='Diastolic_intensive_care'>,
        <AxesSubplot:xlabel='Sodium', ylabel='Diastolic_intensive_care'>,
        <AxesSubplot:xlabel='AlAT', ylabel='Diastolic_intensive_care'>],
       [<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Potassium'>,
        <AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Potassium'>,
        <AxesSubplot:xlabel='Potassium', ylabel='Potassium'>,
        <AxesSubplot:xlabel='Sodium', ylabel='Potassium'>,
        <AxesSubplot:xlabel='AlAT', ylabel='Potassium'>],
       [<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='Sodium'>,
        <AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='Sodium'>,
        <AxesSubplot:xlabel='Potassium', ylabel='Sodium'>,
        <AxesSubplot:xlabel='Sodium', ylabel='Sodium'>,
        <AxesSubplot:xlabel='AlAT', ylabel='Sodium'>],
       [<AxesSubplot:xlabel='Systolic_intensive_care', ylabel='AlAT'>,
        <AxesSubplot:xlabel='Diastolic_intensive_care', ylabel='AlAT'>,
        <AxesSubplot:xlabel='Potassium', ylabel='AlAT'>,
        <AxesSubplot:xlabel='Sodium', ylabel='AlAT'>,
        <AxesSubplot:xlabel='AlAT', ylabel='AlAT'>]], dtype=object)

Machine learning models

Since the dataset is unbalanced with the Outcome variable having less 1's and more 0's, accuracy will not be a good predictor.

Recall oriented scoring such as F1 or ROC AUC will be needed.

ROC AUC scoring will be used for model evaluation.

X = train.drop('Outcome', axis = 1)
y = train['Outcome']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train_org, X_val_org, y_train, y_val = train_test_split(X,y, random_state = 0)

#Scaling the dataset using MinMax Scaler

scaler = MinMaxScaler()
colnames = X.columns
X_train = scaler.fit_transform(X_train_org)
X_val = scaler.transform(X_val_org)
test_data = scaler.transform(test)

KNN Classifier

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


param_grid = { 'n_neighbors' : range(1,20) }

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv = 5, scoring = 'roc_auc').fit(X_train,y_train)

print('Cross validation score : ',grid_search.best_score_)
print('Best parameters : ',grid_search.best_params_)

Cross validation score :  0.6725050403225807
Best parameters :  {'n_neighbors': 14}

Logistic

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid1 = { 'C' : [0.01,0.1,1,10,100],
              'max_iter' : [100,250,500]}

logit = LogisticRegression(solver = 'lbfgs',random_state = 0)
grid_log = GridSearchCV(logit, param_grid1, cv = 5, scoring = 'roc_auc', n_jobs = -1).fit(X_train,y_train)

print('Best parameters',grid_log.best_params_)
print('Best ROC_AUC score', grid_log.best_score_)

Best parameters {'C': 1, 'max_iter': 100}
Best ROC_AUC score 0.7442716733870969

Support Vector Classifier

from sklearn.svm import SVC

param_grid3 = {'C': [0.01, 0.1, 1, 10, 100],
              'gamma': [0.01, 0.1, 1, 10, 100]}

rbf = SVC(kernel = 'rbf', random_state = 0)

grid_rbf = GridSearchCV(rbf, param_grid3,cv =5, n_jobs = -1, scoring = 'roc_auc').fit(X_train,y_train)

print('Best parameters:', grid_rbf.best_params_)
print('Best AUC ROC score : ', grid_rbf.best_score_)

Best parameters: {'C': 1, 'gamma': 0.1}
Best AUC ROC score :  0.7363193044354839

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

param_grid4 = {'max_depth': [2,3,4,5,6,7,8,9,10,11] }

tree = DecisionTreeClassifier(random_state = 0)

grid_tree = GridSearchCV(tree, param_grid4,cv =5, n_jobs = -1, scoring = 'roc_auc').fit(X_train,y_train)

print('Best parameters:', grid_tree.best_params_)
print('Best AUC ROC score : ', grid_tree.best_score_)

Best parameters: {'max_depth': 5}
Best AUC ROC score :  0.696522177419355

Bagging Classifier

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

param_grid = {'max_samples':[0.01, 0.05, 0.1,0.5,1], 
              'max_features':[0.01, 0.05, 0.1,0.5,1],
             'bootstrap' : [True, False]}

log = LogisticRegression(C = 1, max_iter = 100,penalty = 'l2', solver = 'lbfgs', random_state = 0)
bg = BaggingClassifier(log, random_state = 0)
grid_bg = GridSearchCV(bg, param_grid = param_grid, cv = 5,scoring='roc_auc', n_jobs = -1).fit(X_train,y_train)

print('Best parameters:', grid_bg.best_params_)
print('Best AUC ROC score : ', grid_bg.best_score_)

Best parameters: {'bootstrap': False, 'max_features': 0.1, 'max_samples': 0.5}
Best AUC ROC score :  0.7519468245967742

Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

param_grid2 = {'learning_rate':[0.001, 0.01, 0.1,0.5],
              'n_estimators':[100,200, 500, 1000],
              'max_depth' : [2,3,4,5]}

gbrt = GradientBoostingClassifier(random_state=0)
grid_gb = GridSearchCV(gbrt, param_grid2 ,cv = 5, scoring = 'roc_auc', n_jobs = -1).fit(X_train,y_train)

print('Best parameters:', grid_gb.best_params_)
print('Best AUC ROC score : ', grid_gb.best_score_)

Best parameters: {'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 500}
Best AUC ROC score :  0.7859929435483871

Best model

Out of the above models, Gradient Boosting Classifier produces the best cross validation ROC AUC score of 0.786 and is hence the best model.

gbrt = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, 
                                  n_estimators = 500,random_state=0)
gbrt.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=0.01, max_depth=2, n_estimators=500,
                           random_state=0)

from sklearn.metrics import roc_auc_score
print('Train ROC AUC score : ', roc_auc_score(y_train, gbrt.predict(X_train)))
print('Validation ROC AUC score : ', roc_auc_score(y_val, gbrt.predict(X_val)))

Train ROC AUC score :  0.6147596153846154
Validation ROC AUC score :  0.6577097505668934

ROC AUC visualization

Train data ROC AUC curve

from sklearn import metrics

y_train_pred = gbrt.predict_proba(X_train)[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_train, y_train_pred,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label='Gradient Boosting Classification (area = %0.4f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
close_zero = np.argmin(np.abs(threshold))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold default", fillstyle="none", c='k', mew=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

Confusion Matrix -Train data

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, gbrt.predict(X_train))

array([[799,   1],
       [120,  36]])

Test data ROC AUC curve

from sklearn import metrics

y_val_pred = gbrt.predict_proba(X_val)[:,1]

fpr, tpr, threshold = metrics.roc_curve(y_val, y_val_pred,pos_label=1)
roc_auc = metrics.auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label='Gradient Boosting Classification (area = %0.4f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
close_zero = np.argmin(np.abs(threshold))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold default", fillstyle="none", c='k', mew=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

Confusion Matrix -Validation data

confusion_matrix(y_val, gbrt.predict(X_val))

array([[267,   3],
       [ 33,  16]])

From the confusion matrix and ROC AUC curve with the default threshold, we see that the model is still unable to correctly predict the positives well, although the false positives is very low.

Hence, fine tuning to move the threshold is needed to improve the model's performance.

Post prediction Model tuning

Since we are dealing with predicting patients with Acute myocardial infarction to reduce the mortality rate, it is more important to correctly predict the positive patients than misclassifying the negative patients.

Hence, On observing the graphs and to keep the model more general, the True Positive rate is aimed at approx. around 0.7

y_train_prob = gbrt.predict_proba(X_train)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_train, y_train_prob)

revised_threshold = threshold[np.argmin(np.abs(tpr - 0.7))]

y_train_pred = np.where(y_train_prob < revised_threshold,0,1)

confusion_matrix(y_train, y_train_pred)

array([[705,  95],
       [ 47, 109]])

print('Train ROC AUC score :' , roc_auc_score(y_train, y_train_pred))

Train ROC AUC score : 0.7899839743589744

from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred, target_names=["0", "1"]))

              precision    recall  f1-score   support

           0       0.94      0.88      0.91       800
           1       0.53      0.70      0.61       156

    accuracy                           0.85       956
   macro avg       0.74      0.79      0.76       956
weighted avg       0.87      0.85      0.86       956

y_val_prob = gbrt.predict_proba(X_val)[:,1]

y_val_pred = np.where(y_val_prob < revised_threshold, 0,1)

confusion_matrix(y_val, y_val_pred)

array([[241,  29],
       [ 17,  32]])

print('Validation ROC AUC score :',roc_auc_score(y_val, y_val_pred))

Validation ROC AUC score : 0.7728269085411942

The model has significantly improved predictions and is thus a good model. The train and validation ROC scores are also almost similiar and thus the model looks more generic.

test_pred = gbrt.predict_proba(test_data)[:,1]
final_test_prediction = np.where(test_pred < revised_threshold, 0,1)

np.array(np.unique(final_test_prediction, return_counts=True))

array([[  0,   1],
       [335,  90]])

After verifying the test dataset values, ROC AUC score of the test dataset from the above code is 0.751 This indicates that the model is able to generalize well in the unseen dataset as well and is a good model for predicting patients with Mycardial Infraction.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.DS_Store		.DS_Store
.gitattributes		.gitattributes
MI prediction model.ipynb		MI prediction model.ipynb
README.md		README.md
output_100_1.png		output_100_1.png
output_108_1.png		output_108_1.png
output_114_1.png		output_114_1.png
output_120_1.png		output_120_1.png
output_125_1.png		output_125_1.png
output_130_1.png		output_130_1.png
output_135_1.png		output_135_1.png
output_140_1.png		output_140_1.png
output_145_1.png		output_145_1.png
output_150_1.png		output_150_1.png
output_154_1.png		output_154_1.png
output_169_1.png		output_169_1.png
output_200_0.png		output_200_0.png
output_204_0.png		output_204_0.png
output_29_1.png		output_29_1.png
output_39_1.png		output_39_1.png
output_44_1.png		output_44_1.png
output_54_1.png		output_54_1.png
output_61_1.png		output_61_1.png
output_68_1.png		output_68_1.png
output_83_1.png		output_83_1.png
output_92_1.png		output_92_1.png

Sushama-Rangarajan/MI-Prediction

Folders and files

Latest commit

History

Repository files navigation

Mycardial Infraction prediction on patients

Dataset

Preprocessing

Replacing ? with NaN

Replacing 999 in the Age column as NaN

Test dataset

Exploratory analysis

Age

Gender

Myocardial - ordinal

Exertional Angina

Functional Class - ordinal

Heart Disease

Hypertension

Symptomatic hypertension

Duration

Arrhythmia

Systolic intensive care

Diastolic intensive care

Potassium

Sodium

AlAT

AsTK

WBC

ESR

Time

Outcome

Dropping other columns having more than 60% missing values

Heredity

Systolic emergency

Diastolic emergency

Final test dataset with no missing values

Final train dataset with no missing values

Machine learning models

KNN Classifier

Logistic

Support Vector Classifier

Decision Tree Classifier

Bagging Classifier

Gradient Boosting Classifier

Best model

Out of the above models, Gradient Boosting Classifier produces the best cross validation ROC AUC score of 0.786 and is hence the best model.

ROC AUC visualization

Post prediction Model tuning

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages