Heart Failure Case Study
Preparing for machine learning
In the previous case, we applied linear regression to a randomly generated dataset. We can apply the same principles to data coming from experiments or studies, as long as the data is nicely structured in a 2D array format. Unfortunately, data from real-life cases is often not nicely structured, and we need to manipulate unstructured and/or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis or because they contain too many missing values. Maybe we need to relabel columns, convert character values into numerical ones, or combine data from several sources. Cleaning and manipulating data into a structured form is called data processing: it starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and utilized by users.
In previous courses, you learned the basics of programming in Python and object-oriented Python. In this course, we use Python together with the libraries NumPy and Pandas, which are high-performance libraries especially suited to data manipulation and computation.
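Both libraries are conventionally imported under short aliases, which we will use throughout this case:

```python
import numpy as np
import pandas as pd
```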
Cardiovascular diseases kill approximately 17 million people globally every year, mainly through myocardial infarctions and heart failure. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistical analyses aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning can predict patients' survival from their data and can identify the most important features among those included in their medical records [1]. As a data scientist, you are required to inspect whether the data can be used for modelling and to select the most important features for predicting a patient's survival. Data for the analysis is provided with this case; the data description can be found in the table below.
[1] Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).
The first step is inspecting the data and getting an idea of the meaning, format, and units of the variables.
|    | Feature                  | Explanation                                                | Measurement      |
|----|--------------------------|------------------------------------------------------------|------------------|
| 0  | Age                      | Age of patient                                             | years            |
| 1  | Anaemia                  | Decrease of red blood cells or hemoglobin                  | Boolean          |
| 2  | High blood pressure      | If a patient has hypertension                              | Boolean          |
| 3  | Creatinine phosphokinase | Level of the CPK enzyme in the blood                       | mcg/L            |
| 4  | Diabetes                 | If the patient has diabetes                                | Boolean          |
| 5  | Ejection fraction        | Percentage of blood leaving the heart at each contraction  | Percentage       |
| 6  | Sex                      | Woman or Man                                               | Binary           |
| 7  | Platelets                | Platelets in the blood                                     | kiloplatelets/mL |
| 8  | Serum creatinine         | Level of creatinine in the blood                           | mg/dL            |
| 9  | Serum sodium             | Level of sodium in the blood                               | mEq/L            |
| 10 | Smoking                  | If the patient smokes                                      | Boolean          |
| 11 | Time                     | Follow-up period                                           | Days             |
| 12 | Death event              | If the patient died during the follow-up period            | Boolean          |
The death event will be used to predict survival and will be the class variable. The variable death event is a boolean: if death event is 1 (True), the patient died during the follow-up period; if it is 0 (False), the patient survived.
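We first load the clinical records into a Pandas dataframe and preview the first rows. A minimal sketch, assuming the records are stored in a CSV file (the filename below is illustrative):

```python
# Load the clinical records into a dataframe (illustrative filename)
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

# Preview the first five rows
df.head()
```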
|   | age  | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|------|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-----|---------|------|-------------|
| 0 | 75.0 | 0       | 582                      | 0        | 20                | 1                   | 265000.00 | 1.9              | 130          | 1.0 | 0.0     | 4    | 1           |
| 1 | 55.0 | 0       | 7861                     | 0        | 38                | 0                   | 263358.03 | 1.1              | 136          | 1.0 | NaN     | 6    | 1           |
| 2 | 65.0 | 0       | 146                      | 0        | 20                | 0                   | 162000.00 | 1.3              | 129          | 1.0 | 1.0     | 7    | 1           |
| 3 | 50.0 | 1       | 111                      | 0        | 20                | 0                   | 210000.00 | 1.9              | 137          | 1.0 | 0.0     | 7    | 1           |
| 4 | 65.0 | 1       | 160                      | 1        | 20                | 0                   | 327000.00 | 2.7              | 116          | 0.0 | 0.0     | 8    | 1           |
Mind you, the column names in the metadata are slightly different from those in the clinical records, and the columns appear in a different order. We must take that into account if we want to use the metadata to select a subset of the clinical records.
Looking at the dataframe values, we also see a NaN in the smoking column, which means the dataset contains missing data. Let us inspect the missing data.
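A minimal sketch of counting the missing values per column, assuming the dataframe is named df:

```python
# Count the number of missing (NaN) values in each column
df.isna().sum()
```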
The columns sex and smoking do have missing values. When a column has a lot of missing data, we can consider dropping it from the dataframe. In this case, only a few values are missing, so we can either fill them with a guessed (imputed) value or drop the affected rows.
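Here we drop the rows containing missing values; a minimal sketch:

```python
# Remove every row that contains at least one missing value
df = df.dropna()
df.head()
```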
|   | age  | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|------|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-----|---------|------|-------------|
| 0 | 75.0 | 0       | 582                      | 0        | 20                | 1                   | 265000.0  | 1.9              | 130          | 1.0 | 0.0     | 4    | 1           |
| 2 | 65.0 | 0       | 146                      | 0        | 20                | 0                   | 162000.0  | 1.3              | 129          | 1.0 | 1.0     | 7    | 1           |
| 3 | 50.0 | 1       | 111                      | 0        | 20                | 0                   | 210000.0  | 1.9              | 137          | 1.0 | 0.0     | 7    | 1           |
| 4 | 65.0 | 1       | 160                      | 1        | 20                | 0                   | 327000.0  | 2.7              | 116          | 0.0 | 0.0     | 8    | 1           |
| 5 | 90.0 | 1       | 47                       | 0        | 40                | 1                   | 204000.0  | 2.1              | 132          | 1.0 | 1.0     | 8    | 1           |
Furthermore, we can see that all the binary and boolean yes/no data is displayed as either a zero or a one, which might be unclear when plotting the data. In the metadata, the sex feature is described as "Woman or Man", so we might want to relabel that column accordingly.
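A sketch of relabelling the sex column as a categorical variable, assuming, as the tables above suggest, that 1 codes for a man and 0 for a woman:

```python
# Map the numerical codes to labels and store the column as a category
df['sex'] = df['sex'].map({1.0: 'Man', 0.0: 'Woman'}).astype('category')
```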
We changed the sex column to category, but what datatypes are the other columns?
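The dtypes attribute shows the datatype of every column:

```python
# Inspect the datatype of each column
df.dtypes
```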
We know that some of the integer columns should be booleans (logical). Let's change that.
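A sketch of the conversion, taking the yes/no features from the metadata table:

```python
# Convert the 0/1 columns that represent yes/no features to booleans
bool_cols = ['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']
df[bool_cols] = df[bool_cols].astype(bool)
df.head()
```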
|   | age  | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex   | smoking | time | DEATH_EVENT |
|---|------|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-------|---------|------|-------------|
| 0 | 75.0 | False   | 582                      | False    | 20                | True                | 265000.00 | 1.9              | 130          | Man   | False   | 4    | True        |
| 1 | 55.0 | False   | 7861                     | False    | 38                | False               | 263358.03 | 1.1              | 136          | Man   | True    | 6    | True        |
| 2 | 65.0 | False   | 146                      | False    | 20                | False               | 162000.00 | 1.3              | 129          | Man   | True    | 7    | True        |
| 3 | 50.0 | True    | 111                      | False    | 20                | False               | 210000.00 | 1.9              | 137          | Man   | False   | 7    | True        |
| 4 | 65.0 | True    | 160                      | True     | 20                | False               | 327000.00 | 2.7              | 116          | Woman | False   | 8    | True        |
It is useful to understand the range of the data. A method that displays descriptive statistics for the numerical data is describe.
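Applied to our dataframe:

```python
# Summary statistics (count, mean, std, quartiles, extremes) of the numerical columns
df.describe()
```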
|       | age        | creatinine_phosphokinase | ejection_fraction | platelets     | serum_creatinine | serum_sodium | time       |
|-------|------------|--------------------------|-------------------|---------------|------------------|--------------|------------|
| count | 299.000000 | 299.000000               | 299.000000        | 299.000000    | 299.00000        | 299.000000   | 299.000000 |
| mean  | 60.833893  | 581.839465               | 38.083612         | 263358.029264 | 1.39388          | 136.625418   | 130.260870 |
| std   | 11.894809  | 970.287881               | 11.834841         | 97804.236869  | 1.03451          | 4.412477     | 77.614208  |
| min   | 40.000000  | 23.000000                | 14.000000         | 25100.000000  | 0.50000          | 113.000000   | 4.000000   |
| 25%   | 51.000000  | 116.500000               | 30.000000         | 212500.000000 | 0.90000          | 134.000000   | 73.000000  |
| 50%   | 60.000000  | 250.000000               | 38.000000         | 262000.000000 | 1.10000          | 137.000000   | 115.000000 |
| 75%   | 70.000000  | 582.000000               | 45.000000         | 303500.000000 | 1.40000          | 140.000000   | 203.000000 |
| max   | 95.000000  | 7861.000000              | 80.000000         | 850000.000000 | 9.40000          | 148.000000   | 285.000000 |
We can see that the data ranges differ per feature. If we want to use the data for prediction, we need to normalize the data later on; we can do that with NumPy. From the describe table we can also see that most of the data is not symmetrically distributed. Let us inspect the distributions by plotting.
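A minimal plotting sketch, assuming Matplotlib is installed:

```python
import matplotlib.pyplot as plt

# Draw a histogram for every numerical column to inspect the distributions
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()
```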
Based on the inspection and exploration of the data, it is decided to drop the column time; this feature will not be used for prediction. All the other variables will be used for further analysis. For computational convenience, int64 data is used instead of booleans and categories. Furthermore, the data needs to be transformed and normalized.
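A sketch of these preparation steps, assuming df still holds the cleaned records:

```python
# Drop the follow-up time; this feature will not be used for prediction
df = df.drop(columns='time')

# Convert the boolean columns back to int64 for computational convenience
bool_cols = ['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT']
df[bool_cols] = df[bool_cols].astype('int64')

# Map the sex category back to its numerical codes
df['sex'] = df['sex'].map({'Man': 1.0, 'Woman': 0.0})
df.head()
```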
|   | age  | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | DEATH_EVENT |
|---|------|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-----|---------|-------------|
| 0 | 75.0 | 0       | 582                      | 0        | 20                | 1                   | 265000.0  | 1.9              | 130          | 1.0 | 0.0     | 1           |
| 2 | 65.0 | 0       | 146                      | 0        | 20                | 0                   | 162000.0  | 1.3              | 129          | 1.0 | 1.0     | 1           |
| 3 | 50.0 | 1       | 111                      | 0        | 20                | 0                   | 210000.0  | 1.9              | 137          | 1.0 | 0.0     | 1           |
| 5 | 90.0 | 1       | 47                       | 0        | 40                | 1                   | 204000.0  | 2.1              | 132          | 1.0 | 1.0     | 1           |
| 6 | 75.0 | 1       | 246                      | 0        | 15                | 0                   | 127000.0  | 1.2              | 137          | 1.0 | 0.0     | 1           |
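Finally, a sketch of separating the class variable from the features and min-max normalizing the feature matrix with NumPy; the names X and y are illustrative:

```python
# Class variable vector: 1 if the patient died during the follow-up period
y = df['DEATH_EVENT'].to_numpy()

# Feature matrix: all remaining columns as floats
X = df.drop(columns='DEATH_EVENT').to_numpy(dtype=float)

# Min-max normalization: rescale every feature to the range [0, 1]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```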
We now have a cleaned normalized feature matrix and a class variable vector. We successfully prepared the dataset for machine learning algorithms in order to predict the heart failure death event.