Heart failure case study

Preparing for machine learning

Data processing

In the Why We Love NumPy case, we applied linear regression to a randomly generated dataset. We can apply the same principles to data coming from experiments or studies, as long as the data is nicely structured in a 2D array format. Unfortunately, data from real-life cases is often not nicely structured. We need to manipulate the unstructured and/or messy data into a structured, clean form. We may need to drop rows or columns because they are not needed for the analysis, or because they contain too many missing values to be usable. Maybe we need to relabel columns, recode text values as numbers, or combine data from several sources. Cleaning and manipulating data into a structured form is called data processing: it starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and utilized by users.
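To make this concrete, here is a minimal sketch of such cleaning steps on a small made-up dataset (the column names and values are hypothetical, not from the case data):

import pandas as pd

# a small, made-up messy dataset for illustration
raw = pd.DataFrame({'Patient Age': [63, None, 71],
                    'Smoker?': ['yes', 'no', 'no']})

clean = (raw
         .rename(columns={'Patient Age': 'age', 'Smoker?': 'smoking'})      # relabel columns
         .dropna()                                                          # drop rows with missing values
         .assign(smoking=lambda d: d['smoking'].map({'yes': 1, 'no': 0})))  # recode text as numbers
print(clean)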

In previous courses, you learned the basics of Python programming and object-oriented Python. In this course, we use Python together with the libraries NumPy and Pandas: high-performance libraries especially suited to data manipulation and numerical computation.

Data processing example: the heart failure case

Cardiovascular diseases kill approximately 17 million people globally every year, mainly through myocardial infarctions and heart failure. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used for biostatistical analysis aimed at highlighting patterns and correlations that would otherwise be undetectable by medical doctors. Machine learning can predict patients' survival from their data and can identify the most important features among those included in their medical records [1]. As a data scientist, you are asked to assess whether the data can be used for modelling and to select the most important features for predicting a patient's survival. The data for the analysis is available in heart_failure_clinical_records_dataset.csv; the data description can be found in data_description.csv.

[1] https://doi.org/10.1186/s12911-020-1023-5

import pandas as pd
import numpy as np

Step 1: Inspect the data

The first step is inspecting the data and getting an idea of the meaning, format, and units of the variables.

# load and display the meta data, the data that describes the data
md = pd.read_csv('data/data_description.csv', sep=';')
md

    Feature                    Explanation                                         Measurement
0   Age                        Age of patient                                      years
1   Anaemia                    Decrease of red blood cells or hemoglobin           Boolean
2   High blood pressure        If a patient has hypertension                       Boolean
3   Creatinine phosphokinase   Level of the CPK enzyme in the blood                mcg/L
4   Diabetes                   If the patient has diabetes                         Boolean
5   Ejection fraction          Percentage of blood leaving the heart at each ...   Percentage
6   Sex                        Woman or Man                                        Binary
7   Platelets                  Platelets in the blood                              kiloplatelets/mL
8   Serum creatinine           Level of creatinine in the blood                    mg/dL
9   Serum sodium               Level of sodium in the blood                        mEq/L
10  Smoking                    If the patient smokes                               Boolean
11  Time                       Follow-up period                                    Days
12  death event                If the patient died during the follow-up period    Boolean

The death event is the class variable: it is the outcome we want to predict. It is a boolean: if the death event is 1 (True), the patient died during the follow-up period; if it is 0 (False), the patient survived.
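Once the records are loaded (next code cell), a quick check of the class balance is a useful first sanity check; a one-line sketch:

# count how many patients survived (0) versus died (1)
df['DEATH_EVENT'].value_counts()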

# load and display data 
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')
df.head(5)
this dataset contains 299 rows

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT
0  75.0        0                       582         0                 20                    1  265000.00               1.9           130  1.0      0.0     4            1
1  55.0        0                      7861         0                 38                    0  263358.03               1.1           136  1.0      NaN     6            1
2  65.0        0                       146         0                 20                    0  162000.00               1.3           129  1.0      1.0     7            1
3  50.0        1                       111         0                 20                    0  210000.00               1.9           137  1.0      0.0     7            1
4  65.0        1                       160         1                 20                    0  327000.00               2.7           116  0.0      0.0     8            1

list(df.columns)
['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']

Mind you, the column names in the metadata differ slightly from those in the clinical records, and the order differs as well. We must take that into account if we want to use the metadata to select a subset of the clinical records.
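One way to reconcile the two is to derive dataframe-style column names from the metadata feature names; a sketch (note that 'death event' becomes 'death_event', not 'DEATH_EVENT', so one manual correction would still remain):

# sketch: lowercase the metadata feature names and replace spaces with underscores
md['column'] = md['Feature'].str.lower().str.replace(' ', '_')
list(md['column'])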

Missing data

Looking at the dataframe values, we also see a NaN in the column smoking. This means the dataset contains missing values. Let us inspect the missing data.

# first inspect missing data
df.isnull().sum()
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         1
smoking                     1
time                        0
DEATH_EVENT                 0
dtype: int64

The columns sex and smoking each have one missing value. When a column has many missing values, we can consider dropping the whole column from the dataframe. Here, with only one missing value per column, we can either fill the gaps with a guessed (imputed) value or drop the affected rows.
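If we preferred to keep all 299 rows, an imputation sketch could fill each missing value with the column's most frequent value; in this case study, however, we simply drop the two affected rows:

# sketch: impute instead of dropping (not applied in this case study)
for col in ['sex', 'smoking']:
    df[col] = df[col].fillna(df[col].mode()[0])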

df = df.dropna(axis = 0) # drop NaN rows
print(f'this dataset contains {len(df)} rows')
df.isnull().sum()
this dataset contains 297 rows





age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
df.head()

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT
0  75.0        0                       582         0                 20                    1   265000.0               1.9           130  1.0      0.0     4            1
2  65.0        0                       146         0                 20                    0   162000.0               1.3           129  1.0      1.0     7            1
3  50.0        1                       111         0                 20                    0   210000.0               1.9           137  1.0      0.0     7            1
4  65.0        1                       160         1                 20                    0   327000.0               2.7           116  0.0      0.0     8            1
5  90.0        1                        47         0                 40                    1   204000.0               2.1           132  1.0      1.0     8            1

Furthermore, we can see that all the binary and boolean Yes/No data is encoded as a zero or a one. When plotting the data, it might be unclear what these values mean.

df['sex'].value_counts()
1.0    193
0.0    104
Name: sex, dtype: int64

In the metadata the description reads "Woman or Man", so we might want to recode the values accordingly. (Note that we first reload the raw data, so the two dropped rows reappear; that is why the counts below differ slightly from the previous ones.)

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df['sex'] = df['sex'].astype('category') # make the format categorical
df['sex'] = df['sex'].map({0:"Woman", 1: "Man"}) # map the values to the category
df['sex'].value_counts()
Man      194
Woman    104
Name: sex, dtype: int64
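The same recoding can help when plotting the other 0/1 columns; a sketch that works on a copy so the dataframe used in the rest of the analysis stays unchanged:

# for plotting only: recode the remaining 0/1 columns on a copy, leaving df untouched
df_plot = df.copy()
for col in ['anaemia', 'diabetes', 'high_blood_pressure', 'smoking']:
    df_plot[col] = df_plot[col].map({0: 'No', 1: 'Yes'})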

Inspect the datatypes

We changed the sex column to a category, but what are the datatypes of the other columns?

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    int64   
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    int64   
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    int64   
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   298 non-null    float64 
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    int64   
dtypes: category(1), float64(4), int64(8)
memory usage: 28.5 KB

We know that some of the integer columns really represent booleans (logical values). Let's change that. (Be careful: astype('bool') silently converts the NaN in smoking to True, which is why smoking shows 299 non-null values after the conversion.)

df["anaemia"] = df["anaemia"].astype('bool')
df["high_blood_pressure"] = df["high_blood_pressure"].astype('bool')
df["diabetes"] = df["diabetes"].astype('bool')
df["smoking"] = df["smoking"].astype('bool')
df["DEATH_EVENT"] = df["DEATH_EVENT"].astype('bool')
df.head()

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium    sex  smoking  time  DEATH_EVENT
0  75.0    False                       582     False                 20                 True  265000.00               1.9           130    Man    False     4         True
1  55.0    False                      7861     False                 38                False  263358.03               1.1           136    Man     True     6         True
2  65.0    False                       146     False                 20                False  162000.00               1.3           129    Man     True     7         True
3  50.0     True                       111     False                 20                False  210000.00               1.9           137    Man    False     7         True
4  65.0     True                       160      True                 20                False  327000.00               2.7           116  Woman    False     8         True

df['anaemia'].value_counts()
False    170
True     129
Name: anaemia, dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    bool    
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    bool    
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    bool    
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   299 non-null    bool    
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    bool    
dtypes: bool(5), category(1), float64(3), int64(4)
memory usage: 18.3 KB

Step 2: Explore data

It is useful to understand the range of the data. The function describe displays descriptive statistics of the numerical columns.

df.describe()

              age  creatinine_phosphokinase  ejection_fraction      platelets  serum_creatinine  serum_sodium        time
count  299.000000                299.000000         299.000000     299.000000         299.00000    299.000000  299.000000
mean    60.833893                581.839465          38.083612  263358.029264           1.39388    136.625418  130.260870
std     11.894809                970.287881          11.834841   97804.236869           1.03451      4.412477   77.614208
min     40.000000                 23.000000          14.000000   25100.000000           0.50000    113.000000    4.000000
25%     51.000000                116.500000          30.000000  212500.000000           0.90000    134.000000   73.000000
50%     60.000000                250.000000          38.000000  262000.000000           1.10000    137.000000  115.000000
75%     70.000000                582.000000          45.000000  303500.000000           1.40000    140.000000  203.000000
max     95.000000               7861.000000          80.000000  850000.000000           9.40000    148.000000  285.000000

What we can see is that the data ranges differ per feature. If we want to use the data for prediction, we will need to normalize it later on; we can do that with NumPy. From the describe table we can also see that most of the features are not symmetrically distributed. Let us inspect the distributions by plotting.
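Since normalization is mentioned here, a minimal NumPy sketch of z-score normalization on a made-up array (this is the same standardization that StandardScaler applies in step 4):

# z-score normalization: subtract the column mean, divide by the column standard deviation
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(X_demo_norm.mean(axis=0))  # approximately 0 per column
print(X_demo_norm.std(axis=0))   # 1 per column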

Plotting the data

Plot distributions of numeric values

# plot the distributions of the numeric columns
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show

df_num = df.select_dtypes(include=['float64', 'int64'])

def make_plot(title, hist, edges):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.y_range.start = 0
    p.xaxis.axis_label = 'value'
    p.yaxis.axis_label = 'count'
    p.grid.grid_line_color="white"
    return p

# build one histogram per numeric column
g = []
for col in df_num.columns:
    hist, edges = np.histogram(df_num[col], bins=40)
    g.append(make_plot(col, hist, edges))


output_file('histogram.html', title="distribution plots")
show(gridplot(g, ncols=4, plot_width=250, plot_height=250, toolbar_location=None))

[Figure: histograms of the distributions of the numeric features]

We can also plot the number of death events against the follow-up time:
grouped = pd.DataFrame(df.groupby('time')['DEATH_EVENT'].sum())
print(grouped.head(10))
      DEATH_EVENT
time             
4             1.0
6             1.0
7             2.0
8             2.0
10            6.0
11            2.0
12            0.0
13            1.0
14            2.0
15            2.0
output_file('bar.html')
p = figure(title="death events in time")
p.vbar(x='time', top='DEATH_EVENT', width=0.9, source=grouped)
p.xaxis.axis_label = 'number of days'
p.yaxis.axis_label = 'number of deaths'
show(p)

Step 3: Clean data

Based on the inspection and exploration of the data, we decide to drop the column time; the feature time will not be used for prediction. All other variables are kept for further analysis. For computational convenience, the int64 encoding is used instead of booleans and categories. Furthermore, the data needs to be cleaned of outliers and normalized.

import numpy as np

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df = df.dropna(axis = 0) # drop NaN rows
df = df.drop(['time'],axis = 1) # drop time column
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)] # keep rows where every feature is within 3 standard deviations of its mean (z-score outlier removal)

print(f'this dataset contains {len(df)} rows and {len(df.columns)} columns')
df.head()
this dataset contains 280 rows and 12 columns

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  DEATH_EVENT
0  75.0        0                       582         0                 20                    1   265000.0               1.9           130  1.0      0.0            1
2  65.0        0                       146         0                 20                    0   162000.0               1.3           129  1.0      1.0            1
3  50.0        1                       111         0                 20                    0   210000.0               1.9           137  1.0      0.0            1
5  90.0        1                        47         0                 40                    1   204000.0               2.1           132  1.0      1.0            1
6  75.0        1                       246         0                 15                    0   127000.0               1.2           137  1.0      0.0            1

Step 4: Split into feature matrix and class vector, and normalize the features

y = np.array(df['DEATH_EVENT'])  # class vector
X = np.array(df.iloc[:, 0:11])   # feature matrix: all columns except DEATH_EVENT
print(y.shape)
print(X.shape)
y = y.reshape(-1, 1)             # reshape the class vector into a column vector
print(y.shape)
(280,)
(280, 11)
(280, 1)
# normalize the data (z-score standardization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X)
X = scaler.transform(X)
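As a quick sanity check (a sketch), the standardized columns should now have a mean of approximately 0 and a standard deviation of 1:

# verify the standardization: per-column mean ~0 and std ~1
print(X.mean(axis=0).round(6))
print(X.std(axis=0).round(6))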

We now have a cleaned, normalized feature matrix and a class vector. The dataset is successfully prepared for machine learning algorithms that predict the heart failure death event.
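To illustrate that the prepared arrays plug directly into a machine learning algorithm, here is a minimal sketch with a simple scikit-learn classifier; model choice and proper train/test evaluation are beyond the scope of this section:

# sketch: fit a simple classifier on the prepared arrays (illustration only)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y.ravel())           # ravel() flattens the (n, 1) column vector to (n,)
print(model.score(X, y.ravel()))  # training accuracy, for illustration only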
