# Heart Failure Case Study

## Data processing

In the [`why we love numpy`](https://fennaf.gitbook.io/bfvm19prog1/study-cases/why-we-love-numpy) case, we applied linear regression to a randomly generated dataset. We can apply the same principles to data coming from experiments or studies, as long as the data is nicely structured in a 2D array format. Unfortunately, data from real-life cases is often not nicely structured, and we need to manipulate the unstructured and/or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis, or because they contain too many missing values. Maybe we need to relabel columns, reformat characters into numerical values, or combine data from several sources. Cleaning and manipulating data into a structured form is called **data processing**. Data processing starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and utilized by users.
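The kinds of manipulations listed above can be sketched in pandas on a tiny made-up dataframe (all column names and values here are invented for illustration):

```python
import pandas as pd
import numpy as np

# a tiny made-up messy dataset for illustration
raw = pd.DataFrame({
    'Patient Age': [63, 71, np.nan, 55],
    'smoker': ['yes', 'no', 'yes', 'no'],
    'notes': ['ok', 'ok', 'ok', 'ok'],  # not needed for the analysis
})

clean = (raw
         .drop(columns=['notes'])                 # drop unused columns
         .rename(columns={'Patient Age': 'age'})  # relabel columns
         .dropna()                                # drop rows with missing values
         .assign(smoker=lambda d: d['smoker'].map({'yes': 1, 'no': 0})))  # characters to numbers

print(clean)
```

The result is a clean numerical table with three rows and two columns, ready for further analysis.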

In previous courses, you learned the basics of programming in Python and object-oriented Python. In this course, we use Python with the libraries `NumPy` and `Pandas`. These are high-performance libraries especially suitable for data manipulation and data computation.

## Data Processing Example: Heart Failure Case

Cardiovascular diseases kill approximately 17 million people globally every year, mainly through myocardial infarctions and heart failure. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistical analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning can predict patients' survival from their data and can identify the most important features among those included in their medical records\[1]. As a data scientist, you are required to assess whether the data can be used for modelling and to select the most important features for predicting the patient's survival. Data for the analysis is available in [`heart_failure_clinical_records_dataset.csv`](https://github.com/fenna/BFVM19PROG1/blob/main/data/heart_failure_clinical_records_dataset.csv). The data description can be found in the table [`data_description.csv`](https://github.com/fenna/BFVM19PROG1/blob/main/data/data_description.csv).

\[1] <https://doi.org/10.1186/s12911-020-1023-5>

```python
import pandas as pd
import numpy as np
```

### Step 1: Inspect the data

The first step is inspecting the data and getting an idea of the meaning of the variables, their format, and their units.

```python
# load and display the meta data, the data that describes the data
md = pd.read_csv('data/data_description.csv', sep=';')
md
```


|    | Feature                  | Explanation                                       | Measurement      |
| -- | ------------------------ | ------------------------------------------------- | ---------------- |
| 0  | Age                      | Age of patient                                    | years            |
| 1  | Anaemia                  | Decrease of red blood cells or hemoglobin         | Boolean          |
| 2  | High blood pressure      | If a patient has hypertension                     | Boolean          |
| 3  | Creatinine phosphokinase | Level of the CPK enzyme in the blood              | mcg/L            |
| 4  | Diabetes                 | If the patient has diabetes                       | Boolean          |
| 5  | Ejection fraction        | Percentage of blood leaving the heart at each ... | Percentage       |
| 6  | Sex                      | Woman or Man                                      | Binary           |
| 7  | Platelets                | Platelets in the blood                            | kiloplatelets/mL |
| 8  | Serum creatinine         | Level of creatinine in the blood                  | mg/dL            |
| 9  | Serum sodium             | Level of sodium in the blood                      | mEq/L            |
| 10 | Smoking                  | If the patient smokes                             | Boolean          |
| 11 | Time                     | Follow-up period                                  | Days             |
| 12 | death event              | If the patient died during the follow-up period   | Boolean          |

The death event will be used to predict the survival rate and is the class variable. The variable `death event` is a boolean: if `death event` is 1 (True), the patient died; if `death event` is 0 (False), the patient survived.
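Because the class is encoded as 0/1, summary statistics follow directly from the column; for instance, the survival rate is one minus the column mean. A toy sketch with made-up outcomes:

```python
import pandas as pd

# hypothetical outcomes for five patients: 1 = died, 0 = survived
death_event = pd.Series([1, 0, 0, 1, 0])

# fraction of patients who survived the follow-up period
survival_rate = 1 - death_event.mean()
print(survival_rate)  # 0.6
```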

```python
# load and display data 
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')
df.head(5)
```

```
this dataset contains 299 rows
```

|   | age  | anaemia | creatinine\_phosphokinase | diabetes | ejection\_fraction | high\_blood\_pressure | platelets | serum\_creatinine | serum\_sodium | sex | smoking | time | DEATH\_EVENT |
| - | ---- | ------- | ------------------------- | -------- | ------------------ | --------------------- | --------- | ----------------- | ------------- | --- | ------- | ---- | ------------ |
| 0 | 75.0 | 0       | 582                       | 0        | 20                 | 1                     | 265000.00 | 1.9               | 130           | 1.0 | 0.0     | 4    | 1            |
| 1 | 55.0 | 0       | 7861                      | 0        | 38                 | 0                     | 263358.03 | 1.1               | 136           | 1.0 | NaN     | 6    | 1            |
| 2 | 65.0 | 0       | 146                       | 0        | 20                 | 0                     | 162000.00 | 1.3               | 129           | 1.0 | 1.0     | 7    | 1            |
| 3 | 50.0 | 1       | 111                       | 0        | 20                 | 0                     | 210000.00 | 1.9               | 137           | 1.0 | 0.0     | 7    | 1            |
| 4 | 65.0 | 1       | 160                       | 1        | 20                 | 0                     | 327000.00 | 2.7               | 116           | 0.0 | 0.0     | 8    | 1            |

```python
list(df.columns)
```

```
['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']
```

Mind you, the column names in the metadata differ slightly from those in the clinical records, and the order is different as well. We must take that into account if we want to use the metadata to select a subset of the clinical records.
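One possible way to bridge the naming difference (a sketch, assuming the metadata names differ mostly in case and spacing) is to normalize the metadata feature names to the clinical-record style, lowercase with underscores:

```python
import pandas as pd

# a few metadata feature names as they appear in data_description.csv
features = pd.Series(['Age', 'High blood pressure', 'Serum creatinine'])

# normalize to the clinical-record naming style: lowercase, spaces -> underscores
normalized = features.str.lower().str.replace(' ', '_')
print(list(normalized))
```

Note that this simple rule does not cover every column, e.g. `death event` versus `DEATH_EVENT`, so remaining exceptions would still need to be handled by hand.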

#### Missing data

Looking at the dataframe values we also see `NaN` in the column `smoking`, which means the dataset contains missing data. Let us inspect the missing data.

```python
# first inspect missing data
df.isnull().sum()
```

```
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         1
smoking                     1
time                        0
DEATH_EVENT                 0
dtype: int64
```

The columns `sex` and `smoking` have missing values. When a column has many missing values we can consider dropping the column from the dataframe. In this case, with only a single missing value per column, we can either fill the value with a guess or drop the row.
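Both options can be sketched on a toy series (the fill value here is just one possible guess, the most frequent value):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 0.0, np.nan, 1.0], name='smoking')

# option 1: fill with a guessed value, here the most frequent one (the mode)
filled = s.fillna(s.mode()[0])

# option 2: drop the rows that contain the missing value
dropped = s.dropna()

print(filled.tolist())   # [1.0, 0.0, 1.0, 1.0]
print(len(dropped))      # 3
```

Which option is appropriate depends on the analysis: filling keeps all rows but introduces guessed values, dropping keeps only observed values but shrinks the dataset.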

```python
df = df.dropna(axis = 0) # drop NaN rows
print(f'this dataset contains {len(df)} rows')
df.isnull().sum()
```

```
this dataset contains 297 rows





age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
```

```python
df.head()
```

|   | age  | anaemia | creatinine\_phosphokinase | diabetes | ejection\_fraction | high\_blood\_pressure | platelets | serum\_creatinine | serum\_sodium | sex | smoking | time | DEATH\_EVENT |
| - | ---- | ------- | ------------------------- | -------- | ------------------ | --------------------- | --------- | ----------------- | ------------- | --- | ------- | ---- | ------------ |
| 0 | 75.0 | 0       | 582                       | 0        | 20                 | 1                     | 265000.0  | 1.9               | 130           | 1.0 | 0.0     | 4    | 1            |
| 2 | 65.0 | 0       | 146                       | 0        | 20                 | 0                     | 162000.0  | 1.3               | 129           | 1.0 | 1.0     | 7    | 1            |
| 3 | 50.0 | 1       | 111                       | 0        | 20                 | 0                     | 210000.0  | 1.9               | 137           | 1.0 | 0.0     | 7    | 1            |
| 4 | 65.0 | 1       | 160                       | 1        | 20                 | 0                     | 327000.0  | 2.7               | 116           | 0.0 | 0.0     | 8    | 1            |
| 5 | 90.0 | 1       | 47                        | 0        | 40                 | 1                     | 204000.0  | 2.1               | 132           | 1.0 | 1.0     | 8    | 1            |

Furthermore, we can see that all the binary and boolean Yes/No data is displayed as either a zero or a one. It might be unclear what this means when plotting the data.

```python
df['sex'].value_counts()
```

```
1.0    193
0.0    104
Name: sex, dtype: int64
```

In the metadata we see the description "Woman or Man", so we might want to change that.

```python
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv') # reload the original data, including the dropped rows
df['sex'] = df['sex'].astype('category') # make the format categorical
df['sex'] = df['sex'].map({0:"Woman", 1: "Man"}) # map the values to the category
df['sex'].value_counts()
```

```
Man      194
Woman    104
Name: sex, dtype: int64
```

#### Inspect the datatypes

We changed the `sex` column to a category, but what are the datatypes of the other columns?

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    int64   
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    int64   
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    int64   
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   298 non-null    float64 
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    int64   
dtypes: category(1), float64(4), int64(8)
memory usage: 28.5 KB
```

We know that some of the integer columns should be booleans (logical). Let's change that.

```python
df["anaemia"] = df["anaemia"].astype('bool')
df["high_blood_pressure"] = df["high_blood_pressure"].astype('bool')
df["diabetes"] = df["diabetes"].astype('bool')
df["smoking"] = df["smoking"].astype('bool')
df["DEATH_EVENT"] = df["DEATH_EVENT"].astype('bool')
df.head()
```


|   | age  | anaemia | creatinine\_phosphokinase | diabetes | ejection\_fraction | high\_blood\_pressure | platelets | serum\_creatinine | serum\_sodium | sex   | smoking | time | DEATH\_EVENT |
| - | ---- | ------- | ------------------------- | -------- | ------------------ | --------------------- | --------- | ----------------- | ------------- | ----- | ------- | ---- | ------------ |
| 0 | 75.0 | False   | 582                       | False    | 20                 | True                  | 265000.00 | 1.9               | 130           | Man   | False   | 4    | True         |
| 1 | 55.0 | False   | 7861                      | False    | 38                 | False                 | 263358.03 | 1.1               | 136           | Man   | True    | 6    | True         |
| 2 | 65.0 | False   | 146                       | False    | 20                 | False                 | 162000.00 | 1.3               | 129           | Man   | True    | 7    | True         |
| 3 | 50.0 | True    | 111                       | False    | 20                 | False                 | 210000.00 | 1.9               | 137           | Man   | False   | 7    | True         |
| 4 | 65.0 | True    | 160                       | True     | 20                 | False                 | 327000.00 | 2.7               | 116           | Woman | False   | 8    | True         |

```python
df['anaemia'].value_counts()
```

```
False    170
True     129
Name: anaemia, dtype: int64
```

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    bool    
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    bool    
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    bool    
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   299 non-null    bool    
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    bool    
dtypes: bool(5), category(1), float64(3), int64(4)
memory usage: 18.3 KB
```
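Note that `smoking` is now listed as 299 non-null even though it contained a missing value before the cast: converting a float column to `bool` silently turns `NaN` into `True`, because `NaN` is truthy in Python. A minimal demonstration:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])
print(s.astype('bool').tolist())  # [False, True, True] -- the NaN became True
```

For this reason it is safer to drop or fill missing values before casting a column to booleans.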

### Step 2: Explore data

It is useful to understand the range of the data. A function that displays the descriptive statistics of the numerical data is `describe`.

```python
df.describe()
```

|       | age        | creatinine\_phosphokinase | ejection\_fraction | platelets     | serum\_creatinine | serum\_sodium | time       |
| ----- | ---------- | ------------------------- | ------------------ | ------------- | ----------------- | ------------- | ---------- |
| count | 299.000000 | 299.000000                | 299.000000         | 299.000000    | 299.00000         | 299.000000    | 299.000000 |
| mean  | 60.833893  | 581.839465                | 38.083612          | 263358.029264 | 1.39388           | 136.625418    | 130.260870 |
| std   | 11.894809  | 970.287881                | 11.834841          | 97804.236869  | 1.03451           | 4.412477      | 77.614208  |
| min   | 40.000000  | 23.000000                 | 14.000000          | 25100.000000  | 0.50000           | 113.000000    | 4.000000   |
| 25%   | 51.000000  | 116.500000                | 30.000000          | 212500.000000 | 0.90000           | 134.000000    | 73.000000  |
| 50%   | 60.000000  | 250.000000                | 38.000000          | 262000.000000 | 1.10000           | 137.000000    | 115.000000 |
| 75%   | 70.000000  | 582.000000                | 45.000000          | 303500.000000 | 1.40000           | 140.000000    | 203.000000 |
| max   | 95.000000  | 7861.000000               | 80.000000          | 850000.000000 | 9.40000           | 148.000000    | 285.000000 |

What we can see is that the data ranges differ per feature. If we want to use the data for prediction we need to normalize the data later on; we can do that with NumPy. From the describe table we can also see that most of the data is not symmetrically distributed. Let us inspect the distributions by plotting.
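The normalization mentioned above can be done with plain NumPy as a z-score per feature: subtract the column mean and divide by the column standard deviation. A small sketch with made-up values:

```python
import numpy as np

# made-up feature matrix: two features on very different scales
X = np.array([[40.0, 25100.0],
              [60.0, 262000.0],
              [95.0, 850000.0]])

# z-score normalization per column: subtract the mean, divide by the standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```

After this transformation every feature has zero mean and unit variance, so features on large scales no longer dominate the analysis.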

### Plotting the data

#### Plot distributions of numeric values

```python
#plot numeric values distributions
df_num = df.select_dtypes(include=['float64', 'int64'])
```

```python
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show

def make_plot(title, hist, edges):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.y_range.start = 0
    p.xaxis.axis_label = 'value'
    p.yaxis.axis_label = 'count'
    p.grid.grid_line_color="white"
    return p

# Distribution
g = []
for i in range(len(df_num.columns)):
    hist, edges = np.histogram(df_num[df_num.columns[i]], bins=40)
    p = make_plot(f" {df_num.columns[i]}", hist, edges)
    g.append(p)


output_file('histogram.html', title="distribution plots")
show(gridplot(g, ncols=4, width=250, height=250, toolbar_location=None))  # older Bokeh versions use plot_width/plot_height
```


![plotting the distributions ](/files/-MJwmeg87HP68osHYW__)

```python
grouped = pd.DataFrame(df.groupby('time')['DEATH_EVENT'].sum())
print(grouped.head(10))
```

```
      DEATH_EVENT
time             
4             1.0
6             1.0
7             2.0
8             2.0
10            6.0
11            2.0
12            0.0
13            1.0
14            2.0
15            2.0
```

```python
p = figure(title="death events in time")
p.vbar(x='time', top='DEATH_EVENT', width=0.9, source=grouped)
p.xaxis.axis_label = 'number of days'
p.yaxis.axis_label = 'number of deaths'

output_file('bar.html')  # set the output file before showing the plot
show(p)
```

![](/files/-MJwmSZT4F1K4X-iopKA)

### Step 3: Clean data

Based on the inspection and exploration of the data, it was decided to drop the column `time`; the feature time will not be used for prediction. All other variables will be used for further analysis. For computational convenience, the int64 data is used instead of booleans and categories. Furthermore, the data needs to be transformed and normalized.

```python
import numpy as np

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df = df.dropna(axis = 0) # drop NaN rows
df = df.drop(['time'],axis = 1) # drop time column
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)] # remove outliers

print(f'this dataset contains {len(df)} rows and {len(df.columns)} columns')
df.head()
```

```
this dataset contains 280 rows and 12 columns
```

|   | age  | anaemia | creatinine\_phosphokinase | diabetes | ejection\_fraction | high\_blood\_pressure | platelets | serum\_creatinine | serum\_sodium | sex | smoking | DEATH\_EVENT |
| - | ---- | ------- | ------------------------- | -------- | ------------------ | --------------------- | --------- | ----------------- | ------------- | --- | ------- | ------------ |
| 0 | 75.0 | 0       | 582                       | 0        | 20                 | 1                     | 265000.0  | 1.9               | 130           | 1.0 | 0.0     | 1            |
| 2 | 65.0 | 0       | 146                       | 0        | 20                 | 0                     | 162000.0  | 1.3               | 129           | 1.0 | 1.0     | 1            |
| 3 | 50.0 | 1       | 111                       | 0        | 20                 | 0                     | 210000.0  | 1.9               | 137           | 1.0 | 0.0     | 1            |
| 5 | 90.0 | 1       | 47                        | 0        | 40                 | 1                     | 204000.0  | 2.1               | 132           | 1.0 | 1.0     | 1            |
| 6 | 75.0 | 1       | 246                       | 0        | 15                 | 0                     | 127000.0  | 1.2               | 137           | 1.0 | 0.0     | 1            |
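The outlier filter used above keeps only the rows in which every value lies within three standard deviations of its column mean. On a toy column with one extreme value:

```python
import numpy as np
import pandas as pd

# 19 ordinary measurements and one extreme value (made-up numbers)
toy = pd.DataFrame({'cpk': [100] * 19 + [5000]})

# keep rows where all values are within 3 standard deviations of the column mean
mask = toy.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)
print(len(toy[mask]))  # the extreme row is removed
```

Note that with very small samples this rule never triggers, because a single value cannot deviate more than roughly the square root of the sample size in standard deviations.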

### Step 4: Split into features matrix and class vector. Normalize features

```python
y = np.array(df['DEATH_EVENT'])
X = np.array(df.iloc[:,0:11])
print(y.shape)
print(X.shape)
y = y.reshape(-1, 1)
print(y.shape)
```

```
(280,)
(280, 11)
(280, 1)
```

```python
# normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X)
X = scaler.transform(X)
```

We now have a cleaned, normalized feature matrix and a class-variable vector. We have successfully prepared the dataset for machine learning algorithms to predict the heart failure death event.

