# Dealing with missing values

&#x20;Missing data occurs commonly in many data applications. Especially wearable sensor devices encounter measurement errors resulting in missing data. Although pandas can handle missing data, for instance, to exclude missing data in the descriptive analysis, we might decide to reshape our data to improve quality. We might consider to exclude certain rows, columns, or decide to impute our data. As long as we take an argumentative approach.  The most common methods to handle missing data are:

```python
.dropna(axis) # drops columns or rows (axis) when contain NaN
.fillna() # fills the missing values with a value
.isnull() # boolean indicating missing values
```

{% hint style="danger" %}
Mind you that it is not always sufficient just to check if a file contains NaN's. Sometimes a file used spaces as a missing value or another sign like `?`. Especially in the case that a column is of the datatype string when you expect a number you should be alarmed. It might contain a string indicating missing values and therefore the entire column is set to a string datatype. If such is the case we need to change the string indicating the missing value to a NaN and we need to typecast it to the required format
{% endhint %}

```python
df.replace(r'\s+', np.nan, regex=True) 
df["glucose.scan"] = pd.to_numeric(df["glucose.scan"])
```

## Working example: the glucose level measurement

In the following example, we look at a file containing glucose level measurements. First, we import a dataset into a pandas DataFrame. We change the index to the time of measurement to inspect the data related to time. In this file indeed we have NaNs and spaces!

```python
import pandas as pd
import numpy as np
df=pd.read_table('data/JamesBondGlucoseLevels25.txt')
df.head(5)
```

|     | ID   | Time             | Type | hist.glucose.mg.dL. | Scan.glucose.mg.dL. | Not.numeric.fast.insuline | fast.insuline.unit. | Non.numeric.food | carbonhydrates.gram. | Non.numeric.long.insuline | long.insuline.unit. | Remarks | Strip.glucose.mg.dL. | Keton.mmol.L. | meal.insuline.unit. | Correctioninsuline.unit. | userchange.insuline.unit. | previous.time | adjusted.time |
| --- | ---- | ---------------- | ---- | ------------------- | ------------------- | ------------------------- | ------------------- | ---------------- | -------------------- | ------------------------- | ------------------- | ------- | -------------------- | ------------- | ------------------- | ------------------------ | ------------------------- | ------------- | ------------- |
| 561 | 2845 | 2019-04-25 00:08 | 1    | NaN                 | 109                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 562 | 2850 | 2019-04-25 00:50 | 1    | NaN                 |                     | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 563 | 2877 | 2019-04-25 07:02 | 1    | NaN                 | 123                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 564 | 2881 | 2019-04-25 07:34 | 1    | NaN                 | 158                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 565 | 2886 | 2019-04-25 08:19 | 1    | NaN                 |                     | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |

Let us change the index since 'Time' makes more sense to use as an index

```python
df=df.set_index('Time')
df.head(5)
```

<br>

|                  | ID   | Type | hist.glucose.mg.dL. | Scan.glucose.mg.dL. | Not.numeric.fast.insuline | fast.insuline.unit. | Non.numeric.food | carbonhydrates.gram. | Non.numeric.long.insuline | long.insuline.unit. | Remarks | Strip.glucose.mg.dL. | Keton.mmol.L. | meal.insuline.unit. | Correctioninsuline.unit. | userchange.insuline.unit. | previous.time | adjusted.time |
| ---------------- | ---- | ---- | ------------------- | ------------------- | ------------------------- | ------------------- | ---------------- | -------------------- | ------------------------- | ------------------- | ------- | -------------------- | ------------- | ------------------- | ------------------------ | ------------------------- | ------------- | ------------- |
| Time             |      |      |                     |                     |                           |                     |                  |                      |                           |                     |         |                      |               |                     |                          |                           |               |               |
| 2019-04-25 00:08 | 2845 | 1    | NaN                 | 109                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 2019-04-25 00:50 | 2850 | 1    | NaN                 |                     | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 2019-04-25 07:02 | 2877 | 1    | NaN                 | 123                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 2019-04-25 07:34 | 2881 | 1    | NaN                 | 158                 | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |
| 2019-04-25 08:19 | 2886 | 1    | NaN                 |                     | NaN                       | NaN                 | NaN              | NaN                  | NaN                       | NaN                 | NaN     | NaN                  | NaN           | NaN                 | NaN                      | NaN                       | NaN           | NaN           |

Now we clean the data by rename some columns and select only the needed information. I choose to fill the empty spaces with `np.nan` for further processing

```python
def reformat():
    df=pd.read_table('data/JamesBondGlucoseLevels25.txt')
    df=df.set_index('Time')
    df = df.rename(columns = {
                         'hist.glucose.mg.dL.':'glucose.hist',
                         'Scan.glucose.mg.dL.':'glucose.scan',
                         'previous.time':'time.previous',
                         'adjusted.time':'time.new'})
    columns_to_keep = [
                   'Type',
                   'glucose.hist',
                   'glucose.scan',
                   'time.previous',
                   'time.new']
    df = df[columns_to_keep].replace(r'\s+', np.nan, regex=True) 
    return df

df = reformat()
df.head(5)
```

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | NaN          | 109          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | NaN          | NaN          | NaN           | NaN      |
| 2019-04-25 07:02 | 1    | NaN          | 123          | NaN           | NaN      |
| 2019-04-25 07:34 | 1    | NaN          | 158          | NaN           | NaN      |
| 2019-04-25 08:19 | 1    | NaN          | NaN          | NaN           | NaN      |

## Automatically fill missing values

One of the handy functions that Pandas has for working with missing values is the filling function, `fillna`. This function takes a number or parameters, for instance, you could pass in a single value which is called a scalar value to change all of the missing data to one value.

```python
df.fillna?
```

```python
df = df.sort_index()
df.head(5)
```

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | NaN          | 109          | NaN           | NaN      |
| 2019-04-25 00:14 | 0    | 93.0         | NaN          | NaN           | NaN      |
| 2019-04-25 00:29 | 0    | 95.0         | NaN          | NaN           | NaN      |
| 2019-04-25 00:44 | 0    | 103.0        | NaN          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | NaN          | NaN          | NaN           | NaN      |

## Fill by method parameters

The two common fill values are `ffill` and `bfill`. `ffill`is for forward filling and it updates an NaN value for a particular cell with the value from the previous row.&#x20;

{% hint style="warning" %}
It's important to note that your data needs to be sorted in order for this to have the effect you might want. Data that comes from traditional database management systems usually has no order guarantee, So be careful. First sort the data
{% endhint %}

```python
df = df.fillna(method='ffill')
df.head(5)
```

\
&#x20; &#x20;

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | NaN          | 109          | NaN           | NaN      |
| 2019-04-25 00:14 | 0    | 93.0         | 109          | NaN           | NaN      |
| 2019-04-25 00:29 | 0    | 95.0         | 109          | NaN           | NaN      |
| 2019-04-25 00:44 | 0    | 103.0        | 109          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | 103.0        | 109          | NaN           | NaN      |

```python
df = df.fillna(method='bfill')
df.head(5)
```

\
&#x20;  <br>

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | 93.0         | 109          | NaN           | NaN      |
| 2019-04-25 00:14 | 0    | 93.0         | 109          | NaN           | NaN      |
| 2019-04-25 00:29 | 0    | 95.0         | 109          | NaN           | NaN      |
| 2019-04-25 00:44 | 0    | 103.0        | 109          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | 103.0        | 109          | NaN           | NaN      |

## Imputation

In the example above we filled the data with previous or following data. This might be handy in the case of few missing data. Another strategy might be to fill the missing data with calculated data (average). You could use `np.mean` for instance. This can be done easily since when you use statistical functions on DataFrames, these functions typically ignore missing values. For instance if you try and calculate the mean value of a DataFrame, the underlying NumPy function will ignore missing values. This is usually what you want but you should be aware that values are being excluded

```python
df = reformat()

df.sort_index()
df.head(5)
```

\
&#x20;   .<br>

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | NaN          | 109          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | NaN          | NaN          | NaN           | NaN      |
| 2019-04-25 07:02 | 1    | NaN          | 123          | NaN           | NaN      |
| 2019-04-25 07:34 | 1    | NaN          | 158          | NaN           | NaN      |
| 2019-04-25 08:19 | 1    | NaN          | NaN          | NaN           | NaN      |

First I have to check if my data is numeric

```python
print(df.dtypes)
```

```
Type               int64
glucose.hist     float64
glucose.scan      object
time.previous    float64
time.new         float64
dtype: object
```

glucose.scan is not so we have to change this

```python
df["glucose.scan"] = pd.to_numeric(df["glucose.scan"])
print(df.dtypes)
df.head(5)
```

```
Type               int64
glucose.hist     float64
glucose.scan     float64
time.previous    float64
time.new         float64
dtype: object
```

\
&#x20;  &#x20;

| Type             | glucose.hist | glucose.scan | time.previous | time.new |     |
| ---------------- | ------------ | ------------ | ------------- | -------- | --- |
| Time             |              |              |               |          |     |
| 2019-04-25 00:08 | 1            | NaN          | 109.0         | NaN      | NaN |
| 2019-04-25 00:50 | 1            | NaN          | NaN           | NaN      | NaN |
| 2019-04-25 07:02 | 1            | NaN          | 123.0         | NaN      | NaN |
| 2019-04-25 07:34 | 1            | NaN          | 158.0         | NaN      | NaN |
| 2019-04-25 08:19 | 1            | NaN          | NaN           | NaN      | NaN |

Now we can calculate means and fill the empty data with means

```python
df = df.fillna(df.mean())
#give them nice format
df[['glucose.scan','glucose.hist']]= df[['glucose.scan','glucose.hist']].astype(int)
df.head(5)
```

\
&#x20; &#x20;

|                  | Type | glucose.hist | glucose.scan | time.previous | time.new |
| ---------------- | ---- | ------------ | ------------ | ------------- | -------- |
| Time             |      |              |              |               |          |
| 2019-04-25 00:08 | 1    | 115          | 109          | NaN           | NaN      |
| 2019-04-25 00:50 | 1    | 115          | 123          | NaN           | NaN      |
| 2019-04-25 07:02 | 1    | 115          | 123          | NaN           | NaN      |
| 2019-04-25 07:34 | 1    | 115          | 158          | NaN           | NaN      |
| 2019-04-25 08:19 | 1    | 115          | 123          | NaN           | NaN      |

## Filling with sense

Taking the average of the entire dataset is not often a sensible approach. A more sensible approach is taking the daily average or an hourly average. We can estimate these for instance with `groupby` or `aggregate` methods&#x20;

```python
.groupby('Day').aggregate({'glucose.scan':np.mean})
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://fennaf.gitbook.io/bfvm19prog1/data-wrangling/dealing-with-missing-values.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
