Dealing with missing values

Missing data occurs commonly in many data applications. Especially wearable sensor devices encounter measurement errors resulting in missing data. Although pandas can handle missing data, for instance, to exclude missing data in the descriptive analysis, we might decide to reshape our data to improve quality. We might consider to exclude certain rows, columns, or decide to impute our data. As long as we take an argumentative approach. The most common methods to handle missing data are:

.dropna(axis) # drops columns or rows (axis) when contain NaN
.fillna() # fills the missing values with a value
.isnull() # boolean indicating missing values

Mind you that it is not always sufficient just to check if a file contains NaN's. Sometimes a file used spaces as a missing value or another sign like ?. Especially in the case that a column is of the datatype string when you expect a number you should be alarmed. It might contain a string indicating missing values and therefore the entire column is set to a string datatype. If such is the case we need to change the string indicating the missing value to a NaN and we need to typecast it to the required format

df.replace(r'\s+', np.nan, regex=True) 
df["glucose.scan"] = pd.to_numeric(df["glucose.scan"])

Working example: the glucose level measurement

In the following example, we look at a file containing glucose level measurements. First, we import a dataset into a pandas DataFrame. We change the index to the time of measurement to inspect the data related to time. In this file indeed we have NaNs and spaces!

import pandas as pd
import numpy as np
df=pd.read_table('data/JamesBondGlucoseLevels25.txt')
df.head(5)

Time

Type

hist.glucose.mg.dL.

Scan.glucose.mg.dL.

Not.numeric.fast.insuline

fast.insuline.unit.

Non.numeric.food

carbonhydrates.gram.

Non.numeric.long.insuline

long.insuline.unit.

Remarks

Strip.glucose.mg.dL.

Keton.mmol.L.

meal.insuline.unit.

Correctioninsuline.unit.

userchange.insuline.unit.

previous.time

adjusted.time

561

2845

2019-04-25 00:08

NaN

109

NaN

562

2850

2019-04-25 00:50

NaN

563

2877

2019-04-25 07:02

NaN

123

NaN

564

2881

2019-04-25 07:34

NaN

158

NaN

565

2886

2019-04-25 08:19

NaN

Let us change the index since 'Time' makes more sense to use as an index

df=df.set_index('Time')
df.head(5)

Type

hist.glucose.mg.dL.

Scan.glucose.mg.dL.

Not.numeric.fast.insuline

fast.insuline.unit.

Non.numeric.food

carbonhydrates.gram.

Non.numeric.long.insuline

long.insuline.unit.

Remarks

Strip.glucose.mg.dL.

Keton.mmol.L.

meal.insuline.unit.

Correctioninsuline.unit.

userchange.insuline.unit.

previous.time

adjusted.time

Time

2019-04-25 00:08

2845

NaN

109

NaN

2019-04-25 00:50

2850

NaN

2019-04-25 07:02

2877

NaN

123

NaN

2019-04-25 07:34

2881

NaN

158

NaN

2019-04-25 08:19

2886

NaN

Now we clean the data by rename some columns and select only the needed information. I choose to fill the empty spaces with np.nan for further processing

def reformat():
    df=pd.read_table('data/JamesBondGlucoseLevels25.txt')
    df=df.set_index('Time')
    df = df.rename(columns = {
                         'hist.glucose.mg.dL.':'glucose.hist',
                         'Scan.glucose.mg.dL.':'glucose.scan',
                         'previous.time':'time.previous',
                         'adjusted.time':'time.new'})
    columns_to_keep = [
                   'Type',
                   'glucose.hist',
                   'glucose.scan',
                   'time.previous',
                   'time.new']
    df = df[columns_to_keep].replace(r'\s+', np.nan, regex=True) 
    return df

df = reformat()
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

NaN

109

NaN

2019-04-25 00:50

NaN

2019-04-25 07:02

NaN

123

NaN

2019-04-25 07:34

NaN

158

NaN

2019-04-25 08:19

NaN

Automatically fill missing values

One of the handy functions that Pandas has for working with missing values is the filling function, fillna. This function takes a number or parameters, for instance, you could pass in a single value which is called a scalar value to change all of the missing data to one value.

df.fillna?

df = df.sort_index()
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

NaN

109

NaN

2019-04-25 00:14

93.0

NaN

2019-04-25 00:29

95.0

NaN

2019-04-25 00:44

103.0

NaN

2019-04-25 00:50

NaN

Fill by method parameters

The two common fill values are ffill and bfill. ffillis for forward filling and it updates an NaN value for a particular cell with the value from the previous row.

It's important to note that your data needs to be sorted in order for this to have the effect you might want. Data that comes from traditional database management systems usually has no order guarantee, So be careful. First sort the data

df = df.fillna(method='ffill')
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

NaN

109

NaN

2019-04-25 00:14

93.0

109

NaN

2019-04-25 00:29

95.0

109

NaN

2019-04-25 00:44

103.0

109

NaN

2019-04-25 00:50

103.0

109

NaN

df = df.fillna(method='bfill')
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

93.0

109

NaN

2019-04-25 00:14

93.0

109

NaN

2019-04-25 00:29

95.0

109

NaN

2019-04-25 00:44

103.0

109

NaN

2019-04-25 00:50

103.0

109

NaN

Imputation

In the example above we filled the data with previous or following data. This might be handy in the case of few missing data. Another strategy might be to fill the missing data with calculated data (average). You could use np.mean for instance. This can be done easily since when you use statistical functions on DataFrames, these functions typically ignore missing values. For instance if you try and calculate the mean value of a DataFrame, the underlying NumPy function will ignore missing values. This is usually what you want but you should be aware that values are being excluded

df = reformat()

df.sort_index()
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

NaN

109

NaN

2019-04-25 00:50

NaN

2019-04-25 07:02

NaN

123

NaN

2019-04-25 07:34

NaN

158

NaN

2019-04-25 08:19

NaN

First I have to check if my data is numeric

print(df.dtypes)

Type               int64
glucose.hist     float64
glucose.scan      object
time.previous    float64
time.new         float64
dtype: object

glucose.scan is not so we have to change this

df["glucose.scan"] = pd.to_numeric(df["glucose.scan"])
print(df.dtypes)
df.head(5)

Type               int64
glucose.hist     float64
glucose.scan     float64
time.previous    float64
time.new         float64
dtype: object

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

NaN

109.0

NaN

2019-04-25 00:50

NaN

2019-04-25 07:02

NaN

123.0

NaN

2019-04-25 07:34

NaN

158.0

NaN

2019-04-25 08:19

NaN

Now we can calculate means and fill the empty data with means

df = df.fillna(df.mean())
#give them nice format
df[['glucose.scan','glucose.hist']]= df[['glucose.scan','glucose.hist']].astype(int)
df.head(5)

Type

glucose.hist

glucose.scan

time.previous

time.new

Time

2019-04-25 00:08

115

109

NaN

2019-04-25 00:50

115

123

NaN

2019-04-25 07:02

115

123

NaN

2019-04-25 07:34

115

158

NaN

2019-04-25 08:19

115

123

NaN

Filling with sense

Taking the average of the entire dataset is not often a sensible approach. A more sensible approach is taking the daily average or an hourly average. We can estimate these for instance with groupby or aggregate methods

.groupby('Day').aggregate({'glucose.scan':np.mean})

PreviousData wrangling NextCleaning data

Last updated 4 years ago

Was this helpful?