Dealing with missing values
Missing data occurs commonly in many data applications. Especially wearable sensor devices encounter measurement errors resulting in missing data. Although pandas can handle missing data, for instance, to exclude missing data in the descriptive analysis, we might decide to reshape our data to improve quality. We might consider to exclude certain rows, columns, or decide to impute our data. As long as we take an argumentative approach. The most common methods to handle missing data are:
Mind you that it is not always sufficient just to check if a file contains NaN's. Sometimes a file used spaces as a missing value or another sign like ?
. Especially in the case that a column is of the datatype string when you expect a number you should be alarmed. It might contain a string indicating missing values and therefore the entire column is set to a string datatype. If such is the case we need to change the string indicating the missing value to a NaN and we need to typecast it to the required format
Working example: the glucose level measurement
In the following example, we look at a file containing glucose level measurements. First, we import a dataset into a pandas DataFrame. We change the index to the time of measurement to inspect the data related to time. In this file indeed we have NaNs and spaces!
ID
Time
Type
hist.glucose.mg.dL.
Scan.glucose.mg.dL.
Not.numeric.fast.insuline
fast.insuline.unit.
Non.numeric.food
carbonhydrates.gram.
Non.numeric.long.insuline
long.insuline.unit.
Remarks
Strip.glucose.mg.dL.
Keton.mmol.L.
meal.insuline.unit.
Correctioninsuline.unit.
userchange.insuline.unit.
previous.time
adjusted.time
561
2845
2019-04-25 00:08
1
NaN
109
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
562
2850
2019-04-25 00:50
1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
563
2877
2019-04-25 07:02
1
NaN
123
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
564
2881
2019-04-25 07:34
1
NaN
158
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
565
2886
2019-04-25 08:19
1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
Let us change the index since 'Time' makes more sense to use as an index
ID
Type
hist.glucose.mg.dL.
Scan.glucose.mg.dL.
Not.numeric.fast.insuline
fast.insuline.unit.
Non.numeric.food
carbonhydrates.gram.
Non.numeric.long.insuline
long.insuline.unit.
Remarks
Strip.glucose.mg.dL.
Keton.mmol.L.
meal.insuline.unit.
Correctioninsuline.unit.
userchange.insuline.unit.
previous.time
adjusted.time
Time
2019-04-25 00:08
2845
1
NaN
109
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2019-04-25 00:50
2850
1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2019-04-25 07:02
2877
1
NaN
123
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2019-04-25 07:34
2881
1
NaN
158
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2019-04-25 08:19
2886
1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
Now we clean the data by rename some columns and select only the needed information. I choose to fill the empty spaces with np.nan
for further processing
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
NaN
109
NaN
NaN
2019-04-25 00:50
1
NaN
NaN
NaN
NaN
2019-04-25 07:02
1
NaN
123
NaN
NaN
2019-04-25 07:34
1
NaN
158
NaN
NaN
2019-04-25 08:19
1
NaN
NaN
NaN
NaN
Automatically fill missing values
One of the handy functions that Pandas has for working with missing values is the filling function, fillna
. This function takes a number or parameters, for instance, you could pass in a single value which is called a scalar value to change all of the missing data to one value.
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
NaN
109
NaN
NaN
2019-04-25 00:14
0
93.0
NaN
NaN
NaN
2019-04-25 00:29
0
95.0
NaN
NaN
NaN
2019-04-25 00:44
0
103.0
NaN
NaN
NaN
2019-04-25 00:50
1
NaN
NaN
NaN
NaN
Fill by method parameters
The two common fill values are ffill
and bfill
. ffill
is for forward filling and it updates an NaN value for a particular cell with the value from the previous row.
It's important to note that your data needs to be sorted in order for this to have the effect you might want. Data that comes from traditional database management systems usually has no order guarantee, So be careful. First sort the data
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
NaN
109
NaN
NaN
2019-04-25 00:14
0
93.0
109
NaN
NaN
2019-04-25 00:29
0
95.0
109
NaN
NaN
2019-04-25 00:44
0
103.0
109
NaN
NaN
2019-04-25 00:50
1
103.0
109
NaN
NaN
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
93.0
109
NaN
NaN
2019-04-25 00:14
0
93.0
109
NaN
NaN
2019-04-25 00:29
0
95.0
109
NaN
NaN
2019-04-25 00:44
0
103.0
109
NaN
NaN
2019-04-25 00:50
1
103.0
109
NaN
NaN
Imputation
In the example above we filled the data with previous or following data. This might be handy in the case of few missing data. Another strategy might be to fill the missing data with calculated data (average). You could use np.mean
for instance. This can be done easily since when you use statistical functions on DataFrames, these functions typically ignore missing values. For instance if you try and calculate the mean value of a DataFrame, the underlying NumPy function will ignore missing values. This is usually what you want but you should be aware that values are being excluded
.
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
NaN
109
NaN
NaN
2019-04-25 00:50
1
NaN
NaN
NaN
NaN
2019-04-25 07:02
1
NaN
123
NaN
NaN
2019-04-25 07:34
1
NaN
158
NaN
NaN
2019-04-25 08:19
1
NaN
NaN
NaN
NaN
First I have to check if my data is numeric
glucose.scan is not so we have to change this
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
NaN
109.0
NaN
NaN
2019-04-25 00:50
1
NaN
NaN
NaN
NaN
2019-04-25 07:02
1
NaN
123.0
NaN
NaN
2019-04-25 07:34
1
NaN
158.0
NaN
NaN
2019-04-25 08:19
1
NaN
NaN
NaN
NaN
Now we can calculate means and fill the empty data with means
Type
glucose.hist
glucose.scan
time.previous
time.new
Time
2019-04-25 00:08
1
115
109
NaN
NaN
2019-04-25 00:50
1
115
123
NaN
NaN
2019-04-25 07:02
1
115
123
NaN
NaN
2019-04-25 07:34
1
115
158
NaN
NaN
2019-04-25 08:19
1
115
123
NaN
NaN
Filling with sense
Taking the average of the entire dataset is not often a sensible approach. A more sensible approach is taking the daily average or an hourly average. We can estimate these for instance with groupby
or aggregate
methods
Last updated
Was this helpful?