Cleaning data
Cleaning typo's
Suppose we want to sum the data in a column. We could use a sum, but we get a type error, since '53,2' is a string. In this case, we cannot simply transfer the data to a float. We need first to set the column as a string type, replace the ',' by a '.' (by a regex) and then transfer the column to float type
Convert values
Sometimes we do not want to use the 0 or 1 for a category but more meaningful values. This is especially handy in the graphical analysis. With the method map()
we can change values easily
Convert units
Pandas builts upon Numpy and the pandas DataFrames can deal with vectorized manipulations. The regex replace example above is an example of a vectorized manipulation. In the same way, we can convert the values of an entire column or DataFrame.
Rename columns
Data columns should describe as best as possible the content of the columns. For that purpose, we can rename columns by means of a dictionary
Drop rows or columns
Sometimes we want to drop columns because they are considered useless for the analysis. Often these columns are indexes or IDs. Maybe we want to drop the column because there is too much data missing or of poor quality.
Select/slice by value or by index
The df[df['columnname'] == condition]]
construct is to select all the records containing all the columns with that condition and not only the column name column.
Cleaning data example
In the example below all comes together. We have to read the data, deal with changing units, NaNs and useless information. The file we consider is the Energy Indicators file.
Unnamed: 0
Unnamed: 1
Unnamed: 2
Unnamed: 3
Unnamed: 4
Unnamed: 5
0
NaN
NaN
Environmental Indicators: Energy
NaN
NaN
NaN
1
NaN
NaN
NaN
NaN
NaN
NaN
2
NaN
NaN
Energy Supply and Renewable Electricity Produc...
NaN
NaN
NaN
3
NaN
NaN
NaN
NaN
NaN
NaN
4
NaN
NaN
NaN
NaN
NaN
Last update: December 2015
5
NaN
NaN
NaN
NaN
NaN
NaN
6
NaN
NaN
Choose a country from the following drop-down ...
NaN
NaN
Andorra
7
NaN
NaN
NaN
NaN
NaN
NaN
8
NaN
NaN
Country
Energy Supply
Energy Supply per capita
Renewable Electricity Production
9
NaN
NaN
NaN
Petajoules
Gigajoules
%
10
NaN
NaN
Andorra
9
121
88.6957
11
NaN
NaN
NaN
NaN
NaN
NaN
12
NaN
NaN
NaN
NaN
NaN
website: http://unstats.un.org/unsd/ENVIRONMEN...
13
NaN
NaN
NaN
NaN
NaN
NaN
14
NaN
NaN
NaN
NaN
NaN
NaN
15
NaN
NaN
NaN
Energy Supply
Energy Supply per capita
Renewable Electricity Production
16
NaN
NaN
NaN
Petajoules
Gigajoules
%
17
NaN
Afghanistan
Afghanistan
321
10
78.6693
18
NaN
Albania
Albania
102
35
100
19
NaN
Algeria
Algeria
1959
51
0.55101
In the example above we can see that data of the file 'Energy indicators' need to be cleaned. The first two columns do not have added value, they can be removed. The column labels need to be replaced. The first 15 rows are headers basically, we can skip these. The Energy Supply
column is converted to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "...") we need to fill that with np.NaN
values. Lastly, some of the country names are renamed for merge purpose (source: University of Michigan)
Country
Energy Supply
Energy Supply per Capita
% Renewable
17
Afghanistan
3.210000e+08
10.0
78.669280
18
Albania
1.020000e+08
35.0
100.000000
19
Algeria
1.959000e+09
51.0
0.551010
20
American Samoa
NaN
NaN
0.641026
21
Andorra
9.000000e+06
121.0
88.695650
Last updated
Was this helpful?