Cleaning data

Cleaning typo's

Suppose we want to sum the data in a column. We could use a sum, but we get a type error, since '53,2' is a string. In this case, we cannot simply transfer the data to a float. We need first to set the column as a string type, replace the ',' by a '.' (by a regex) and then transfer the column to float type

df['column'] = df['column'].replace(r',', '.', regex=True) 
df['column'] = pd.to_numeric(df['column'])
print(df.dtypes)

Convert values

Sometimes we do not want to use the 0 or 1 for a category but more meaningful values. This is especially handy in the graphical analysis. With the method map() we can change values easily

df['sex'] = df['sex'].astype('category') # make the format categorical
df['sex'] = df['sex'].map({0:"Woman", 1: "Man"}) # map the values to the category

Mind you that it is very pythonic to sequence the commands

df['sex'] = df['sex'].astype('category').map({0:"Woman", 1: "Man"})

Convert units

Pandas builts upon Numpy and the pandas DataFrames can deal with vectorized manipulations. The regex replace example above is an example of a vectorized manipulation. In the same way, we can convert the values of an entire column or DataFrame.

df['Energy Supply']*= 1000000 #Convert Energy Supply to gigajoules 

Rename columns

Data columns should describe as best as possible the content of the columns. For that purpose, we can rename columns by means of a dictionary

.rename(columns = {'Unnamed: 1':'time',
                   'Unnamed: 2':'glucose_level'})

Drop rows or columns

Sometimes we want to drop columns because they are considered useless for the analysis. Often these columns are indexes or IDs. Maybe we want to drop the column because there is too much data missing or of poor quality.

.drop(['column1', 'column2'], axis = 1) #drop column
.dropna(axis = 0) #drop rows with NaNs

Select/slice by value or by index

#by value
df.loc['fenna'] # returns all the rows with index value 'fenna'
df[df['age'] == 50] # returns all records with age equals 50
df[(df['platelets'] > 100) | (df['diabetes'] > 0)] # all platelats > 100 or diabetes

#by column name
df.loc[:,['Name1', 'Name2']] # returns all rows but only column Name1 and Name2
df['Name1', 'Name2'] # returns all rows but only column Name1 and Name2

#by index
df.iloc[0] # returns first row
df.iloc[0:12] # returns rows 1 till 11
df.iloc[:, [1, 3]] # returns all rows and columns index 1 and 3
df.iloc[2, [1, 3]] # returns row with index 2 and columns index 1 and 3

The df[df['columnname'] == condition]] construct is to select all the records containing all the columns with that condition and not only the column name column.

Please consult the documentation, cheat sheet, or the tutorial for more information

Cleaning data example

In the example below all comes together. We have to read the data, deal with changing units, NaNs and useless information. The file we consider is the Energy Indicators file.

df = pd.read_excel('data/Energy Indicators.xls') # source: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
df.head(20)

Unnamed: 0

Unnamed: 1

Unnamed: 2

Unnamed: 3

Unnamed: 4

Unnamed: 5

0

NaN

NaN

Environmental Indicators: Energy

NaN

NaN

NaN

1

NaN

NaN

NaN

NaN

NaN

NaN

2

NaN

NaN

Energy Supply and Renewable Electricity Produc...

NaN

NaN

NaN

3

NaN

NaN

NaN

NaN

NaN

NaN

4

NaN

NaN

NaN

NaN

NaN

Last update: December 2015

5

NaN

NaN

NaN

NaN

NaN

NaN

6

NaN

NaN

Choose a country from the following drop-down ...

NaN

NaN

Andorra

7

NaN

NaN

NaN

NaN

NaN

NaN

8

NaN

NaN

Country

Energy Supply

Energy Supply per capita

Renewable Electricity Production

9

NaN

NaN

NaN

Petajoules

Gigajoules

%

10

NaN

NaN

Andorra

9

121

88.6957

11

NaN

NaN

NaN

NaN

NaN

NaN

12

NaN

NaN

NaN

NaN

NaN

website: http://unstats.un.org/unsd/ENVIRONMEN...

13

NaN

NaN

NaN

NaN

NaN

NaN

14

NaN

NaN

NaN

NaN

NaN

NaN

15

NaN

NaN

NaN

Energy Supply

Energy Supply per capita

Renewable Electricity Production

16

NaN

NaN

NaN

Petajoules

Gigajoules

%

17

NaN

Afghanistan

Afghanistan

321

10

78.6693

18

NaN

Albania

Albania

102

35

100

19

NaN

Algeria

Algeria

1959

51

0.55101

In the example above we can see that data of the file 'Energy indicators' need to be cleaned. The first two columns do not have added value, they can be removed. The column labels need to be replaced. The first 15 rows are headers basically, we can skip these. The Energy Supply column is converted to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "...") we need to fill that with np.NaN values. Lastly, some of the country names are renamed for merge purpose (source: University of Michigan)

def clean_energy_data():
    import pandas as pd
    import numpy as np

    # Clean energy 
    energy = (pd.read_excel('data/Energy Indicators.xls') # load Energy Indicator excelsheet into dataframe
                    .iloc[17:243].drop(['Unnamed: 0', 'Unnamed: 1'], axis=1) #drop first 16 rows
                    .rename(columns = {
                         'Unnamed: 2':'Country',
                         'Unnamed: 3':'Energy Supply',
                         'Unnamed: 4':'Energy Supply per Capita',
                         'Unnamed: 5':'% Renewable'})
                    .replace(to_replace='...', value=np.NaN)) #convert '...' to np.NaN

    energy['Energy Supply']*= 1000000 #Convert Energy Supply to gigajoules 
    energy['Country'] = (energy['Country'].str.replace(r"Republic of Korea","South Korea")
                                          .str.replace(r"United States of America","United States")
                                          .str.replace(r"United Kingdom of Great Britain and Northern Ireland", "United Kingdom")
                                          .str.replace(r"China, Hong Kong Special Administrative Region", "Hong Kong")
                                          .str.replace(r"\s\([a-zA-Z\s]+\)", "") # remove parenthesis from countrynames
                                          .str.replace(r"[0-9]+", ""))# remove numbers from countryname

    return energy


df_energy = clean_energy_data()
df_energy.head()

Country

Energy Supply

Energy Supply per Capita

% Renewable

17

Afghanistan

3.210000e+08

10.0

78.669280

18

Albania

1.020000e+08

35.0

100.000000

19

Algeria

1.959000e+09

51.0

0.551010

20

American Samoa

NaN

NaN

0.641026

21

Andorra

9.000000e+06

121.0

88.695650

Last updated

Was this helpful?