DateTime wrangling

Wrangling with DateTime object

In this part of the tutorial, we look at the usage of the following date time methods and attributes

pd.to_datetime() #change datatype to datetime
pd.Timestamp.now() #fetch current time
df['column'].dt.dayofyear #get the day of the year from a date
df.index.month #get the month of the index (which is a datetime)
df.groupby([df.index.month]) #group by month

Pandas has datetime objects with many handy attributes

t = pd.Timestamp.now() 
print(t.year)
print(t.month) 
print(t.day)
print(t.hour)
print(t.minute)
print(t.second)
2020
10
18
18
2
41

For demonstration purpose, we load a file with DateTime information

file = 'FSIS-Recall-Summary-2014'
df = pd.read_csv('data/{}.csv'.format(file), skiprows=1)
df.dropna()
df.head()

Recall Date

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

0

Jan 10 2014

001-2014

I

Mechanically Separated Chicken Products

Salmonella

33,840

1

Jan 13 2014

002-2014

I

Various Beef Products

Other

42,103

2

Jan 15 2014

003-2014

I

Beef Franks

Undeclared Allergen

2,664

3

Jan 17 2014

004-2014

II

Beef and Pork Products

Undeclared Allergen

130,000

4

Jan 17 2014

005-2014

I

Spiral Hams

Listeria monocytogenes

67,113

The Recall Date column looks like a Date Time format but it is not. We can simply transform it to a datetime format with .to_datetime method

print(df.dtypes)
Recall Date          object
Recall Number        object
Recall Class         object
Product              object
Reason for Recall    object
Pounds Recalled      object
dtype: object
df['Recall Date'] = pd.to_datetime(df['Recall Date'])
print(df.dtypes)
Recall Date          datetime64[ns]
Recall Number                object
Recall Class                 object
Product                      object
Reason for Recall            object
Pounds Recalled              object
dtype: object

Since we would like to calculate with Pounds Recalled down the road we transfer this to a float

#clean Pounds Recalled
df['Pounds Recalled'] = df['Pounds Recalled'].astype(str).str.replace(r",", "", regex=True).astype(float)
df.tail()

Recall Date

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

90

2014-12-27

091-2014

I

Chicken and Beef Products

Undeclared Allergen

55813.0

91

2014-12-29

092-2014

I

Beef and Pork Products

Undeclared Allergen

38400.0

92

2014-12-31

093-2014

I

Chicken Products

Undeclared Allergen

5300.0

93

2014-12-31

094-2014

I

Canned Soup Products

Undeclared Allergen

4474.0

94

NaT

NaN

NaN

NaN

NaN

NaN

Let us remove the last record since it contains NaN information

df = df[:-1]

We can add the day of year by using the attribute dayofyear

df['dayofyear'] = df['Recall Date'].dt.dayofyear
df.head()

Recall Date

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

day

dayofyear

0

2014-01-10

001-2014

I

Mechanically Separated Chicken Products

Salmonella

33840.0

4

10

1

2014-01-13

002-2014

I

Various Beef Products

Other

42103.0

0

13

2

2014-01-15

003-2014

I

Beef Franks

Undeclared Allergen

2664.0

2

15

3

2014-01-17

004-2014

II

Beef and Pork Products

Undeclared Allergen

130000.0

4

17

4

2014-01-17

005-2014

I

Spiral Hams

Listeria monocytogenes

67113.0

4

17

If we change the index to the DateTime column we can use the index.month, index.day, index.year attributes derived from the DateTime object. In the example below, we first set the index to ddRecaland then select the month February

#Set data as index
df = df.rename(columns = {'Recall Date': 'ddRecal'})
df.index = df['ddRecal']
df = df.drop(['ddRecal'], axis = 1)

#select month 2
df_feb = df[(df.index.month == 2)] 
df_feb

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

day

dayofyear

ddRecal

2014-02-01

009-2014

II

Various Meat and Poultry Products

Undeclared Allergen

144000.0

5

32

2014-02-04

010-2014

I

Beef Products

E. coli O103, O111, O121, O145, O26, O45

15865.0

1

35

2014-02-04

011-2014

I

Chicken Noodle Soup

Undeclared Allergen

54673.0

1

35

2014-02-08

012-2014

I

Beef Jerky

Processing Defect

365.0

5

39

2014-02-08

013-2014

I

Various Beef Products

Other

8742700.0

5

39

2014-02-11

014-2014

II

Beef Jerky Products

Undeclared Allergen

90000.0

1

42

2014-02-15

015-2014

I

Dried Egg Products

Salmonella

309594.0

5

46

2014-02-22

016-2014

II

Various Meat Products

Other

8895.0

5

53

2014-02-26

017-2014

I

Liver Pâté

Other

1282.0

2

57

#select first of all months
df_01 = df[(df.index.day == 1)] 
df_01

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

day

dayofyear

ddRecal

2014-02-01

009-2014

II

Various Meat and Poultry Products

Undeclared Allergen

144000.0

5

32

2014-04-01

019-2014

I

Chicken Products

Undeclared Allergen

8730.0

1

91

The month attribute is also handy when we want to find the month minimum value of a column.

#create a new column with month minimum of Pounds Recalled
df["min.recall"] = df.groupby([df.index.month])["Pounds Recalled"].transform('min')
df.head(10)

Recall Number

Recall Class

Product

Reason for Recall

Pounds Recalled

day

dayofyear

min.recall

ddRecal

2014-01-10

001-2014

I

Mechanically Separated Chicken Products

Salmonella

33840.0

4

10

2446.0

2014-01-13

002-2014

I

Various Beef Products

Other

42103.0

0

13

2446.0

2014-01-15

003-2014

I

Beef Franks

Undeclared Allergen

2664.0

2

15

2446.0

2014-01-17

004-2014

II

Beef and Pork Products

Undeclared Allergen

130000.0

4

17

2446.0

2014-01-17

005-2014

I

Spiral Hams

Listeria monocytogenes

67113.0

4

17

2446.0

2014-01-19

006-2014

II

Cheeseburger Mac Products

Undeclared Allergen

1770000.0

6

19

2446.0

2014-01-27

007-2014

II

Frozen Chicken Products

Undeclared Allergen

1250000.0

0

27

2446.0

2014-01-30

008-2014

I

Chicken Salad Products

Undeclared Allergen

2446.0

3

30

2446.0

2014-02-01

009-2014

II

Various Meat and Poultry Products

Undeclared Allergen

144000.0

5

32

365.0

2014-02-04

010-2014

I

Beef Products

E. coli O103, O111, O121, O145, O26, O45

15865.0

1

35

365.0

Last updated

Was this helpful?