DateTime wrangling
Wrangling with DateTime object
In this part of the tutorial, we look at the usage of the following date time methods and attributes
pd.to_datetime() #change datatype to datetime
pd.Timestamp.now() #fetch current time
df['column'].dt.dayofyear #get the day of the year from a date
df.index.month #get the month of the index (which is a datetime)
df.groupby([df.index.month]) #group by monthPandas has datetime objects with many handy attributes
t = pd.Timestamp.now()
print(t.year)
print(t.month)
print(t.day)
print(t.hour)
print(t.minute)
print(t.second)2020
10
18
18
2
41For demonstration purpose, we load a file with DateTime information
file = 'FSIS-Recall-Summary-2014'
df = pd.read_csv('data/{}.csv'.format(file), skiprows=1)
df.dropna()
df.head()Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
0
Jan 10 2014
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33,840
1
Jan 13 2014
002-2014
I
Various Beef Products
Other
42,103
2
Jan 15 2014
003-2014
I
Beef Franks
Undeclared Allergen
2,664
3
Jan 17 2014
004-2014
II
Beef and Pork Products
Undeclared Allergen
130,000
4
Jan 17 2014
005-2014
I
Spiral Hams
Listeria monocytogenes
67,113
The Recall Date column looks like a Date Time format but it is not. We can simply transform it to a datetime format with .to_datetime method
print(df.dtypes)Recall Date object
Recall Number object
Recall Class object
Product object
Reason for Recall object
Pounds Recalled object
dtype: objectdf['Recall Date'] = pd.to_datetime(df['Recall Date'])print(df.dtypes)Recall Date datetime64[ns]
Recall Number object
Recall Class object
Product object
Reason for Recall object
Pounds Recalled object
dtype: objectSince we would like to calculate with Pounds Recalled down the road we transfer this to a float
#clean Pounds Recalled
df['Pounds Recalled'] = df['Pounds Recalled'].astype(str).str.replace(r",", "", regex=True).astype(float)df.tail()Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
90
2014-12-27
091-2014
I
Chicken and Beef Products
Undeclared Allergen
55813.0
91
2014-12-29
092-2014
I
Beef and Pork Products
Undeclared Allergen
38400.0
92
2014-12-31
093-2014
I
Chicken Products
Undeclared Allergen
5300.0
93
2014-12-31
094-2014
I
Canned Soup Products
Undeclared Allergen
4474.0
94
NaT
NaN
NaN
NaN
NaN
NaN
Let us remove the last record since it contains NaN information
df = df[:-1]We can add the day of year by using the attribute dayofyear
df['dayofyear'] = df['Recall Date'].dt.dayofyear
df.head()Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
0
2014-01-10
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33840.0
4
10
1
2014-01-13
002-2014
I
Various Beef Products
Other
42103.0
0
13
2
2014-01-15
003-2014
I
Beef Franks
Undeclared Allergen
2664.0
2
15
3
2014-01-17
004-2014
II
Beef and Pork Products
Undeclared Allergen
130000.0
4
17
4
2014-01-17
005-2014
I
Spiral Hams
Listeria monocytogenes
67113.0
4
17
If we change the index to the DateTime column we can use the index.month, index.day, index.year attributes derived from the DateTime object. In the example below, we first set the index to ddRecaland then select the month February
#Set data as index
df = df.rename(columns = {'Recall Date': 'ddRecal'})
df.index = df['ddRecal']
df = df.drop(['ddRecal'], axis = 1)
#select month 2
df_feb = df[(df.index.month == 2)]
df_febRecall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
ddRecal
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
2014-02-04
010-2014
I
Beef Products
E. coli O103, O111, O121, O145, O26, O45
15865.0
1
35
2014-02-04
011-2014
I
Chicken Noodle Soup
Undeclared Allergen
54673.0
1
35
2014-02-08
012-2014
I
Beef Jerky
Processing Defect
365.0
5
39
2014-02-08
013-2014
I
Various Beef Products
Other
8742700.0
5
39
2014-02-11
014-2014
II
Beef Jerky Products
Undeclared Allergen
90000.0
1
42
2014-02-15
015-2014
I
Dried Egg Products
Salmonella
309594.0
5
46
2014-02-22
016-2014
II
Various Meat Products
Other
8895.0
5
53
2014-02-26
017-2014
I
Liver Pâté
Other
1282.0
2
57
#select first of all months
df_01 = df[(df.index.day == 1)]
df_01
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
ddRecal
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
2014-04-01
019-2014
I
Chicken Products
Undeclared Allergen
8730.0
1
91
The month attribute is also handy when we want to find the month minimum value of a column.
#create a new column with month minimum of Pounds Recalled
df["min.recall"] = df.groupby([df.index.month])["Pounds Recalled"].transform('min')
df.head(10)Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
min.recall
ddRecal
2014-01-10
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33840.0
4
10
2446.0
2014-01-13
002-2014
I
Various Beef Products
Other
42103.0
0
13
2446.0
2014-01-15
003-2014
I
Beef Franks
Undeclared Allergen
2664.0
2
15
2446.0
2014-01-17
004-2014
II
Beef and Pork Products
Undeclared Allergen
130000.0
4
17
2446.0
2014-01-17
005-2014
I
Spiral Hams
Listeria monocytogenes
67113.0
4
17
2446.0
2014-01-19
006-2014
II
Cheeseburger Mac Products
Undeclared Allergen
1770000.0
6
19
2446.0
2014-01-27
007-2014
II
Frozen Chicken Products
Undeclared Allergen
1250000.0
0
27
2446.0
2014-01-30
008-2014
I
Chicken Salad Products
Undeclared Allergen
2446.0
3
30
2446.0
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
365.0
2014-02-04
010-2014
I
Beef Products
E. coli O103, O111, O121, O145, O26, O45
15865.0
1
35
365.0
Last updated
Was this helpful?
