DateTime wrangling
Wrangling with DateTime object
In this part of the tutorial, we look at the usage of the following date time methods and attributes
pd.to_datetime() #change datatype to datetime
pd.Timestamp.now() #fetch current time
df['column'].dt.dayofyear #get the day of the year from a date
df.index.month #get the month of the index (which is a datetime)
df.groupby([df.index.month]) #group by month
Pandas has datetime objects with many handy attributes
t = pd.Timestamp.now()
print(t.year)
print(t.month)
print(t.day)
print(t.hour)
print(t.minute)
print(t.second)
2020
10
18
18
2
41
For demonstration purpose, we load a file with DateTime information
file = 'FSIS-Recall-Summary-2014'
df = pd.read_csv('data/{}.csv'.format(file), skiprows=1)
df.dropna()
df.head()
Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
0
Jan 10 2014
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33,840
1
Jan 13 2014
002-2014
I
Various Beef Products
Other
42,103
2
Jan 15 2014
003-2014
I
Beef Franks
Undeclared Allergen
2,664
3
Jan 17 2014
004-2014
II
Beef and Pork Products
Undeclared Allergen
130,000
4
Jan 17 2014
005-2014
I
Spiral Hams
Listeria monocytogenes
67,113
The Recall Date
column looks like a Date Time format but it is not. We can simply transform it to a datetime format with .to_datetime
method
print(df.dtypes)
Recall Date object
Recall Number object
Recall Class object
Product object
Reason for Recall object
Pounds Recalled object
dtype: object
df['Recall Date'] = pd.to_datetime(df['Recall Date'])
print(df.dtypes)
Recall Date datetime64[ns]
Recall Number object
Recall Class object
Product object
Reason for Recall object
Pounds Recalled object
dtype: object
Since we would like to calculate with Pounds Recalled down the road we transfer this to a float
#clean Pounds Recalled
df['Pounds Recalled'] = df['Pounds Recalled'].astype(str).str.replace(r",", "", regex=True).astype(float)
df.tail()
Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
90
2014-12-27
091-2014
I
Chicken and Beef Products
Undeclared Allergen
55813.0
91
2014-12-29
092-2014
I
Beef and Pork Products
Undeclared Allergen
38400.0
92
2014-12-31
093-2014
I
Chicken Products
Undeclared Allergen
5300.0
93
2014-12-31
094-2014
I
Canned Soup Products
Undeclared Allergen
4474.0
94
NaT
NaN
NaN
NaN
NaN
NaN
Let us remove the last record since it contains NaN information
df = df[:-1]
We can add the day of year by using the attribute dayofyear
df['dayofyear'] = df['Recall Date'].dt.dayofyear
df.head()
Recall Date
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
0
2014-01-10
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33840.0
4
10
1
2014-01-13
002-2014
I
Various Beef Products
Other
42103.0
0
13
2
2014-01-15
003-2014
I
Beef Franks
Undeclared Allergen
2664.0
2
15
3
2014-01-17
004-2014
II
Beef and Pork Products
Undeclared Allergen
130000.0
4
17
4
2014-01-17
005-2014
I
Spiral Hams
Listeria monocytogenes
67113.0
4
17
If we change the index to the DateTime column we can use the index.month
, index.day
, index.year
attributes derived from the DateTime object. In the example below, we first set the index to ddRecal
and then select the month February
#Set data as index
df = df.rename(columns = {'Recall Date': 'ddRecal'})
df.index = df['ddRecal']
df = df.drop(['ddRecal'], axis = 1)
#select month 2
df_feb = df[(df.index.month == 2)]
df_feb
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
ddRecal
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
2014-02-04
010-2014
I
Beef Products
E. coli O103, O111, O121, O145, O26, O45
15865.0
1
35
2014-02-04
011-2014
I
Chicken Noodle Soup
Undeclared Allergen
54673.0
1
35
2014-02-08
012-2014
I
Beef Jerky
Processing Defect
365.0
5
39
2014-02-08
013-2014
I
Various Beef Products
Other
8742700.0
5
39
2014-02-11
014-2014
II
Beef Jerky Products
Undeclared Allergen
90000.0
1
42
2014-02-15
015-2014
I
Dried Egg Products
Salmonella
309594.0
5
46
2014-02-22
016-2014
II
Various Meat Products
Other
8895.0
5
53
2014-02-26
017-2014
I
Liver Pâté
Other
1282.0
2
57
#select first of all months
df_01 = df[(df.index.day == 1)]
df_01
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
ddRecal
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
2014-04-01
019-2014
I
Chicken Products
Undeclared Allergen
8730.0
1
91
The month attribute is also handy when we want to find the month minimum value of a column.
#create a new column with month minimum of Pounds Recalled
df["min.recall"] = df.groupby([df.index.month])["Pounds Recalled"].transform('min')
df.head(10)
Recall Number
Recall Class
Product
Reason for Recall
Pounds Recalled
day
dayofyear
min.recall
ddRecal
2014-01-10
001-2014
I
Mechanically Separated Chicken Products
Salmonella
33840.0
4
10
2446.0
2014-01-13
002-2014
I
Various Beef Products
Other
42103.0
0
13
2446.0
2014-01-15
003-2014
I
Beef Franks
Undeclared Allergen
2664.0
2
15
2446.0
2014-01-17
004-2014
II
Beef and Pork Products
Undeclared Allergen
130000.0
4
17
2446.0
2014-01-17
005-2014
I
Spiral Hams
Listeria monocytogenes
67113.0
4
17
2446.0
2014-01-19
006-2014
II
Cheeseburger Mac Products
Undeclared Allergen
1770000.0
6
19
2446.0
2014-01-27
007-2014
II
Frozen Chicken Products
Undeclared Allergen
1250000.0
0
27
2446.0
2014-01-30
008-2014
I
Chicken Salad Products
Undeclared Allergen
2446.0
3
30
2446.0
2014-02-01
009-2014
II
Various Meat and Poultry Products
Undeclared Allergen
144000.0
5
32
365.0
2014-02-04
010-2014
I
Beef Products
E. coli O103, O111, O121, O145, O26, O45
15865.0
1
35
365.0
Last updated
Was this helpful?