In this example we first generate sales data per business day at several locations over two years. Then we merge the two years into one dataframe. We resample the data to monthly means by end of month frequencies. After that we display the data to compare the average month sales of the two years.
Pandas datetime
pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond resolution. We can demonstrate this by creating a pandas series object with a data range in nanosecond frequency
We see that the pandas series object is of the dtype datetime64[ns] (ns = nanoseconds)
Create date_range
With date_range we can create all sorts of time intervals For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month)
dates = pd.date_range('1/1/2021', periods = 3, freq='BM')
pd.Series(dates)
Let us create some time related data. The data represent sales per business days at several locations
N = 250 #there are about 250 business days in a year
locations=['Assen', 'Groningen', 'Hoogeveen']
dates20 = pd.date_range('1/1/2019', periods=N, freq='B')
sales20 = pd.DataFrame(np.random.randint(7,33 ,size=(N, len(locations))), index=dates20, columns = locations)
#print(sales20.head(3))
dates21 = pd.date_range('1/1/2020', periods=N, freq='B')
sales21 = pd.DataFrame(np.random.randint(10,40 ,size=(N, len(locations))), index=dates21, columns = locations)
#print(sales21.head(3))
#concatenate the two dataframes
sales = pd.concat([sales20, sales21])
sales
Assen
Groningen
Hoogeveen
2019-01-01
31
28
19
2019-01-02
10
16
25
2019-01-03
21
25
22
2019-01-04
19
23
14
2019-01-07
19
23
10
...
...
...
...
2020-12-09
24
15
19
2020-12-10
23
20
26
2020-12-11
16
26
11
2020-12-14
21
35
24
2020-12-15
33
29
32
500 rows × 3 columns
Plot the sales data
We can plot the data just by calling pandas.DataFrame.plot()
#plot the sales of Assen
plt.figure()
sales['Assen'].plot()
This is far away from a table we would like to see. For instance a table below:
Resample
This data is not readable. We should consider Resampling. Resampling is necessary when you’re given a data set recorded in some time interval and you want to change the time interval to something else. For example, aggregate daily numbers into monthly numbers. The syntax
<DataFrame or Series>.resample(arguments).<aggregate function>
month_sales = sales.resample('M').sum() #create a dataframe with the total sales per month
month_sales.tail()
Assen
Groningen
Hoogeveen
2020-08-31
552
492
548
2020-09-30
586
490
475
2020-10-31
599
558
543
2020-11-30
618
466
488
2020-12-31
255
263
290
plt.figure()
month_sales['Assen'].plot()
We can also resample to evaluate a part of the dataset. For instance we could get the mean value of Assen en Hoogeveen combined
# get mean of Drenthe sales
month_sales_drenthe = sales.resample('M').mean().eval('Assen+Hoogeveen')
print(month_sales_drenthe.head())
#create a dataframe from the series
msd = pd.DataFrame({'date':month_sales_drenthe.index, 'average':month_sales_drenthe.values}).set_index('date')
print(msd.head())
2019-01-31 37.173913
2019-02-28 36.550000
2019-03-31 36.857143
2019-04-30 38.136364
2019-05-31 40.521739
Freq: M, dtype: float64
average
date
2019-01-31 37.173913
2019-02-28 36.550000
2019-03-31 36.857143
2019-04-30 38.136364
2019-05-31 40.521739
msd.plot(kind = 'bar')
plt.tight_layout()
Locators and Formatters
This is not the kind of plot we want. Remember we can access the objects of the figure. The two relevant classes are Locators and Formatters. Locators determine where the ticks are, and formatters control the formatting of tick labels.
from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
plt.figure(figsize=(8, 6))
#ax = plt.gca(), figure = plt.gcf()
#adjust titles and labels
plt.title('Verkoop in Drenthe')
plt.ylabel('gemiddeld aantallen verkocht')
#adjust thick labels
plt.gca().xaxis.set_minor_formatter(DateFormatter('%B')) #display name of month
plt.gca().xaxis.set_minor_locator(MonthLocator(interval=1, bymonthday=-1)) #every months end of month
plt.gca().xaxis.set_major_formatter(DateFormatter('\n\n\n\n\n%Y')) #display year
plt.gca().xaxis.set_major_locator(YearLocator())
#plot
plt.bar(x = msd.index, height = msd['average'], width = 15, color = 'grey')
# adjust limit
plt.gca().set_xlim(pd.Timestamp('2019-01-01'), pd.Timestamp('2020-12-31'))
# auto rotate
plt.gcf().autofmt_xdate(rotation=90, which = 'minor', ha="center")
plt.show()
However, we want to compare the different dates over the year per year. Let's pivot the table
#change numbers to month name
def format_months(i):
l = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
return l[i-1]
df_sales['month'] = df_sales['month'].apply(format_months)
print(df_sales.head(3))
year month 2019 2020
0 Jan 37.173913 46.434783
1 Feb 36.550000 53.050000
2 Mar 36.857143 51.636364
#We now have a tidy table we can plot
df_sales.plot(x = 'month',
y=[2019, 2020],
kind = 'bar',
color = ['lightgrey', 'grey'],
ylabel='gemiddeld aantallen verkocht',
title='Verkoop in Drenthe',
figsize=(8, 6))