Montly sales data

In this example we first generate sales data per business day at several locations over two years. Then we merge the two years into one dataframe. We resample the data to monthly means by end of month frequencies. After that we display the data to compare the average month sales of the two years.

Pandas datetime

pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond resolution. We can demonstrate this by creating a pandas series object with a data range in nanosecond frequency

#pd.Series(pd.date_range('2021-07-01', periods=3, freq='D')) # day
#pd.Series(pd.date_range('2021-07-01', periods=3, freq='H')) # hour
pd.Series(pd.date_range('2021-07-01', periods=3, freq='N')) # nanosecond
0   2021-07-01 00:00:00.000000000
1   2021-07-01 00:00:00.000000001
2   2021-07-01 00:00:00.000000002
dtype: datetime64[ns]

We see that the pandas series object is of the dtype datetime64[ns] (ns = nanoseconds)

Create date_range

With date_range we can create all sorts of time intervals For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month)

dates = pd.date_range('1/1/2021', periods = 3, freq='BM')
pd.Series(dates)
0   2021-01-29
1   2021-02-26
2   2021-03-31
dtype: datetime64[ns]

Another example is a 2hour period

dates = pd.date_range('1/1/2021', periods = 3, freq='2h')
pd.Series(dates)
0   2021-01-01 00:00:00
1   2021-01-01 02:00:00
2   2021-01-01 04:00:00
dtype: datetime64[ns]

In the https://pandas.pydata.org/docs/reference/offset_frequency.html you can find more about frequencies and in the documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html you can read all the methods of this datetime object.

Let us create some time related data. The data represent sales per business days at several locations

N = 250 #there are about 250 business days in a year
locations=['Assen', 'Groningen', 'Hoogeveen']
dates20 = pd.date_range('1/1/2019', periods=N, freq='B')
sales20 = pd.DataFrame(np.random.randint(7,33 ,size=(N, len(locations))), index=dates20, columns = locations)
#print(sales20.head(3))
dates21 = pd.date_range('1/1/2020', periods=N, freq='B')
sales21 = pd.DataFrame(np.random.randint(10,40 ,size=(N, len(locations))), index=dates21, columns = locations)
#print(sales21.head(3))
#concatenate the two dataframes
sales = pd.concat([sales20, sales21])
sales

Assen

Groningen

Hoogeveen

2019-01-01

31

28

19

2019-01-02

10

16

25

2019-01-03

21

25

22

2019-01-04

19

23

14

2019-01-07

19

23

10

...

...

...

...

2020-12-09

24

15

19

2020-12-10

23

20

26

2020-12-11

16

26

11

2020-12-14

21

35

24

2020-12-15

33

29

32

500 rows × 3 columns

Plot the sales data

We can plot the data just by calling pandas.DataFrame.plot()

#plot the sales of Assen
plt.figure()
sales['Assen'].plot()

This is far away from a table we would like to see. For instance a table below:

Resample

This data is not readable. We should consider Resampling. Resampling is necessary when you’re given a data set recorded in some time interval and you want to change the time interval to something else. For example, aggregate daily numbers into monthly numbers. The syntax

<DataFrame or Series>.resample(arguments).<aggregate function>
month_sales = sales.resample('M').sum() #create a dataframe with the total sales per month
month_sales.tail()

Assen

Groningen

Hoogeveen

2020-08-31

552

492

548

2020-09-30

586

490

475

2020-10-31

599

558

543

2020-11-30

618

466

488

2020-12-31

255

263

290

plt.figure()
month_sales['Assen'].plot()

We can also resample to evaluate a part of the dataset. For instance we could get the mean value of Assen en Hoogeveen combined

# get mean of Drenthe sales
month_sales_drenthe = sales.resample('M').mean().eval('Assen+Hoogeveen')
print(month_sales_drenthe.head())
#create a dataframe from the series
msd = pd.DataFrame({'date':month_sales_drenthe.index, 'average':month_sales_drenthe.values}).set_index('date')
print(msd.head())
2019-01-31    37.173913
2019-02-28    36.550000
2019-03-31    36.857143
2019-04-30    38.136364
2019-05-31    40.521739
Freq: M, dtype: float64
              average
date                 
2019-01-31  37.173913
2019-02-28  36.550000
2019-03-31  36.857143
2019-04-30  38.136364
2019-05-31  40.521739
msd.plot(kind = 'bar')
plt.tight_layout()

Locators and Formatters

This is not the kind of plot we want. Remember we can access the objects of the figure. The two relevant classes are Locators and Formatters. Locators determine where the ticks are, and formatters control the formatting of tick labels.

from matplotlib.dates import MonthLocator, YearLocator, DateFormatter
plt.figure(figsize=(8, 6))
#ax = plt.gca(), figure = plt.gcf()

#adjust titles and labels
plt.title('Verkoop in Drenthe')
plt.ylabel('gemiddeld aantallen verkocht')

#adjust thick labels
plt.gca().xaxis.set_minor_formatter(DateFormatter('%B')) #display name of month
plt.gca().xaxis.set_minor_locator(MonthLocator(interval=1, bymonthday=-1)) #every months end of month
plt.gca().xaxis.set_major_formatter(DateFormatter('\n\n\n\n\n%Y')) #display year
plt.gca().xaxis.set_major_locator(YearLocator()) 
#plot
plt.bar(x = msd.index, height = msd['average'], width = 15, color = 'grey')
# adjust limit
plt.gca().set_xlim(pd.Timestamp('2019-01-01'), pd.Timestamp('2020-12-31'))
# auto rotate
plt.gcf().autofmt_xdate(rotation=90, which = 'minor', ha="center")
plt.show()

However, we want to compare the different dates over the year per year. Let's pivot the table

msd = msd.reset_index().rename(columns={'date':'date_end'})
msd['year'] = msd.date_end.dt.year
msd['month'] = msd.date_end.dt.month
df_sales = msd.pivot(index='month', columns='year', values='average')
df_sales

year

2019

2020

month

1

37.173913

46.434783

2

36.550000

53.050000

3

36.857143

51.636364

4

38.136364

52.454545

5

40.521739

46.714286

6

36.050000

54.136364

7

39.608696

51.304348

8

39.363636

52.380952

9

45.428571

48.227273

10

34.478261

51.909091

11

37.904762

52.666667

12

40.454545

49.545455

#reset index to get tidy table
df_sales = df_sales.reset_index()
print(df_sales.columns)
Index(['month', 2019, 2020], dtype='object', name='year')
#change numbers to month name
def format_months(i):
    l = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    return l[i-1]

df_sales['month'] = df_sales['month'].apply(format_months)
print(df_sales.head(3))
year month       2019       2020
0      Jan  37.173913  46.434783
1      Feb  36.550000  53.050000
2      Mar  36.857143  51.636364
#We now have a tidy table we can plot
df_sales.plot(x = 'month', 
              y=[2019, 2020], 
              kind = 'bar', 
              color = ['lightgrey', 'grey'], 
              ylabel='gemiddeld aantallen verkocht', 
              title='Verkoop in Drenthe', 
              figsize=(8, 6))

Last updated

Was this helpful?