Statistics in Python

This is a brief overview of statistics in Python. In data science we routinely inspect our data using descriptive statistics and descriptive plots. The same statistics can also feed visualisations or dashboards. The statistical analysis itself can be done with a number of tests, depending on the characteristics of the data and the research question to be answered. This notebook covers the most important ones:

  • Practical: Descriptive statistics

  • Graphical: Descriptive plots

  • Analytical: Statistical analysis

#import libraries
import numpy as np
import pandas as pd
#import scipy 
#import statsmodels

Descriptive statistics

Let us create some data for demonstration purposes. We put the data in a pandas DataFrame because pandas has convenient NumPy-backed methods built in, such as mean(), sum(), max() and min(). It can even deliver the descriptive statistics in one call with describe().

# measurements and their weights
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
X = pd.DataFrame({'measurement': x, 'weights': w})
# mean computed by hand, then with pandas
print((1 + 2.5 + 4 + 8 + 28) / 5)
print(X.measurement.mean())
8.7
8.7
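The summary tables that follow look like describe() output, but the calls that produced them are not shown in this text. The sketch below is one way to obtain them; the transpose-and-round expression for the compact table is an assumption, not necessarily what the notebook used.

```python
import pandas as pd

# Same demonstration data as above
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
X = pd.DataFrame({"measurement": x, "weights": w})

# Full summary: count, mean, std, min, quartiles, max per column
summary = X.describe()
print(summary)

# Compact view (assumed): one row per column, selected statistics only
compact = summary.T[["count", "mean", "std"]].round(2)
print(compact)
```

Note that std here is the sample standard deviation (divisor n - 1), which is the pandas default.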


       measurement   weights
count      5.00000  5.000000
mean       8.70000  0.200000
std       11.09955  0.079057
min        1.00000  0.100000
25%        2.50000  0.150000
50%        4.00000  0.200000
75%        8.00000  0.250000
max       28.00000  0.300000


             count  mean    std
measurement    5.0   8.7  11.10
weights        5.0   0.2   0.08
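The weights column is defined above but not used in the text shown here. One plausible use, offered as an assumption about the intent, is a weighted mean, where each measurement contributes in proportion to its weight. A minimal sketch with np.average:

```python
import numpy as np
import pandas as pd

# Same demonstration data as above
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
X = pd.DataFrame({"measurement": x, "weights": w})

# Weighted mean by hand: sum(w_i * x_i) / sum(w_i)
manual = (X["measurement"] * X["weights"]).sum() / X["weights"].sum()

# The same quantity with NumPy
weighted_mean = np.average(X["measurement"], weights=X["weights"])

print(manual, weighted_mean)
```

Because the largest value (28.0) carries a weight of only 0.15, the weighted mean comes out lower than the unweighted mean of 8.7.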

Descriptive plots

We can also use the built-in pandas plots for exploratory data analysis, such as boxplot(), hist(), plot.kde() or simply plot(). Seaborn has some nice plots as well.
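The plotting code itself is not included in this text, so the following is only a sketch of how the four pandas plot types mentioned above could be drawn for the measurement column. The 2x2 layout, the bin count and the titles are illustrative choices. pandas delegates plotting to matplotlib, and plot.kde() additionally needs SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same demonstration data as above
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
X = pd.DataFrame({"measurement": x, "weights": w})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Box plot: median, quartiles and outliers
X["measurement"].plot.box(ax=axes[0, 0], title="boxplot")

# Histogram: frequency per bin
X["measurement"].plot.hist(ax=axes[0, 1], bins=5, title="hist")

# Kernel density estimate: smoothed distribution (requires SciPy)
X["measurement"].plot.kde(ax=axes[1, 0], title="kde")

# Plain line plot of the values in index order
X["measurement"].plot(ax=axes[1, 1], marker="o", title="plot")

fig.tight_layout()
```

With only five observations the histogram and density estimate are rough, but the box plot already makes the outlying value of 28.0 visible.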


Analytical statistics

Normality check with Shapiro-Wilk Test

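It is good practice to check for normality, because many parametric tests assume normally distributed data. The Shapiro-Wilk test is a common choice: its null hypothesis is that the sample was drawn from a normal distribution, so a small p-value is evidence against normality.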

source: https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/
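The test code is not shown in this text. As a minimal sketch, not the code from the source above, scipy.stats.shapiro can be applied to two synthetic samples; the sample sizes, the random seed and the 0.05 significance level are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumed for illustration): one normal, one clearly skewed sample
rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=200),
    "skewed": rng.exponential(scale=1.0, size=200),
}

alpha = 0.05  # conventional significance level
results = {}
for name, sample in samples.items():
    # shapiro returns the W statistic and the p-value
    stat, p = stats.shapiro(sample)
    results[name] = (stat, p)
    verdict = "normality rejected" if p < alpha else "no evidence against normality"
    print(f"{name}: W={stat:.3f}, p={p:.4g} -> {verdict}")
```

A p-value above the threshold does not prove normality; it only means the test found no evidence against it, and with very small samples the test has little power.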

More statistics

There are many cheat sheets and tutorials on the internet. The list below is a compact selection of tutorials:

  • https://www.kaggle.com/hamelg/python-for-data-21-descriptive-statistics

  • https://www.kaggle.com/hamelg/python-for-data-22-probability-distributions

  • https://www.kaggle.com/hamelg/python-for-data-23-confidence-intervals

  • https://www.kaggle.com/hamelg/python-for-data-24-hypothesis-testing

  • https://www.kaggle.com/hamelg/python-for-data-25-chi-squared-tests

  • https://www.kaggle.com/hamelg/python-for-data-26-anova/notebook

  • https://www.kaggle.com/hamelg/python-for-data-27-linear-regression
