Data exploration
Exploring Data distributions
Before we move into the analysis we need to inspect the structure and format of the data, and check its quality and quantity. For each column in a tidy table we should understand the content, the meaning of the column label, the datatype used, and the units used. We cannot interpret the results if we do not fully understand the context and the quality of the data. A few standard questions you should always consider are:
What is the structure of the data (data types, descriptive statistics)?
How many features and how many observations are available?
Are there any missing data and is imputation needed?
Is my data balanced?
Are there outliers?
Are features correlated?
Is normalization needed?
Is data transformation needed?
Is data redundant?
For demonstration purposes the famous iris dataset is used to illustrate several data exploration techniques. Seaborn and HoloViews are used as plotting libraries, but almost all plotting libraries contain methods for the types of figures shown here.
Describe the data
So our dataframe has 5 columns, the last one describing the class; there are no missing values and we have 150 observations.
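A minimal sketch of loading and inspecting the data, assuming scikit-learn's bundled copy of iris (the column names are renamed here to match the ones used in this text):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data; renaming the columns to match this text is an assumption
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

print(df.shape)               # 150 observations, 5 columns
print(df.isna().sum().sum())  # total number of missing values
df.info()                     # dtypes and non-null counts per column
```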
We might get a feeling for the categorical data by asking for the unique values
There are three classes
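A sketch of that check (note that scikit-learn's copy labels the classes without the "Iris-" prefix used elsewhere in this text):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")
df["Species"] = iris.target_names[iris.target]

print(df["Species"].unique())        # the three class labels
print(df["Species"].nunique())       # number of distinct classes
print(df["Species"].value_counts())  # observations per class
```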
Data exploration starts with some descriptive statistics. What can we determine from the ranges between minimum and maximum, the standard deviation and the mean? These statistics give us a clue about how the data is distributed. Descriptive statistics analyse the location of the central tendency and the variability of the features. It is also useful to explore how the data is distributed overall.
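These statistics can be obtained with `describe()`; a sketch using scikit-learn's bundled copy of the dataset (column names renamed to match this text):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# count, mean, std, min, quartiles and max for every numeric column
desc = df.describe()
print(desc)
```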
|       | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
| ----- | ------------ | ----------- | ------------ | ----------- |
| count | 150.000000   | 150.000000  | 150.000000   | 150.000000  |
| mean  | 5.843333     | 3.054000    | 3.758667     | 1.198667    |
| std   | 0.828066     | 0.433594    | 1.764420     | 0.763161    |
| min   | 4.300000     | 2.000000    | 1.000000     | 0.100000    |
| 25%   | 5.100000     | 2.800000    | 1.600000     | 0.300000    |
| 50%   | 5.800000     | 3.000000    | 4.350000     | 1.300000    |
| 75%   | 6.400000     | 3.300000    | 5.100000     | 1.800000    |
| max   | 7.900000     | 4.400000    | 6.900000     | 2.500000    |
Since we have different species, it is interesting to stratify the statistics per species.
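A sketch of the stratification with `groupby` (scikit-learn's copy of the data is used here, where the class labels lack the "Iris-" prefix):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# central tendency and variability per class
means = df.groupby("Species").mean()
stds = df.groupby("Species").std()
print(means)
print(stds)
```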
Mean per species:

| Species         | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
| --------------- | ------------ | ----------- | ------------ | ----------- |
| Iris-setosa     | 5.006        | 3.418       | 1.464        | 0.244       |
| Iris-versicolor | 5.936        | 2.770       | 4.260        | 1.326       |
| Iris-virginica  | 6.588        | 2.974       | 5.552        | 2.026       |
Standard deviation per species:

| Species         | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
| --------------- | ------------ | ----------- | ------------ | ----------- |
| Iris-setosa     | 0.352490     | 0.381024    | 0.173511     | 0.107210    |
| Iris-versicolor | 0.516171     | 0.313798    | 0.469911     | 0.197753    |
| Iris-virginica  | 0.635880     | 0.322497    | 0.551895     | 0.274650    |
Plotting
It is highly recommended to plot the distribution before assessing it statistically (for instance with a normality test). Several plots can be used for this. The most commonly used are the barplot, the boxplot, the frequency table, the histogram and the kernel density plot.
barplot
A barplot can be used for checking the balance in our dataset.
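A sketch with seaborn's `countplot`, which draws one bar per class (scikit-learn's copy of iris is used here as an assumption):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# One bar per class; equal bar heights mean a balanced dataset
ax = sns.countplot(data=df, x="Species")
```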
boxplot
Boxplots are based on percentiles and are a quick way to visualize the distribution of the data. In a boxplot the median is plotted, the box covers the middle 50% of the data, and the whiskers represent the edges of the distribution. Any data outside the whiskers is plotted as a single point or circle; these can be considered outliers.
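A sketch of a per-species boxplot with seaborn (the data loading via scikit-learn is an assumption):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"sepal length (cm)": "Sepal_Length"})
df["Species"] = iris.target_names[iris.target]

# Median, box (middle 50%) and whiskers per species; points outside the
# whiskers are drawn individually and can be considered outliers
ax = sns.boxplot(data=df, x="Species", y="Sepal_Length")
```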
We can see from the boxplot that virginica has an outlier, and that the species differ from each other in mean and also in variance.
histogram
A histogram is an approximate representation of the distribution of numerical data. The data is divided into intervals (bins) and we count how many values fall into each interval. The bins are plotted on the x-axis, the counts on the y-axis. In the example below each species has a different color. By coloring the species, the distributions of the different classes are displayed in one graph.
Sometimes it is more informative to plot them separately.
We can indeed see the outlier in the virginica Sepal_length.
density plots
Density plots are a variation of histograms. They use a kernel to estimate the counts and smooth the result across the x-axis. An advantage of density plots over histograms is that they are not affected by the number of bins, and are therefore better at showing the shape of the distribution. Normal distribution curves are an example of density plots.
violin plot
You can combine the density distribution and the boxplot into a violin plot. A violin plot is an enhancement of the boxplot that plots the density estimate along the y-axis. The advantage of this plot is that it can show nuances in the distribution that a boxplot hides. Outliers, however, are less clearly visible. Best practice is to plot a boxplot inside the violin plot. This kind of plot is extremely informative, showing the distribution, outliers, central tendency and variation in one figure.
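A sketch with seaborn's `violinplot`, drawing a miniature boxplot inside each violin (data loading via scikit-learn is an assumption):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"sepal length (cm)": "Sepal_Length"})
df["Species"] = iris.target_names[iris.target]

# inner="box" draws a small boxplot inside each violin,
# combining density shape with median and quartiles
ax = sns.violinplot(data=df, x="Species", y="Sepal_Length", inner="box")
```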
grids
So far we only plotted one feature, the Sepal length. For exploratory data analysis we often use grids (facets) to plot all the columns in one overview. The following piece of code displays the histograms of all the numerical columns in the dataframe.
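A minimal version of that code, assuming pandas' built-in `hist` (the original may equally have used seaborn or HoloViews):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# One histogram per numeric column in a single grid of subplots
axes = df.hist(figsize=(8, 6), bins=15)
plt.tight_layout()
```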
pairplots
Pairplots can also be used to investigate the relations between features.
We can see that Petal_Length and Petal_Width are highly related.
heatmap
Another way of showing relations is with a heatmap of the correlation matrix.
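A sketch that computes the correlation matrix and draws it as a heatmap (data loading via scikit-learn is an assumption; note that the table below reports the magnitude of the Sepal_Length/Sepal_Width correlation, which comes out slightly negative when computed directly):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# Pearson correlation between all numeric columns, shown as an annotated heatmap
corr = df.drop(columns="Species").corr()
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
print(corr)
```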
|              | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
| ------------ | ------------ | ----------- | ------------ | ----------- |
| Sepal_Length | 1.000000     | 0.109369    | 0.871754     | 0.817954    |
| Sepal_Width  | 0.109369     | 1.000000    | 0.420516     | 0.356544    |
| Petal_Length | 0.871754     | 0.420516    | 1.000000     | 0.962757    |
| Petal_Width  | 0.817954     | 0.356544    | 0.962757     | 1.000000    |
Again, Petal length and Petal width are highly related
qqplot
If it is important that a variable is normally distributed (because of the statistical model you would like to use), you can also use a QQ-plot. A QQ-plot is used to visually determine how close a sample is to the normal distribution. If the points fall roughly on the diagonal line, the samples can be considered normally distributed.
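A sketch using scipy's `probplot`, which draws the sample quantiles against the theoretical normal quantiles plus a fitted line (data loading via scikit-learn is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"sepal length (cm)": "Sepal_Length"})
df["Species"] = iris.target_names[iris.target]

# Sample quantiles vs. theoretical normal quantiles, with a least-squares fit line
fig, ax = plt.subplots()
stats.probplot(df["Sepal_Length"], dist="norm", plot=ax)
ax.set_title("QQ-plot of Sepal_Length")
```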
In this example we can see that the samples are not all on the line; this might be due to the different species. Best practice is to stratify.
Statistical test
The Shapiro-Wilk test tests whether the data is normally distributed. The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than 0.05, the null hypothesis is rejected and there is evidence that the tested data are not normally distributed.
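A sketch with `scipy.stats.shapiro`, applied to the pooled data and then stratified per species (data loading via scikit-learn is an assumption):

```python
import pandas as pd
from scipy.stats import shapiro
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"sepal length (cm)": "Sepal_Length"})
df["Species"] = iris.target_names[iris.target]

# Shapiro-Wilk on all 150 values: mixing the species distorts the distribution
w, p = shapiro(df["Sepal_Length"])
print(f"all species: W={w:.3f}, p={p:.4f}")

# Stratified per species, as recommended above
for species, grp in df.groupby("Species"):
    w_s, p_s = shapiro(grp["Sepal_Length"])
    print(f"{species}: W={w_s:.3f}, p={p_s:.4f}")
```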
If the data is not normally distributed we might consider a transformation, such as a log transformation, to make it normal. A log transformation applies a logarithm to the data to reduce its skew. We can check the skewness with skew(). Skewness, however, can be caused by, for instance, the different species; a transformation based on skewness is only appropriate if we are certain of the homogeneity of our samples.
A log transformation is only recommended for highly skewed data.
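A sketch of checking skewness and applying a log transform; the `Petal_Width_log` column is a hypothetical name for illustration, and `log1p` is used because plain `log` fails on zeros:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data into a tidy DataFrame (column names are an assumption)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal_Length",
    "sepal width (cm)": "Sepal_Width",
    "petal length (cm)": "Petal_Length",
    "petal width (cm)": "Petal_Width",
}).drop(columns="target")
df["Species"] = iris.target_names[iris.target]

# Skewness of every numeric column; values far from 0 indicate skew
skews = df.drop(columns="Species").skew()
print(skews)

# Hypothetical log-transformed column; log1p = log(1 + x), safe for zeros
df["Petal_Width_log"] = np.log1p(df["Petal_Width"])
print(df["Petal_Width_log"].skew())
```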