# Data exploration

## Exploring Data distributions

Before moving into the analysis, we need to inspect the structure and format of the data and check its quality and quantity. For each column in a tidy table, we should understand the content, the meaning of the column label, the data type, and the units. We cannot interpret the results if we do not fully understand the context and the quality of the data. A few standard questions you should always consider are:

* What is the structure of the data (data types, descriptive statistics)?
* How many features and how many observations are available?
* Are there any missing data and is imputation needed?
* Is my data balanced?
* Are there outliers?
* Are features correlated?
* Is normalization needed?
* Is data transformation needed?
* Is data redundant?
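
Several of these questions can be answered with one-line pandas calls. The following is a minimal sketch on a small made-up table (the column names `length`, `width` and `label` are hypothetical and are not part of the iris data); it only illustrates which calls map to which question.

```python
import numpy as np
import pandas as pd

# Toy table, made up for illustration (not the iris data).
df = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 6.3, 6.3],
    "width": [3.5, 3.0, 3.2, 3.3, 3.3],
    "label": ["a", "a", "b", "b", "b"],
})

n_observations, n_columns = df.shape       # how many rows and columns?
dtypes = df.dtypes                         # which data type per column?
missing_per_column = df.isna().sum()       # are there missing data?
n_duplicate_rows = df.duplicated().sum()   # are there redundant (identical) rows?

print(n_observations, n_columns)
print(dtypes)
print(missing_per_column)
print(n_duplicate_rows)
```

Questions about outliers, correlation and transformation usually need a plot or a statistic per column, which is what the rest of this page covers.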

For demonstration purposes, the famous iris dataset is used to show several data exploration techniques. Seaborn and hvplot (which builds on HoloViews) are used as plotting libraries, but almost all plotting libraries can produce the types of figures shown here.

## Describe the data

```python
import pandas as pd
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
iris =  pd.read_csv(csv_url, names = col_names)
print(iris.info())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal_Length  150 non-null    float64
 1   Sepal_Width   150 non-null    float64
 2   Petal_Length  150 non-null    float64
 3   Petal_Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
```

So our dataframe has 5 columns: four numerical features and a last column describing the class. There are no missing values and we have 150 observations.

We can get a feeling for the categorical data by asking for the unique values.

```python
iris.Species.unique()
```

```
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
```

There are three classes.

Data exploration starts with some descriptive statistics. What can we determine from the range between minimum and maximum, the standard deviation and the mean? These statistics give us a clue about how the data is distributed. Descriptive statistics summarize the central tendency and the variability of the features. It is also useful to explore how the data is distributed overall.

```python
iris.describe()
```

|       | Sepal\_Length | Sepal\_Width | Petal\_Length | Petal\_Width |
| ----- | ------------- | ------------ | ------------- | ------------ |
| count | 150.000000    | 150.000000   | 150.000000    | 150.000000   |
| mean  | 5.843333      | 3.054000     | 3.758667      | 1.198667     |
| std   | 0.828066      | 0.433594     | 1.764420      | 0.763161     |
| min   | 4.300000      | 2.000000     | 1.000000      | 0.100000     |
| 25%   | 5.100000      | 2.800000     | 1.600000      | 0.300000     |
| 50%   | 5.800000      | 3.000000     | 4.350000      | 1.300000     |
| 75%   | 6.400000      | 3.300000     | 5.100000      | 1.800000     |
| max   | 7.900000      | 4.400000     | 6.900000      | 2.500000     |

Since we have different species, it is interesting to stratify the statistics by species.

```python
iris.groupby(['Species']).mean()
```

|                 | Sepal\_Length | Sepal\_Width | Petal\_Length | Petal\_Width |
| --------------- | ------------- | ------------ | ------------- | ------------ |
| Species         |               |              |               |              |
| Iris-setosa     | 5.006         | 3.418        | 1.464         | 0.244        |
| Iris-versicolor | 5.936         | 2.770        | 4.260         | 1.326        |
| Iris-virginica  | 6.588         | 2.974        | 5.552         | 2.026        |

```python
iris.groupby(['Species']).std()
```

|                 | Sepal\_Length | Sepal\_Width | Petal\_Length | Petal\_Width |
| --------------- | ------------- | ------------ | ------------- | ------------ |
| Species         |               |              |               |              |
| Iris-setosa     | 0.352490      | 0.381024     | 0.173511      | 0.107210     |
| Iris-versicolor | 0.516171      | 0.313798     | 0.469911      | 0.197753     |
| Iris-virginica  | 0.635880      | 0.322497     | 0.551895      | 0.274650     |
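
Several statistics per group can also be collected in a single table with `agg`. This is a small sketch on made-up values (the `group` and `value` columns are hypothetical), not a replacement for the calls above.

```python
import pandas as pd

# Toy measurements, made up for illustration (not the iris data).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 14.0],
})

# One call returns several statistics per group, one column per statistic.
summary = toy.groupby("group")["value"].agg(["mean", "std", "min", "max"])
print(summary)
```

Applied to the iris dataframe, the same pattern would be `iris.groupby('Species').agg(['mean', 'std'])`, which produces one mean and one standard deviation column per feature.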

```python
iris.Species.value_counts()
```

```
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64
```
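
Equal counts per class mean the dataset is balanced. When the counts differ, proportions are often easier to read than raw counts. The sketch below uses made-up labels, and the `imbalance_ratio` name is only illustrative.

```python
import pandas as pd

# Toy labels, made up for illustration.
labels = pd.Series(["a", "a", "a", "b", "b", "c"], name="label")

# Fraction of observations per class instead of raw counts.
proportions = labels.value_counts(normalize=True)

# Largest class relative to the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(imbalance_ratio)
```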

## Plotting

It is highly recommended to plot the distribution first, before assessing it statistically (for instance with a normality test). We can use several plots for this. The most commonly used are the barplot, the boxplot, the frequency table, the histogram and the kernel density plot.

```python
#https://holoviews.org/user_guide/Customizing_Plots.html
import hvplot.pandas
```

### barplot

We can use a barplot to check the balance of the classes in our dataset.

```python
iris.Species.value_counts().plot(kind = 'barh', alpha = 0.5)
```

<figure><img src="/files/X5xIKIo53HrVXMbmLoMl" alt=""><figcaption></figcaption></figure>

### boxplot

Boxplots are based on percentiles and are a quick way to visualize the distribution of the data. The line inside the box marks the median, the box spans the middle 50% of the data (from the 25th to the 75th percentile), and the whiskers indicate the range of the remaining data. Any data point outside the whiskers is plotted as a single point or circle and can be considered a potential outlier.

```python
boxplot = iris.hvplot.box(y='Sepal_Length', 
                          by='Species', 
                          legend = False,
                          box_alpha = 0.5)
boxplot
```

<figure><img src="/files/5V1jKQWCuopLUwBi1K7O" alt=""><figcaption></figcaption></figure>

We can see from the boxplot that virginica has an outlier, and that the species differ from each other in median and also in spread.
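
Points beyond the whiskers can also be found numerically. A common convention, assumed here and not necessarily the exact rule every plotting library applies, places the fences at 1.5 times the interquartile range (IQR) beyond the quartiles. The values below are made up for illustration.

```python
import numpy as np

# Toy sample, made up for illustration; 9.9 is deliberately extreme.
values = np.array([4.9, 5.6, 5.8, 6.1, 6.3, 6.4, 6.5, 6.7, 6.9, 9.9])

# Quartiles and interquartile range (the height of the box).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```

A flagged value is a candidate for inspection, not automatically an error; whether to keep it depends on the context of the data.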

### histogram

A histogram is an approximate representation of the distribution of numerical data. The data is divided into intervals (bins), and we count how many values fall into each interval. The bins are plotted on the x-axis, the counts on the y-axis. In the example below each species has a different color, so the distributions of the different classes are displayed in one graph.

```python
histogram = iris.hvplot.hist('Sepal_Length', 
                             by='Species', 
                             alpha = 0.5)
histogram
```

<figure><img src="/files/TqUjy57bKtQT5vpOyFfy" alt=""><figcaption></figcaption></figure>

Sometimes it is more informative to plot them separately.

```python
histogram = iris.hvplot.hist('Sepal_Length', 
                             by='Species', 
                             alpha = 0.5, 
                             subplots=True,
                             width = 250)
histogram
```

<figure><img src="/files/I72VCQuKOB42In4Oxbqh" alt=""><figcaption></figcaption></figure>

We can indeed see the outlier in the virginica Sepal\_Length.
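
The binning behind a histogram can be inspected directly with `numpy.histogram`, which returns the counts and the bin edges without drawing anything. The values below are made up for illustration, and the choice of 4 and 2 bins is arbitrary.

```python
import numpy as np

# Toy sample, made up for illustration.
values = np.array([4.3, 4.8, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 6.4, 7.9])

# Four equal-width bins between the minimum and the maximum.
counts, bin_edges = np.histogram(values, bins=4)
print(counts)
print(bin_edges)

# The same data with two bins gives a much coarser picture.
coarse_counts, coarse_edges = np.histogram(values, bins=2)
print(coarse_counts)
```

Because the picture changes with the number of bins, it is worth trying a few bin counts before drawing conclusions about the shape of a distribution.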

### density plots

Density plots are a variation of histograms. They use a kernel to estimate the density and smooth it along the x-axis. An advantage of density plots over histograms is that they are not affected by the number of bins (although the result does depend on the kernel bandwidth), and are therefore better at showing the shape of the distribution. The familiar normal distribution curve is an example of a density curve.

```python
iris.hvplot.kde('Sepal_Length', 
                    by='Species')
```

<figure><img src="/files/UleIaHDoKFxcXNvlxRz3" alt=""><figcaption></figcaption></figure>

```python
iris.hvplot.kde('Sepal_Length', 
                by='Species',
                alpha = 0.5, 
                subplots=True,
                width = 250)
```

<figure><img src="/files/cAau1w8WY2gtcOJeyYRV" alt=""><figcaption></figcaption></figure>
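
To make the idea of kernel smoothing concrete, here is a minimal hand-written Gaussian kernel density estimate. It is a sketch for illustration only: the function name `gaussian_kde_1d`, the sample values and the fixed bandwidth of 0.3 are assumptions, and plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Toy sample, made up for illustration.
samples = np.array([4.6, 4.9, 5.0, 5.1, 5.4])
grid = np.linspace(2.0, 8.0, 601)

density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density integrates to (approximately) one over the grid.
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A larger bandwidth gives a smoother curve and a smaller one a bumpier curve, so the bandwidth plays a role similar to the bin width of a histogram.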

### violin plot

You can combine the density plot and the boxplot into a violin plot. A violin plot is an enhancement of the boxplot that also shows the density estimate, mirrored on both sides of the box. The advantage of this plot is that it can show nuances in the distribution that a boxplot hides. Outliers, however, are less clearly visible. Best practice is to plot a boxplot inside the violin plot. This kind of plot is extremely informative, since it shows distribution, outliers, central tendency and variation.

```python
iris.hvplot.violin('Sepal_Length', 
                    by='Species', legend=False, 
                    width=700, height=500, padding=0.5)
```

<figure><img src="/files/5b5Jo2scDurdY0Wlv53Q" alt=""><figcaption></figcaption></figure>

### grids

So far we only plotted one variable, the sepal length. For exploratory data analysis we often use grids (facets) to plot all the columns in one overview. The following piece of code displays the histograms of all the numerical columns in the dataframe.

```python
iris.hist(bins=40, figsize=(15, 10), alpha = 0.5)
```

<figure><img src="/files/M4b0YCVnlz4T0VfuFxhl" alt=""><figcaption></figcaption></figure>

### pairplots

Pairplots can also be used to investigate relations between variables.

```python
import seaborn
seaborn.pairplot(iris, hue='Species')
```

<figure><img src="/files/71IipLhdi6B4UAOIN4fi" alt=""><figcaption></figcaption></figure>

We can see that Petal\_Length and Petal\_Width are highly related.

### heatmap

Another way of showing relations is with a heatmap of the correlation matrix.

```python
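# correlate only the numerical columns
df = iris.drop(['Species'],axis = 1)
# absolute Pearson correlation between every pair of columns
c = df.corr().abs()
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
c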
```

|               | Sepal\_Length | Sepal\_Width | Petal\_Length | Petal\_Width |
| ------------- | ------------- | ------------ | ------------- | ------------ |
| Sepal\_Length | 1.000000      | 0.109369     | 0.871754      | 0.817954     |
| Sepal\_Width  | 0.109369      | 1.000000     | 0.420516      | 0.356544     |
| Petal\_Length | 0.871754      | 0.420516     | 1.000000      | 0.962757     |
| Petal\_Width  | 0.817954      | 0.356544     | 0.962757      | 1.000000     |

```python
import seaborn as sns
sns.heatmap(c)
```

<figure><img src="/files/L5O71IKimSvQ9P3bqraH" alt=""><figcaption></figcaption></figure>

Again, petal length and petal width are highly correlated.
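
With many columns a heatmap becomes hard to read, and it can help to list the strongly correlated pairs explicitly. The sketch below uses made-up data, and the threshold of 0.9 is an arbitrary assumption, not a recommended cut-off.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: b is roughly 2 * a, c is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr().abs()

# Keep only the upper triangle, so each pair appears once
# and the diagonal (always 1.0) is left out.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per (column, column) pair, then filter on the threshold.
pairs = upper.stack()
strong = pairs[pairs > 0.9]
print(strong)
```

Strongly correlated features carry largely the same information, which touches on the earlier question of whether the data is redundant.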

### qqplot

If it is important that a variable is normally distributed (because of the statistical model you would like to use), you can also use a QQ-plot. A QQ-plot is used to visually determine how close a sample is to the normal distribution. If the points fall roughly on the diagonal line, the sample can be considered normally distributed.

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt
fig = sm.qqplot(iris.Petal_Length, fit = True, line = '45')
plt.show()
```

<figure><img src="/files/VxbmDfwNv1yP488PFCWq" alt=""><figcaption></figcaption></figure>

In this example we can see that the points do not all fall on the line; this might be due to the different species. Best practice is to stratify.

```python
fig = sm.qqplot(iris[iris.Species == 'Iris-setosa'].Petal_Length, fit = True, line = '45')
plt.show()
```

<figure><img src="/files/HnsKsoYyUTEBCL2Z3uW8" alt=""><figcaption></figcaption></figure>
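
The points of a QQ-plot can also be computed without drawing them. As a sketch, `scipy.stats.probplot` returns the theoretical quantiles, the ordered sample values, and a least-squares line through them. The sample below is randomly generated (hypothetical data, not the iris measurements).

```python
import numpy as np
from scipy import stats

# Hypothetical sample drawn from a normal distribution, for illustration only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.5, scale=0.2, size=50)

# Theoretical normal quantiles versus the sorted sample values,
# plus the slope, intercept and correlation of the fitted line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A correlation `r` close to 1 means the points lie close to a straight line, which is the numerical counterpart of the visual check described above.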

### Statistical test

The Shapiro-Wilk test checks whether the data is normally distributed. The null hypothesis of this test is that the population is normally distributed. Thus, if the p value is less than 0.05, the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

```python
from scipy.stats import shapiro
shapiro(iris[iris.Species == 'Iris-setosa'].Petal_Length)
```

```
ShapiroResult(statistic=0.9549460411071777, pvalue=0.05465003103017807)
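```

Here the p value (about 0.055) is just above 0.05, so the null hypothesis of normality is not rejected for the setosa petal length.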
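
In line with the advice to stratify, the test can be run once per group. The following sketch uses randomly generated groups (hypothetical data: group `a` is drawn from a normal distribution, group `b` from an exponential one) and the conventional 0.05 threshold.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical data: group "a" is normal, group "b" is exponential (skewed).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "group": ["a"] * 40 + ["b"] * 40,
    "value": np.concatenate([rng.normal(5.0, 0.4, 40), rng.exponential(1.0, 40)]),
})

# Run the Shapiro-Wilk test separately for every group.
results = {}
for name, values in toy.groupby("group")["value"]:
    statistic, p_value = shapiro(values)
    results[name] = p_value
    verdict = "reject normality" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W={statistic:.3f}, p={p_value:.3f} -> {verdict}")
```

For the iris data the same loop would run over `iris.groupby('Species')['Petal_Length']`.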

If the data is not normally distributed, we might consider a transformation, such as a log transformation, to bring it closer to normal. A log transformation applies a logarithm to the data to reduce its skew. We can check the skewness with `skew()`. Skewness, however, can be caused by, for instance, mixing different species, so it is only meaningful if we are certain that our samples are homogeneous.

```python
#check skewness
iris.skew(numeric_only=True)
```

```
Sepal_Length    0.314911
Sepal_Width     0.334053
Petal_Length   -0.274464
Petal_Width    -0.104997
dtype: float64
```

Log transformation is only recommended for highly skewed data.
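
The effect of a log transformation on skewness can be checked before and after. This sketch uses a randomly generated, strongly right-skewed sample (hypothetical data, not the iris measurements). Note that a plain logarithm requires strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), for illustration only.
rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))

skew_before = skewed.skew()

# The logarithm compresses large values, which pulls in the long right tail.
skew_after = np.log(skewed).skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The iris features shown above have skewness values close to zero, so by the same reasoning a log transformation would bring little benefit there.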

