# Normalization

The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable so that its values lie between 0 and 1, while standardization transforms data to have a mean of zero and a standard deviation of 1.

The goal of normalization is to bring the values of numeric columns in the dataset onto a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed mainly when features have very different ranges. For example, age will have a range of roughly 0 to 100, while a platelet count (kiloplatelets/mL) might range from 25100 to 800000. These two features are on very different scales. In many analyses, for example those based on distances, the platelets feature will influence the result more simply because its values are larger. To avoid this, we scale the features.
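
To make the effect of different ranges concrete, the sketch below compares the Euclidean distance between two patients before and after a simple rescaling. The two patients and the assumed feature ranges are made up for illustration and are not taken from the dataset used later on this page.

```python
import numpy as np

# Two hypothetical patients: [age in years, platelet count]
a = np.array([40.0, 250000.0])
b = np.array([80.0, 251000.0])

# Unscaled: the platelet difference (1000) swamps the age difference (40)
diff = a - b
dist_raw = np.sqrt(np.sum(diff ** 2))

# Divide each difference by an assumed feature range
# (age 0 to 100, platelets 25100 to 800000)
ranges = np.array([100.0 - 0.0, 800000.0 - 25100.0])
diff_scaled = diff / ranges
dist_scaled = np.sqrt(np.sum(diff_scaled ** 2))

print(dist_raw)     # driven almost entirely by the platelet column
print(dist_scaled)  # driven almost entirely by the age column
```

Before scaling, a 40-year age gap barely registers next to a small platelet difference; after scaling, the age gap is the dominant term.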

## Standardization

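Standardization rescales the values of a given independent variable so that they have a mean of zero and a standard deviation of 1. It shifts and scales the data but does not change the shape of its distribution. The mean of the variable is subtracted from each value and the result is divided by the standard deviation: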

$$
\displaystyle x' = \frac{x - \bar{x}}{\sigma}
$$
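
As a minimal sketch, the formula can be applied by hand with NumPy. The sample values are made up; `x.std()` uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical sample values for a single feature
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()   # arithmetic mean of the feature
std = x.std()     # population standard deviation (ddof=0)

# Subtract the mean and divide by the standard deviation
x_std = (x - mean) / std

print(x_std)
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```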

## Normalization

Also known as min-max scaling or min-max normalization. It is the simplest method and consists of rescaling each feature to a fixed range, typically \[0, 1] or \[−1, 1]. The choice of target range depends on the nature of the data. The general formula for min-max scaling to \[0, 1] is:

$$
\displaystyle x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$
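
The formula can be sketched in a few lines of NumPy. The helper `min_max_scale`, its `low`/`high` parameters, and the sample ages are hypothetical names and values chosen for this illustration; mapping a constant column to `low` is an assumption of this sketch, not a rule from the text.

```python
import numpy as np

def min_max_scale(x, low=0.0, high=1.0):
    """Linearly rescale a 1-D array so its minimum maps to low and its maximum to high."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Constant column: avoid division by zero (assumption: map everything to low)
        return np.full_like(x, low)
    unit = (x - x_min) / (x_max - x_min)   # the [0, 1] formula
    return low + unit * (high - low)       # stretch and shift to [low, high]

ages = np.array([40.0, 55.0, 75.0, 95.0])  # hypothetical ages
print(min_max_scale(ages))          # scaled to [0, 1]
print(min_max_scale(ages, -1, 1))   # scaled to [-1, 1]
```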

## Standardization in Python

The `preprocessing` module of `sklearn` provides a utility class, `StandardScaler`, that scales data according to the standardization formula.

```python
import pandas as pd
import numpy as np
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')
y = np.array(df['DEATH_EVENT']).reshape(-1, 1)
X = np.array(df.iloc[:,0:11])
print(X.shape)
print(y.shape)
```

```
this dataset contains 299 rows
(299, 11)
(299, 1)
```

```python
X
```

```
array([[7.500e+01, 0.000e+00, 5.820e+02, ..., 1.300e+02, 1.000e+00,
        0.000e+00],
       [5.500e+01, 0.000e+00, 7.861e+03, ..., 1.360e+02, 1.000e+00,
              nan],
       [6.500e+01, 0.000e+00, 1.460e+02, ..., 1.290e+02, 1.000e+00,
        1.000e+00],
       ...,
       [4.500e+01, 0.000e+00, 2.060e+03, ..., 1.380e+02, 0.000e+00,
        0.000e+00],
       [4.500e+01, 0.000e+00, 2.413e+03, ..., 1.400e+02, 1.000e+00,
        1.000e+00],
       [5.000e+01, 0.000e+00, 1.960e+02, ..., 1.360e+02, 1.000e+00,
        1.000e+00]])
```

```python
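# standardize the data (zero mean, unit variance per column)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X)         # learn the mean and standard deviation of each column
X_scale = scaler.transform(X)  # apply (x - mean) / std to each column
X_scale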
```

```
array([[ 1.18934352, -0.87872174,  0.23365768, ..., -1.74232053,
         0.72804852, -0.6770032 ],
       [ 0.34147055, -0.87872174, -0.57955124, ..., -1.99760559,
         0.72804852,  1.47709789],
       [-0.93033892,  1.13801668, -0.64483177, ...,  0.04467489,
         0.72804852, -0.6770032 ],
       ...,
       [-0.50640243, -0.87872174,  2.54272337, ...,  0.555245  ,
        -1.37353483, -0.6770032 ],
       [-1.35427541, -0.87872174,  3.64876211, ...,  0.81053006,
         0.72804852,  1.47709789],
       [-0.93033892, -0.87872174, -0.48629334, ..., -0.21061017,
         0.72804852,  1.47709789]])
```
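
A quick way to check the result is to confirm that every scaled column has mean 0 and standard deviation 1. The sketch below uses a small made-up array (`X_demo`) rather than the heart failure data, and includes one missing value to show that `StandardScaler` ignores `NaN` when fitting and leaves it in place when transforming.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns, one missing value
X_demo = np.array([[75.0, 0.0],
                   [55.0, np.nan],
                   [65.0, 1.0],
                   [45.0, 0.0]])

scaler = StandardScaler().fit(X_demo)
X_demo_scaled = scaler.transform(X_demo)

# Statistics learned during fit (NaN values are skipped)
print(scaler.mean_, scaler.scale_)

# Per-column mean and standard deviation after scaling, ignoring NaN
print(np.nanmean(X_demo_scaled, axis=0))
print(np.nanstd(X_demo_scaled, axis=0))
```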

If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well. In these cases, you can use `robust_scale` and `RobustScaler` as drop-in replacements instead. They use more robust estimates for the center and range of your data.
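
As a rough illustration, the sketch below applies both scalers to a made-up feature with one extreme outlier. With its default settings `RobustScaler` centers on the median and scales by the interquartile range, so the outlier has little influence on how the remaining values are spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature with one extreme outlier
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

robust = RobustScaler().fit_transform(X_demo)      # median and IQR
standard = StandardScaler().fit_transform(X_demo)  # mean and standard deviation

# Spread of the five ordinary values under each scaler
print(robust[4, 0] - robust[0, 0])      # ordinary values stay well separated
print(standard[4, 0] - standard[0, 0])  # ordinary values are squeezed together
```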

## Min-max normalization in Python

```python
# normalize the data to the [0, 1] range per column
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(X)
X_scale = scaler.transform(X)
X_scale
```

```
array([[0.63636364, 0.        , 0.07131921, ..., 0.48571429, 1.        ,
        0.        ],
       [0.27272727, 0.        , 1.        , ..., 0.65714286, 1.        ,
               nan],
       [0.45454545, 0.        , 0.01569278, ..., 0.45714286, 1.        ,
        1.        ],
       ...,
       [0.09090909, 0.        , 0.25988773, ..., 0.71428571, 0.        ,
        0.        ],
       [0.09090909, 0.        , 0.30492473, ..., 0.77142857, 1.        ,
        1.        ],
       [0.18181818, 0.        , 0.02207196, ..., 0.65714286, 1.        ,
        1.        ]])
```
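
To scale to \[−1, 1] instead, `MinMaxScaler` accepts a `feature_range` argument. The sketch below uses a small made-up array (`X_demo`) and also shows that `inverse_transform` recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: [age, platelet count]
X_demo = np.array([[40.0, 25100.0],
                   [60.0, 263000.0],
                   [95.0, 850000.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
X_demo_scaled = scaler.fit_transform(X_demo)  # each column now spans -1 to 1

# Map the scaled values back to the original units
X_back = scaler.inverse_transform(X_demo_scaled)

print(X_demo_scaled)
print(X_back)
```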

## More to read

Besides these common methods of normalization, there are several others. More to read:\
<https://scikit-learn.org/stable/modules/preprocessing.html>

