The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable to values between 0 and 1, while standardization transforms data to have a mean of zero and a standard deviation of 1.
The goal of normalization is to bring the numeric columns of a dataset to a common scale without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed only when features have very different ranges. For example, age has a range of roughly 0-100, while platelets (kiloplatelets/mL) can range from about 25,100 to 800,000. Because these two features live on very different scales, platelets would intrinsically influence the analysis more simply because of its larger values. To avoid this, we scale the features.
Standardization
Standardization rescales the values of a given independent variable so that they have a mean of zero and a standard deviation of 1:
$$x' = \frac{x - \bar{x}}{\sigma}$$
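As a quick illustration, the same transformation can be computed by hand with NumPy. This is a minimal sketch on made-up age values, not data from the dataset used below.

import numpy as np

x = np.array([63.0, 55.0, 65.0, 50.0, 72.0])  # made-up ages
x_std = (x - x.mean()) / x.std()              # subtract the mean, divide by the standard deviation
print(x_std.mean(), x_std.std())              # approximately 0.0 and 1.0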
Normalization
Also known as min-max scaling or min-max normalization, this is the simplest method: it rescales the features to the range [0, 1] or [−1, 1]. Which target range to choose depends on the nature of the data. The general formula for scaling to [0, 1] is:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
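A minimal NumPy sketch of the same formula, applied to three made-up platelet counts in the range mentioned above:

import numpy as np

x = np.array([25100.0, 150000.0, 800000.0])     # made-up platelet counts
x_scaled = (x - x.min()) / (x.max() - x.min())  # maps min(x) to 0 and max(x) to 1
print(x_scaled)                                 # approximately [0. 0.161 1.]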
Standardization in Python
The preprocessing module of sklearn provides a utility class StandardScaler that performs standardization.
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
import pandas as pd
import numpy as np

# load the heart failure dataset
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')

# target: DEATH_EVENT; features: the first 11 columns
y = np.array(df['DEATH_EVENT']).reshape(-1, 1)
X = np.array(df.iloc[:, 0:11])
print(X.shape)
print(y.shape)
this dataset contains 299 rows
(299, 11)
(299, 1)
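With the feature matrix X in place, a minimal sketch of applying StandardScaler follows; fit_transform learns the column means and standard deviations and rescales in one step, and the variable names are illustrative.

from sklearn.preprocessing import StandardScaler, RobustScaler

scaler = StandardScaler()
X_std = scaler.fit_transform(X)        # every column now has mean ~0 and standard deviation ~1
print(X_std.mean(axis=0).round(2))
print(X_std.std(axis=0).round(2))

# RobustScaler is a drop-in replacement that centers and scales with the median and IQR,
# making it less sensitive to outliers such as extreme platelet counts
X_robust = RobustScaler().fit_transform(X)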
Min-max normalization in Python
# normalize the data with min-max scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(X)          # learn the minimum and maximum of each column
X_scale = scaler.transform(X)   # rescale every column to [0, 1]
X_scale
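As a quick sanity check (not part of the original code), the per-column minimum and maximum of the result should now be 0 and 1:

print(X_scale.min(axis=0))   # every column should start at 0.0
print(X_scale.max(axis=0))   # and end at 1.0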