
Scaling

In this section, we are not looking at non-linear scalers.

Why Scaling

Models base their prediction on the values of the input features. Features can be unfairly treated if they are at different scales.

For example with k-nearest neighbours: samples with similar feature values are predicted to have similar target values. So when the values of one feature are much larger than those of the other features, the model will base its prediction almost solely on that one large-scale feature.

Left: true labels. Right: values predicted by a model trained on unscaled features. As you can see, a model can be overly sensitive to a feature when the data is improperly scaled.
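The effect above can be sketched with a quick distance calculation (the numbers here are made up for illustration):

```python
import numpy as np

# Two features on very different scales: x0 in [0, 1], x1 in [0, 1000].
a = np.array([0.1, 100.0])
b = np.array([0.9, 110.0])   # very different in x0, fairly close in x1
c = np.array([0.1, 900.0])   # identical in x0, very different in x1

# With unscaled features, Euclidean distance is dominated by x1:
d_ab = np.linalg.norm(a - b)  # ~10.0 -- the x0 difference barely registers
d_ac = np.linalg.norm(a - c)  # 800.0
```

A k-nearest-neighbours model would therefore treat `a` and `b` as near neighbours, even though they disagree strongly on `x0`.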

Scaling input features

Scaling the input features has multiple effects:

  • Removing bias towards a feature
  • Making the range of values compatible with the initial values of the trainable parameters
  • Increasing the interpretability of the trained weights

Scaling targets

Scaling the target is less often necessary because there is usually just one target to predict. But it can be useful when the y values are very large, due to how the trainable parameters are initialized.

Scaling methods

Three different methods of scaling. Left: using the min and max values. Middle: using the mean and standard deviation. Right: using the quantiles and the median.

Normalisation

Probably the most straightforward way to scale data is by simply setting the highest value to 1 and the lowest value to 0. This works quite well to a certain extent; however, it is extremely sensitive to outliers, which makes it very unstable.

x' = \frac{x - \min(x)}{\max(x) - \min(x)}
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_data)                        # learn the per-feature min and max
X_data_scaled = scaler.transform(X_data)  # apply the scaling
# or in one step: X_data_scaled = scaler.fit_transform(X_data)
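A small sketch of the outlier sensitivity mentioned above (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Mostly values between 1 and 10, plus one extreme outlier.
X = np.array([[1.0], [2.0], [5.0], [10.0], [1000.0]])
X_scaled = MinMaxScaler().fit_transform(X)

# The outlier becomes 1.0, while every other value is squashed
# into a tiny sliver near 0 (all below 0.01).
```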

Standardisation

Standard deviation

A common way to measure variability in a dataset is by calculating the standard deviation. This statistic tells us how much individual data points deviate from the mean. The formula for standard deviation (SD or σ) is:

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}

where:

  • μ is the mean (average) of the data,
  • n is the number of samples,
  • x_i represents each individual data point.
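As a small worked example of the formula (with made-up numbers):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mu = sum(data) / n                               # mean = 5.0
squared_devs = sum((x - mu) ** 2 for x in data)  # sum of squared deviations = 32.0

# Sample standard deviation: divide by n - 1
sigma = math.sqrt(squared_devs / (n - 1))        # ~2.14
```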

Degrees of freedom

You might notice the n-1 in the denominator instead of just n. This is not a mistake; it's an important correction.

In statistics, there are two common variations of standard deviation:

  • Dividing by n (population standard deviation)
  • Dividing by n-1 (sample standard deviation)

Why do we do this? Well, what we want is the following:

\sigma_{\text{subset population}} \approx \sigma_{\text{whole population}}

However, the chance that you draw samples close to the middle is much larger than the chance of drawing more remote samples. This results in a higher likelihood of underestimating the standard deviation when the sample size is small.

To somewhat compensate for this bias towards a lower estimate, the n-1 (Bessel's correction) is introduced.

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-\text{dof}}}

Calculations of the standard deviation at different sample sizes and different degrees-of-freedom values (ddof in Python). The lines converge as the sample size increases; degrees of freedom = 1 converges the fastest.
Tip

Pandas uses \frac{1}{n-1}, while SciPy, Scikit-learn and NumPy use \frac{1}{n} (even though SciPy agrees that \frac{1}{n-1} is the more correct one, it keeps it this way for backwards-compatibility reasons). You can change this by setting ddof=1.
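The difference between the pandas and NumPy defaults can be checked directly (with made-up data):

```python
import numpy as np
import pandas as pd

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

np_default = np.std(data)            # ddof=0 -> divides by n      -> 2.0
pd_default = pd.Series(data).std()   # ddof=1 -> divides by n - 1  -> ~2.14

# Make NumPy match pandas by setting ddof explicitly:
np_sample = np.std(data, ddof=1)
```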

A very common way to scale the data is by converting the values to z-scores. This is a bit more robust to the biggest and smallest values, but still quite sensitive to them due to the quadratic component.

x' = \frac{x - \mu}{\sigma}
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_data)
X_data_scaled = scaler.transform(X_data)
# X_data_scaled = scaler.fit_transform(X_data)
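A quick check of what standardisation does to a column (made-up data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_scaled = StandardScaler().fit_transform(X)

# After scaling, the column has mean 0 and standard deviation 1.
# (Note: StandardScaler uses the 1/n population formula, i.e. ddof=0.)
```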

Robust scaling

Quick Ref: Interquartile range
  1. Find the 25th percentile (Q1) and the 75th percentile (Q3) of your data.

  2. Calculate the IQR: IQR = Q_3 - Q_1

  3. Define the outlier boundaries:

    • Lower Bound = Q_1 - 1.5 \cdot IQR
    • Upper Bound = Q_3 + 1.5 \cdot IQR
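The steps above can be sketched like this (with made-up data):

```python
import numpy as np

data = np.array([1, 2, 4, 5, 6, 7, 9, 100])

q1, q3 = np.percentile(data, [25, 75])   # Q1 = 3.5, Q3 = 7.5
iqr = q3 - q1                            # 4.0

lower = q1 - 1.5 * iqr                   # -2.5
upper = q3 + 1.5 * iqr                   # 13.5

outliers = data[(data < lower) | (data > upper)]   # only the value 100
```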

A less common way is to scale using the quantiles: the IQR and the median. However, it is much more stable to outliers than the other two methods.

\begin{align*} x' &= \frac{x - \text{median}(x)}{\text{IQR}(x)} \\ \text{IQR}(x) &= Q_3(x) - Q_1(x) \end{align*}
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

scaler.fit(X_data)
X_data_scaled = scaler.transform(X_data)
# X_data_scaled = scaler.fit_transform(X_data)
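To see the stability against outliers, compare how the robust and standard scalers treat the same (made-up) data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_robust = RobustScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)

# Robust scaling keeps the four normal points well spread out,
# while standardisation squashes them into a tiny range because
# the outlier inflates both the mean and the standard deviation.
robust_spread = X_robust[3, 0] - X_robust[0, 0]        # 1.5
standard_spread = X_standard[3, 0] - X_standard[0, 0]  # < 0.01
```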

Other scalers

Normalisation and standardisation are by far the most popular ones. But, just as we have the robust scaler, there are more: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
