Tuning regularization strength

This post is the fifth in a series applying machine learning techniques to an energy problem. The goal of the series is for me to teach myself machine learning by developing models to forecast the UK Imbalance Price.

I see huge promise in what machine learning can do in the energy industry. This series details my initial efforts to gain an understanding of machine learning.

Part One – What is the UK Imbalance Price?
Part Two – Elexon API Web Scraping using Python
Part Three – Imbalance Price Visualization
Part Four – Multilayer Perceptron Neural Network

In the previous post in this series we introduced a Multilayer Perceptron neural network to predict the UK Imbalance Price. This post digs a bit deeper into optimizing the degree of overfitting of our model, which we do by tuning the strength of regularization.

What is regularization?

Regularization is a tool used to combat the problem of overfitting a model.  Overfitting occurs when a model starts to fit the training data too well – meaning that performance on unseen data is poor.

To prevent overfitting to the training data we can try to keep the model parameters small using regularization. If we include a regularization term in the cost function that the model minimizes, we can encourage the model to use smaller parameters.

The loss function minimized during model training has two terms. The first term is the square of the error. The second is the regularization term, with lambda as the parameter that controls the strength of regularization. To be consistent with scikit-learn, we will refer to this parameter as alpha.
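For m training samples, an L2-regularized squared error loss of this form can be written as:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\alpha}{2m} \sum_{j} \theta_j^2

where h_\theta(x^{(i)}) is the model prediction for sample i and the second sum runs over the model parameters \theta_j.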

Regularization penalizes large values of the model parameters (theta) in proportion to the size of the regularization parameter. Regularization comes in two flavours: L1 and L2. The MLPRegressor model in scikit-learn uses L2 regularization.
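In scikit-learn, alpha is passed when constructing the model. A minimal sketch (the hidden layer size here is just a placeholder, not the architecture used in this post):

from sklearn.neural_network import MLPRegressor

# alpha sets the strength of the L2 penalty on the network weights
# (the scikit-learn default is 0.0001)
model = MLPRegressor(hidden_layer_sizes=(100,), alpha=0.01)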

Setting alpha too large will result in underfitting (also known as a high bias problem).  Setting alpha too small may lead to overfitting (a high variance problem).

Setting alpha in the UK Imbalance Price model

Here we will optimize alpha by iterating through a number of different values.

We can then evaluate the degree of overfitting by looking at how alpha affects the loss function and the Mean Absolute Scaled Error (MASE).  The loss function is the cost function the model minimizes during training.  The MASE is the metric we used to judge model performance.
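As a sketch of how this metric can be computed, the model's mean absolute error is scaled by the in-sample error of a naive forecast. I assume a one-step persistence forecast as the benchmark here:

import numpy as np

def mase(y_true, y_pred, y_train):
    # mean absolute error of the model forecast
    model_mae = np.mean(np.abs(y_true - y_pred))
    # in-sample MAE of a naive persistence forecast
    # (predicting the previous value) scales the error
    naive_mae = np.mean(np.abs(y_train[1:] - y_train[:-1]))
    return model_mae / naive_mae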

We use K-fold cross validation to get a sense of the degree of overfitting. Comparing cross validation performance to training performance gives us an idea of how much our model is overfitting. Using K-fold cross validation also allows us to keep the test data free for evaluating model performance only.

Figures 1 and 2 show the results of optimizing alpha for an MLPRegressor with five hidden layers of 1344 nodes each. The input feature set is the previous week of Imbalance Price data.
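A sketch of the kind of sweep behind these figures, reusing the mase function above. The alpha grid, max_iter and the placeholder data are assumptions, not the exact values used for the figures:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# placeholder data - the real features are one week (336 half-hourly
# settlement periods) of lagged Imbalance Price data
X = np.random.rand(1000, 336)
y = np.random.rand(1000)

alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]  # assumed grid of values
kf = KFold(n_splits=5)

cv_results = {}
for alpha in alphas:
    scores = []
    for train_idx, cv_idx in kf.split(X):
        model = MLPRegressor(hidden_layer_sizes=(1344,) * 5,
                             alpha=alpha, max_iter=500)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[cv_idx])
        scores.append(mase(y[cv_idx], preds, y[train_idx]))
    # average MASE across the folds for this alpha
    cv_results[alpha] = np.mean(scores)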

Figure 1 shows the effect of alpha on the loss function for the training and cross validation sets. We would expect the training loss to increase as alpha increases, since small values of alpha should allow the model to overfit. We would also expect the losses for the training and CV sets to converge as alpha gets large.

Figure 1 – The effect of alpha on the loss function

Figure 1 shows the expected trend, with the training loss increasing as alpha increases, except for alpha = 0.0001, which shows a high training loss. This I don't understand! I was expecting the training loss to decrease with decreasing alpha.

Figure 2 shows the effect of alpha on the Mean Absolute Scaled Error for the training, cross validation and test sets.

Figure 2 – The effect of alpha on the MASE for the training, cross validation and test data

Figure 2 also shows a confusing result. I was expecting the training MASE to be at a minimum for the smallest alpha and to increase as alpha increased, because small values of alpha should allow the model to overfit (and so improve training performance). Instead we see that the best training MASE is at alpha = 0.01.

Figure 2 shows a minimum for the test MASE at alpha = 0.01, which is also the minimum for the training data.

Going forward I will be using a value of 0.01 for alpha as this shows a good balance between minimizing the loss for the training and cross validation sets.

Table 1 shows the results for the model as it currently stands.

Table 1 – Model performance with alpha = 0.01

Training MASE: 0.3345
Cross validation MASE: 0.5890
Test MASE: 0.5212

The next step in this project is to look at previously unseen data for December 2016. Stay tuned.