Tuning Model Structure – Number of Layers & Number of Nodes

Imbalance Price Forecasting is a series applying machine learning to forecasting the UK Imbalance Price.  

In the last post I introduced a new version of the neural network I am building: a feedforward, fully connected network written in Python using Keras.

I’m now working on tuning model hyperparameters and structure. Previously I set up two experiments looking at:

  1. Activation functions
    • concluded rectified linear (relu) is superior to tanh, sigmoid & linear.
  2. Data sources
    • concluded the more data the better.

In this post I detail two more experiments:

  1. Number of layers
  2. Number of nodes per layer

Python improvements

I’ve made two improvements to my Keras script.  An updated script is available on my GitHub.

During training I often saw that the model from the final epoch was not necessarily the best model. I now use a Keras ModelCheckpoint callback that saves the weights of the best model seen during training.
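A minimal sketch of this pattern is below. It assumes "best" means the lowest validation loss and that 20% of the data is held out for validation. The helper name `fit_keeping_best_weights` and the weights file name are hypothetical and not taken from the script on GitHub.

```python
from tensorflow import keras


def fit_keeping_best_weights(model, X, y, epochs, weights_path="best.weights.h5"):
    """Train a compiled Keras model and restore the best weights seen.

    Assumption: the best epoch is the one with the lowest validation loss.
    """
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath=weights_path,
        monitor="val_loss",       # quantity used to rank epochs (assumed)
        mode="min",               # lower validation loss is better
        save_best_only=True,      # only overwrite the file when the monitor improves
        save_weights_only=True,   # store the weights, not the whole model
    )
    history = model.fit(
        X,
        y,
        validation_split=0.2,     # hold-out fraction is an assumption
        epochs=epochs,
        callbacks=[checkpoint],
        verbose=0,
    )
    # Swap the final-epoch weights for the best ones saved during training.
    model.load_weights(weights_path)
    return history
```

Without the final `load_weights` call the model would keep the weights from the last epoch, which is the situation the checkpoint is meant to avoid.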

The second change I have made is to include dropout layers after the input layer and each hidden layer.  This is a better implementation of dropout!

Experiment one – number of layers

Model parameters were:

  • 15,000 epochs. Trained in three batches. 10-fold cross-validation.
  • 2016 Imbalance Price & Imbalance Volume data scraped from Elexon.
  • Feature matrix of lag 48 of Price & Volume, plus a sparse matrix of Settlement Period, Day of the Week and Month.
  • Feed-forward neural network:
    • Input layer with 1000 nodes, fully connected.
    • 0-5 hidden layers with 1000 nodes, fully connected.
    • 1-6 dropout layers. One after the input layer & one after each hidden layer.  30% dropout.
    • Output layer with a single output node, fed by the 1000 nodes of the previous layer.
    • Loss function = mean squared error.
    • Optimizer = adam (default parameters).
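The structure listed above could be built in Keras roughly as follows. This is a sketch under assumptions, not the script from GitHub: the function name `build_model` is hypothetical, relu is assumed from the earlier activation experiment, and the feature count of 163 is a guess at the width of the feature matrix.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(n_features, n_hidden_layers=2, n_nodes=1000, dropout_rate=0.3):
    """Fully connected network with dropout after the input and each hidden layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))

    # "Input layer": first dense layer, followed by its dropout layer.
    model.add(layers.Dense(n_nodes, activation="relu"))
    model.add(layers.Dropout(dropout_rate))

    # 0-5 hidden layers, each followed by its own dropout layer.
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(n_nodes, activation="relu"))
        model.add(layers.Dropout(dropout_rate))

    # Single output node for the price forecast (linear activation).
    model.add(layers.Dense(1))

    model.compile(loss="mean_squared_error", optimizer="adam")
    return model


# Hypothetical feature count: an assumption, not a figure from the post.
model = build_model(n_features=163, n_hidden_layers=2)
```

Varying `n_hidden_layers` from 0 to 5 gives the 1 to 6 dropout layers described in the list.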

Results of the experiments are shown below in Fig. 1 – 3.

Figure 1 – number of layers vs final training loss
Figure 2 – number of layers vs MASE

Figure 1 shows two layers with the smallest training loss.

Figure 2 shows that two layers also has the lowest CV MASE (although it has a high training MASE).

Figure 3 – number of layers vs overfit. Absolute overfit = Test-Training. Relative = Absolute / Test.

In terms of overfitting, two layers shows reasonable absolute & relative overfit.  The low relative overfit is due to the high training MASE: for a constant CV MASE, a higher training MASE shrinks the gap between the two.
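The two overfit measures from the Figure 3 caption can be written as a small helper. The MASE values used here are hypothetical and chosen only to show the effect of a higher training MASE. They are not results from the experiments.

```python
def overfit_metrics(training_mase, test_mase):
    """Return (absolute, relative) overfit as defined in the Figure 3 caption."""
    absolute = test_mase - training_mase
    relative = absolute / test_mase
    return absolute, relative


# Hypothetical values: same test MASE, two different training MASEs.
low_training = overfit_metrics(training_mase=0.30, test_mase=0.50)
high_training = overfit_metrics(training_mase=0.45, test_mase=0.50)
```

With the test MASE held at 0.50, raising the training MASE from 0.30 to 0.45 drops the relative overfit from about 0.4 to about 0.1, even though the model is no better on unseen data.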

My conclusion from this set of experiments is to go forward with a model of two layers.  Increasing the number of layers beyond this doesn’t seem to improve performance.

It is possible that training for more epochs would improve the performance of the more complex networks, which are harder to train.  For the scope of this project I am happy to settle on two layers.
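One way to see how quickly the networks grow is to count trainable parameters. The sketch below assumes the layout from the parameter list (a first dense layer, then the hidden layers, then a single output node). The feature count of 163 is a hypothetical value, and the function name is illustrative.

```python
def count_dense_parameters(n_features, n_hidden_layers, n_nodes=1000):
    """Count weights and biases in the fully connected network.

    Each dense layer has (inputs + 1) * outputs parameters; the +1 is the bias.
    Dropout layers add no trainable parameters.
    """
    first_layer = (n_features + 1) * n_nodes
    hidden_layers = n_hidden_layers * (n_nodes + 1) * n_nodes
    output_layer = n_nodes + 1
    return first_layer + hidden_layers + output_layer


N_FEATURES = 163  # hypothetical feature count, not a figure from the post

for n_hidden in range(0, 6):
    print(n_hidden, count_dense_parameters(N_FEATURES, n_hidden))
```

Each additional 1000-node hidden layer adds (1000 + 1) × 1000 = 1,001,000 parameters, whatever the feature count.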

Experiment two – number of nodes

For the second set of experiments all model parameters were as above except for:

  • 2 hidden layers with 50-1000 nodes.
  • 5 fold cross validation.
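
A sweep like this can be organised as a loop over candidate node counts, with each candidate scored by k-fold cross-validation. The sketch below is plain Python and assumes contiguous, unshuffled folds. The names `kfold_indices`, `sweep_node_counts` and `score_fn` are hypothetical stand-ins rather than the code used for these experiments.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def sweep_node_counts(node_counts, n_samples, n_folds, score_fn):
    """Return the mean cross-validation score for each candidate node count.

    score_fn(n_nodes, train_indices, test_indices) is supplied by the caller;
    it would train a network on the training rows and return a score such as
    MASE on the test rows.
    """
    results = {}
    for n_nodes in node_counts:
        scores = [
            score_fn(n_nodes, train, test)
            for train, test in kfold_indices(n_samples, n_folds)
        ]
        results[n_nodes] = sum(scores) / len(scores)
    return results
```

The node count with the lowest mean score would then be the candidate to carry forward, although the figures also weigh training loss and overfit.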

Figure 4 – number of nodes vs final training loss
Figure 5 – number of nodes vs MASE
Figure 6 – number of nodes vs overfit.  Absolute overfit = Test-Training.  Relative = Absolute / Test

My conclusion from looking at the number of nodes is that 500 nodes per layer is the optimum.

Conclusions

Both parameters could be optimized further using the same kind of parametric sweep.  For the scope of this work I am happy to work with the results of these experiments.

I trained a final model using the optimal parameters.  A network with two layers & 500 nodes per layer achieved a test MASE of 0.455 (versus the previous best of 0.477).

Table 1 – results of the final model fitted (two layers, 500 nodes per layer)

The next post in this series will look at controlling overfitting via dropout.
