Category Archives: Machine Learning

energy_py update – July 2017

energy_py is a collection of reinforcement learning agents and environments for energy systems. You can read the introductory blog post for the project and check out the repo on GitHub.

Saving of memory and value function after each episode

This quality of life improvement has a major impact on the effectiveness of training agents using energy_py. It means an agent can keep learning from experience that occurred during a different training session.

As I train models on my local machine I often can only dedicate enough time for 10 episodes of training. Saving the memory & value functions allows an agent to learn from hundreds of episodes of experience without having to generate them all in a single run.

Running each episode on a different time series

Each episode is now trained on a randomly selected week of the year. It’s much more useful for an agent to experience two different weeks of CHP operation than to experience the same week over and over again. It should also help the agent generalize to operating data it hasn’t seen before.

Double Q-Learning

Building another agent has been on the energy_py to-do list for a long time. I’ve built a Double Q-Learner – based on the algorithm given in Sutton & Barto. The key extension in Double Q-Learning is maintaining two value functions.

The policy is generated using the average of the estimates from both Q networks. One network is then randomly selected for training, using a target created by the other network.

The thinking behind Double Q-Learning is that it avoids the maximization bias of Q-Learning. The maximization operation used to estimate the value of the next state leads to overoptimistic estimates of the value of state-action pairs.
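A minimal tabular sketch of that update, following the Sutton & Barto version (energy_py approximates the value functions with neural networks, but the logic is the same – the learning rate and discount factor here are illustrative):

import random
from collections import defaultdict

Q1, Q2 = defaultdict(float), defaultdict(float)  # the two value functions
alpha, gamma = 0.1, 0.99  # illustrative learning rate & discount factor

def double_q_update(state, action, reward, next_state, actions):
    # randomly select one table to update - the other table supplies the target value
    if random.random() < 0.5:
        best = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, best)]
        Q1[(state, action)] += alpha * (target - Q1[(state, action)])
    else:
        best = max(actions, key=lambda a: Q2[(next_state, a)])
        target = reward + gamma * Q1[(next_state, best)]
        Q2[(state, action)] += alpha * (target - Q2[(state, action)])

def q_estimate(state, action):
    # the policy is generated from the average of both estimates
    return 0.5 * (Q1[(state, action)] + Q2[(state, action)])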

Next major tasks are:
1 – build a policy gradient method – most likely a Monte Carlo policy gradient algorithm,
2 – build a demand side response environment.

Thanks for reading!

A Glance at Q-Learning

‘A Glance at Q-Learning’ is a talk I recently gave at the Data Science Festival in London. I also gave the talk at the Berlin Machine Learning group.

Q-Learning is a reinforcement learning algorithm that DeepMind used to play Atari games – work which some call the first step towards a general artificial intelligence. The original 2013 paper is available here (I cover this paper in the talk).

It was a wonderful experience being able to present – I recommend checking out more of the talks on the Data Science Festival YouTube – all of which are higher quality, more interesting and better presented than mine!

You can download a copy of my slides here – A Glance at Q-Learning slides.

Thanks for reading!

machine learning in energy – part two

This is the second of a two part series on the intersection of machine learning and the energy industry.

machine learning in energy – part one
Introduction, why it’s so exciting, challenges.

machine learning in energy – part two
Time series forecasting, energy disaggregation, reinforcement learning, Google data centre optimization.


This post will detail three exciting applications of machine learning in energy:
– forecasting of electricity generation, consumption and price
– energy disaggregation
– reinforcement learning

We will also take a look at one of the most famous applications of machine learning in an energy system – Google’s work in their own data centers.

Forecasting of electricity generation, consumption and price

What’s the problem

In a modern electricity system time has a massive economic and environmental impact. The temporal variation in electricity generation and consumption can be significant. Periods of high consumption mean generating electricity using expensive & inefficient peaking plants. In periods of low consumption electricity can be so abundant that the price becomes negative.

Electric grid stability requires a constant balance between generation and consumption. Understanding future balancing actions requires accurate forecasts by the system operator.

Our current energy transition is not making the forecasting problem any easier. We are moving away from dispatchable, centralized and large-scale generation towards intermittent, distributed and small-scale generation.

Historically the majority of generation was dispatchable and predictable – making forecasting easy. The only uncertainty was plant outages for unplanned maintenance.

Intermittent generation is by nature hard to forecast. Wind turbine power generation depends on forecasting wind speeds over vast areas. Solar power is more predictable but can still see variation as cloud cover changes.

As grid-scale wind & solar penetration increases, balancing the grid becomes more difficult. Higher levels of renewables can mean more fossil fuel backup kept in reserve in case forecasts are wrong.

It’s not just forecasting of generation that has become more challenging.  The distributed and small scale of many wind & solar plants is also making consumption forecasting more difficult.

A solar panel sitting on a residential home is not directly metered – the system operator has no idea it is there. As this solar panel generates throughout the day it appears to the grid as reduced consumption.

Our current energy transition is a double whammy for grid balancing. Forecasting of both generation and consumption is becoming more challenging.

This has a big impact on electricity prices. In a wholesale electricity market price is set by the intersection of generation and consumption. Volatility and uncertainty on both sides spill over into more volatile electricity prices.

How machine learning will help

Many supervised machine learning models can be used for time series forecasting. Both regression and classification models are able to help understand the future.

Regression models can directly forecast electricity generation, consumption and price. Classification models can forecast the probability of a spike in electricity prices.

Well trained random forests, support vector machines and neural networks can all be used to solve these problems.

A key challenge is data. As renewables are weather driven, weather forecasts can be useful exogenous variables. It’s key that we only train models on data that will be available at the time of the forecast. This means that historical weather forecasts can be more useful than the actual weather data.
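As a concrete sketch of the regression approach, the snippet below builds lagged features from a half-hourly price series and fits a random forest to forecast the final day. The price series here is synthetic – a real model would use actual market data plus exogenous features such as historical weather forecasts:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

prices = pd.Series(np.random.rand(48 * 60) * 50)  # stand-in for a half-hourly price series

def lagged_features(series, lags=48):
    # feature matrix of the previous `lags` half-hourly values
    df = pd.DataFrame({'price': series})
    for lag in range(1, lags + 1):
        df['lag_{}'.format(lag)] = df['price'].shift(lag)
    return df.dropna()

data = lagged_features(prices)
X, y = data.drop('price', axis=1), data['price']

model = RandomForestRegressor(n_estimators=100)
model.fit(X[:-48], y[:-48])            # train on everything except the final day
forecast = model.predict(X[-48:])      # forecast the final 48 settlement periods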

What’s the value to the world

Improving forecasts allows us to better balance the grid, reduce fossil fuels and increase renewables.

It’s not only the economic & environmental cost of keeping backup plant spinning. Incorrect forecasts can lead to fossil fuel generators being paid to reduce output. This increases the cost of supplying electricity to customers.

There are benefits for end consumers of electricity as well. Improved prediction can also allow flexible electricity consumption to respond to market signals.

More accurate forecasts that can look further ahead will allow more electricity consumers to be flexible. Using flexible assets to manage the grid will reduce our reliance on fossil fuels for grid balancing.

Sources and further reading

– Forecasting UK Imbalance Price using a Multilayer Perceptron Neural Network
– Machine Learning in Energy (Fayadhoi Ibrahima)
– 7 reasons why utilities should be using machine learning
– Germany enlists machine learning to boost renewables revolution
– Weron (2014) Electricity price forecasting: A review of the state-of-the-art with a look into the future

Energy disaggregation

What’s the problem

Imagine if every time you went to a restaurant you only got the total bill. Understanding the line by line breakdown of where your money went is valuable. Energy disaggregation can give customers this level of information about their utility bill.

Energy disaggregation estimates appliance level consumption using only total consumption.

In an ideal world we would have visibility of each individual consumer of energy. We would know when a TV is on or a pump is running in an industrial process. One solution would be to install metering on every consumer – a very expensive and complex process.

Energy disaggregation is a more elegant solution. A good energy disaggregation model can estimate appliance level consumption through a single aggregate meter.

How machine learning will help

Supervised machine learning is all about learning patterns in data. Many supervised machine learning algorithms can learn the patterns hidden in the total consumption. Kelly & Knottenbelt (2015) used recurrent and convolutional neural networks to disaggregate residential energy consumption.

A key challenge is data. Supervised learning requires labeled training data. Measurement and identification of sub-consumers forms training data for a supervised learner. Data is also required at a very high temporal frequency – ideally less than one second.
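A minimal sketch of what such a supervised learner looks like, assuming we already have aligned pairs of aggregate-meter windows and the corresponding appliance consumption (the data here is random and the network is a simple dense net, not the recurrent or convolutional architectures used by Kelly & Knottenbelt):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

window = 128  # number of aggregate meter readings fed to the network per example

# hypothetical training data: aggregate windows -> appliance consumption in that window
mains_windows = np.random.rand(1000, window)
appliance_power = np.random.rand(1000, 1)

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=window))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))  # estimated appliance-level consumption
model.compile(loss='mse', optimizer='adam')
model.fit(mains_windows, appliance_power)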

What’s the value to the world

Energy disaggregation has two benefits for electricity consumers. It can identify & verify savings opportunities. It can also increase customer engagement.

Imagine if you got an electricity bill that told you how much it cost you to run your dishwasher that month. The utility could help customers understand what they could have saved if they ran their dishwasher at different times.

This kind of feedback can be very effective in increasing customer engagement – a key challenge for utilities around the world.

Sources and further reading

– 7 reasons why utilities should be using machine learning
– Neural NILM: Deep Neural Networks Applied to Energy Disaggregation
– Energy Disaggregation: The Holy Grail (Carrie Armel)
– Putting Energy Disaggregation Tech to the Test

Reinforcement learning

What’s the problem

Controlling energy systems is hard. Key variables such as price and energy consumption constantly change. Operators control systems with a large number of actions, with the optimal action changing throughout the day.

Our current energy transition is making this problem even harder. The transition is increasing volatility in key variables (such as electricity prices) and the number of actions to choose from.

Today deterministic sets of rules or abstract models are used to guide operation. Deterministic rules for operating any non-stationary system can’t guarantee optimality. Changes in key variables can turn a profitable operation to one that loses money.

Abstract models (such as linear programming) can account for changes in key variables. But abstract models often force unrealistic simplifications of the energy system. More importantly the performance of the model is limited by the skill and experience of the modeler.

How machine learning will help

Reinforcement learning gives a machine the ability to learn to take actions. The machine takes actions in an environment to optimize a reward signal. In the context of an energy system that reward signal could be energy cost, carbon or safety – whatever behavior we want to incentivize.


What is exciting about reinforcement learning is that we don’t need to build any domain knowledge into the model. A reinforcement learner learns from its own experience of the environment. This allows a reinforcement learner to see patterns that we can’t see – leading to superhuman levels of performance.

Another exciting thing about reinforcement learning is that you don’t need a data set. All you need is an environment (real or virtual) that the learner can interact with.

What’s the value to the world

Better control of our energy systems will allow us to reduce cost, reduce environmental impact and improve safety. Reinforcement learning allows us to do this at superhuman levels of performance.

Sources and further reading

– energy_py – reinforcement learning in energy systems
– Mnih et al. (2015) Human-level control through deep reinforcement learning
– Reinforcement learning course by David Silver (Google DeepMind)

Alphabet/Google data centre optimization

One of the most famous applications of machine learning in an energy system is Google’s work in their own data centers.

In 2014 Google used supervised machine learning to predict the Power Usage Effectiveness (PUE) of data centres.

This supervised model did no control of its own. Operators used the predictive model to create a target PUE for the plant. The predictive model also allowed operators to simulate the impact of changes in key parameters on PUE.

In 2016 DeepMind published details of how they applied machine learning to optimizing data centre efficiency. The technical details of this implementation are not as clear as the 2014 work. It is pretty clear that both supervised and reinforcement learning techniques were used.

The focus of the project was again on improving PUE. Deep neural networks predicted future PUE as well as future temperatures & pressures. The predictions of future temperatures & pressures were used to simulate the effect of recommended actions.

DeepMind claim a ’40 percent reduction in the amount of energy used for cooling’ which equates to a ’15 percent reduction in overall PUE overhead after accounting for electrical losses and other non-cooling inefficiencies’. Without seeing actual data it’s hard to know exactly what this means.

What I am able to understand is that this ‘produced the lowest PUE the site had ever seen’.

This is why as an energy engineer I’m so excited about machine learning. Google’s data centers were most likely well optimized before these projects. The fact that machine learning was able to improve PUE beyond what human operators had been able to achieve before is inspiring.

The potential level of savings across the rest of our energy systems is exciting to think about. The challenges & impact of our energy systems are massive – we need the intelligence of machine learning to help us solve these challenges.

Sources and further reading

– Jim Gao (Google) – Machine Learning Applications for Data Center Optimization
– DeepMind AI Reduces Google Data Centre Cooling Bill by 40%

Thanks for reading!

 

machine learning in energy – part one

This is the first of a two part series on the intersection of machine learning and energy.

machine learning in energy – part one
Introduction, why it’s so exciting, challenges.

machine learning in energy – part two
Time series forecasting, energy disaggregation, reinforcement learning, Google data centre optimization.


Technological innovation, environmental politics and international relations all influence the development of our global energy system.  There is one less visible trend that will be one of the most important.  Machine learning is blowing past previous barriers for a wide range of problems.  Computer vision, language processing and decision making have all been revolutionized by machine learning.

I see machine learning as fundamentally new.  Mankind developed using only the intelligence of our own brains until we learned to communicate.  Since then, technologies such as the printing press and the internet have allowed us to access the intelligence of the entire human species.  But machine learning is something different.  We can now access the intelligence of another species – machines.

Part One of this series will introduce what machine learning is, why it’s so exciting and some of the challenges of modern machine learning. Part Two goes into detail on applications of machine learning in the energy industry such as forecasting or energy disaggregation.

What is machine learning

Machine learning gives computers the ability to learn without being explicitly programmed. Computers use this ability to learn patterns in large, high-dimensionality datasets. Seeing these patterns allows computers to achieve results at superhuman levels – literally better than what a human expert can achieve.  This ability has now made machine learning the state of the art for a wide range of problems.

To demonstrate what is different about machine learning, we can compare two landmark achievements in computing & artificial intelligence.

In 1997 IBM’s Deep Blue defeated World Chess Champion Garry Kasparov. Deep Blue ‘derived its playing strength mainly from brute force computing power’. But all of Deep Blue’s intelligence originated in the brains of a team of programmers and chess Grandmasters.

In 2016 Alphabet’s AlphaGo defeated Go legend Lee Sedol 4-1. AlphaGo also made use of a massive amount of computing power. But the key difference is that AlphaGo’s playing strength was not hand-crafted by its programmers. AlphaGo used reinforcement learning to learn from its own experience of the game.

Both of these achievements are important landmarks in computing and artificial intelligence. Yet they are fundamentally different, because machine learning allowed AlphaGo to learn on its own.

Why now

Three broad trends have led to machine learning being the powerful force it is today.

One – Data

It’s hard to overestimate the importance of data to modern machine learning. Larger data sets tend to make machine learning models more powerful. A weaker algorithm with more data can outperform a stronger algorithm with less data.

The internet has brought about a massive increase in the growth rate of data. This volume of data is enabling machine learning models to achieve superhuman performance.

For many large technology companies such as Alphabet or Facebook their data has become a major source of the value of their businesses. A lot of this value comes from the insights that machines can learn from such large data sets.

Two – Hardware

There are two distinct trends in hardware that have been fundamental to moving modern machine learning forward.

The first is the use of graphics processing units (GPUs) and the second is the increased availability of computing power.

In the early 2000s computer scientists began using graphics cards, originally designed for gamers, for machine learning. They discovered massive reductions in training times – from months to weeks or even days.

This speed up is important. Most of our understanding of machine learning is empirical. This knowledge is built up a lot faster by reducing the iteration time for training machine learning models.

The second trend is the availability of computing power. Platforms such as Amazon Web Services or Google Cloud allow on-demand access to a large amount of GPU-enabled computing power.

Access to computing power on demand allows more companies to build machine learning products. It enables companies to shift a capital expense (building data centres) into an operating expense, with all the balance sheet benefits that brings.

Three – Algorithms & tools

I debated whether to include this third trend. It’s really the first two trends (data & hardware) that have unlocked the latent power of machine learning algorithms, many of which are decades old. Yet I still think it’s worth touching on algorithms and tools.

Neural networks form the basis of many state of the art machine learning applications. Neural networks with multiple layers of non-linear processing units (known as deep learning) form the backbone of the most impressive applications of machine learning today. These artificial neural networks are inspired by the biological neural networks inside our brains.

Convolutional neural networks have revolutionised computer vision through a design based on the structure of our own visual cortex. Recurrent neural networks (specifically the LSTM implementation) have transformed sequence & natural language processing by allowing the network to hold state and ‘remember’.

Another key trend in machine learning algorithms is the availability of open source tools. Companies such as Alphabet and Facebook make many of their machine learning tools open source and freely available.

It’s important to note that while these technology companies share their tools, they don’t share their data. This is because data is the crucial element in producing value from machine learning. World-class tools and computing power are not enough to deliver value from machine learning – you need data to make the magic happen.

Challenges

Any powerful technology has downsides and drawbacks.

By this point in the article the importance of data to modern machine learning should be clear. In fact, the dependence of today’s supervised machine learning algorithms on large datasets is itself a weakness. Many techniques simply don’t work on small datasets.

Human beings are able to learn from small amounts of training data – burning yourself once on the oven is enough to learn not to touch it again. Many machine learning algorithms are not able to learn in this way.

Another problem in machine learning is interpretability. A model such as a neural network doesn’t immediately lend itself to explanation. The high dimensionality of the input and parameter space means that it’s hard to pin down cause to effect. This can be difficult when considering using a machine learner in a real world system. It’s a challenge the financial industry is struggling with at the moment.

Related to this is the challenge of a solid theoretical understanding. Many academics and computer scientists are uncomfortable with machine learning. We can empirically test if machine learning is working, but we don’t really know why it is working.

Worker displacement from the automation of jobs is a key challenge for humanity in the 21st century. Machine learning is not required for automation, but it will magnify the impact of automation. Political innovations (such as the universal basic income) are needed to fight the inequality that could emerge from the power of machine learning.

I believe it is possible for us to deploy automation and machine learning while increasing the quality of life for all of society. The move towards a machine intelligent world will be a positive one if we share the value created.

In the context of the energy industry, the major challenge is digitalization.  The energy system is notoriously poor at managing data, so full digitalization is still needed.  By full digitalization I mean a system where everything from sensor level data to prices is accessible to employees & machines, worldwide in near real time.

It’s not about having a local site plant control system and historian setup. The 21st-century energy company should have all data available in the cloud in real time. This will allow machine learning models deployed to the cloud to help improve the performance of our energy system. It’s easier to deploy a virtual machine in the cloud than to install & maintain a dedicated system on site.

Data is one of the most strategic assets a company can own. It’s valuable not only because of the insights it can generate today, but also the value that will be created in the future. Data is an investment that will pay off.

Part Two of this series goes into detail on specific applications of machine learning in the energy industry – forecasting, energy disaggregation and reinforcement learning.  We also take a look at one of the most famous applications of machine learning in an energy system – Google’s work in their own data centers.

Thanks for reading!


energy_py – reinforcement learning in energy systems

energy_py is reinforcement learning in energy systems – a collection of reinforcement learning agents and environments built in Python.

I have a vision of using reinforcement learners to optimally operate energy systems.  energy_py is a step towards this vision.  I’ve built this because I’m so excited about the potential of reinforcement learning in the energy industry.

Reinforcement learning in energy systems requires first proving the concepts in a virtual environment.  This project demonstrates the ability of reinforcement learning to control a virtual energy environment.

What is reinforcement learning

[Figure: supervised vs unsupervised vs reinforcement learning]

Reinforcement learning is the branch of machine learning where learning occurs through action.  Reinforcement learning will give us the tools to operate our energy systems at superhuman levels of performance.

It’s quite different from supervised learning. In supervised learning we start out with a big data set of features and our target. We train a model to replicate this target from patterns in the data.

In reinforcement learning we start out with no data. The agent generates data by interacting with the environment. The agent then learns patterns in this data. These patterns help the agent to choose actions that maximize total reward.
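A toy sketch of that loop – the environment and agent below are stand-ins (not the energy_py API), but they show how experience is generated by interaction rather than taken from a fixed dataset:

import random

class ToyEnv:
    """Stand-in environment: reward is earned by matching a hidden pattern."""
    def reset(self):
        self.step_count = 0
        return self.step_count
    def step(self, action):
        self.step_count += 1
        reward = 1.0 if action == self.step_count % 2 else 0.0
        done = self.step_count >= 48  # one episode = 48 steps
        return self.step_count, reward, done

class RandomAgent:
    """Stand-in agent that acts randomly - a real agent would learn a value function."""
    def __init__(self, actions):
        self.actions, self.memory = actions, []
    def act(self, state):
        return random.choice(self.actions)
    def remember(self, *experience):
        self.memory.append(experience)  # the data the agent learns from

env, agent = ToyEnv(), RandomAgent(actions=[0, 1])
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                     # choose an action
    next_state, reward, done = env.step(action)   # the environment generates the data
    agent.remember(state, action, reward, next_state, done)
    total_reward += reward
    state = next_state
print('total reward for the episode:', total_reward)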

Why do we need reinforcement learning in energy systems

Optimal operation of energy assets is already very challenging. Our current energy transition is making this difficult problem even harder. The rise of intermittent and distributed generation is introducing volatility and increasing the number of actions available to operators.

For a wide range of problems machine learning results are both state of the art and better than human experts. We can get this level of performance using reinforcement learning in our energy systems.

Today many operators use rules or abstract models to dispatch assets. A set of rules is not able to guarantee optimal operation in many energy systems.

Optimal operating strategies can be developed from abstract models. Yet abstract models (such as linear programming) are often constrained. These models are limited to approximations of the actual plant.  Reinforcement learners are able to learn directly from their experience of the actual plant.

Reinforcement learning can also deal with non-linearity. Most energy systems exhibit non-linear behavior (in fact an energy balance is bi-linear!). Reinforcement learning can model non-linearity using neural networks. It is also able to deal with the non-stationary and hidden environment in many energy systems.


There are challenges to be overcome. The first and most important is safety. Safety is the number one concern in any engineering discipline. What is important to understand is that we limit the actions available to the agent. All lower level control systems would remain exactly the same.

There is also the possibility to design the reward function to incentivize safety. A well-designed reinforcement learner could actually reduce hazards to operators.

A final challenge worth addressing is the impact such a learner could have on employment. Machine learning is not a replacement for human operators. A reinforcement learner would not need a reduction in employees to be a good investment.

The value of using a reinforcement learner is to let operations teams do their jobs better. It will allow them to spend more time on, and improve performance of, their remaining responsibilities such as maintaining the plant.  The value created here is a better-maintained plant and a happier workforce – in a plant that is operating with superhuman levels of economic and environmental performance.

Any machine requires downtime – a reinforcement learner is no different. There will still be time periods where the plant will operate in manual or semi-automatic modes with human guidance.

energy_py is one step on a long journey of getting reinforcement learners helping us in the energy industry. The fight against climate change is the greatest that humanity faces. Reinforcement learning will be a key ally in fighting it.

 

energy_py is built using Python.  You can checkout the repository on GitHub here.

Initial Results

I’m really looking forward to getting to know energy_py.  There are a number of parameters to tune. For example the structure of the environment or the design of the reward function can be modified to make the reinforcement learning problem more challenging.

One key design choice is the number of assets the agent has to control.  The more choices available to the agent the more complex the shape of the value function becomes.  To approximate a more complex value function we may need a more complex neural network.

There is also a computational cost incurred with increasing the number of actions.  More actions mean more [state, action] pairs to consider during action selection and training (both of which require value function predictions).
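A sketch of why the cost scales with the number of actions – greedy action selection needs one value function prediction per available action (the `value_function` object with a Keras/scikit-learn style `predict` method is an assumption, not the exact energy_py code):

import numpy as np

def greedy_action(value_function, state, actions):
    # one [state, action] input per available action - more actions means
    # more predictions every time we select an action or create training targets
    inputs = np.array([np.concatenate([state, action]) for action in actions])
    q_estimates = value_function.predict(inputs).flatten()
    return actions[int(np.argmax(q_estimates))]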

So far I’ve been experimenting with an environment based on two assets – a 7.5 MWe gas turbine and a 7.5 MWe gas engine.  The episode length is set to 336 steps (one week).  I run a single naive episode, 30 ε-greedy episodes and a single greedy episode.
Figure 1 below shows the total reward per episode increasing as the agent improves its estimate of the value function and spends less time exploring.
Figure 1 – Epsilon decay and total reward per episode

Figure 2 shows the Q-test and the network training history.  Q-test is the average of three random [state, action] pairs evaluated by the value function.  It shows how the value function approximation changes over time.

Figure 2 – Q-Test and the network training history

Figure 3 shows some energy engineering outputs.  I’m pretty happy with this operating regime – the model is roughly following both the electricity price and the heat demand, which is expected behaviour.

Figure 3 – Energy engineering outputs for the final greedy run

 One interesting decision to make is how often to improve the approximation of the value function.  David Silver makes the point that you should make use of the freshest data when training – hence training after each step through the environment.  He also makes the point that you don’t need to fully train the network – just train a ‘little bit’.

This makes sense as the distribution of the data (the replay memory) will change as the learner trains its value function.  I train on a 64 sample batch of the replay memory.  I perform 100 passes over the entire data set (i.e. 100 epochs).   Both values can be optimized.  It could make sense to train for more epochs in later episodes as we want to fit the data more than in earlier episodes.
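A sketch of that training step – sample a small batch from the replay memory and fit the value function for a fixed number of epochs (the memory layout and the Keras-style `fit` call are assumptions, not the exact energy_py code):

import random
import numpy as np

def train_value_function(value_function, memory, batch_size=64, epochs=100):
    # memory is a list of ([state, action], target) pairs built from experience
    batch = random.sample(memory, min(batch_size, len(memory)))
    X = np.array([state_action for state_action, _ in batch])
    y = np.array([target for _, target in batch])
    # train a 'little bit' each step - the replay memory distribution keeps changing
    value_function.fit(X, y, epochs=epochs, verbose=0)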

Another challenge in energy_py is balancing exploration versus exploitation.  The Q_learner algorithm handles this dilemma using ε-greedy action selection.  I decay epsilon at a fixed rate – the optimal selection of this parameter is something I’ll take a look at in the future.

There are many exciting innovations developed recently in reinforcement learning that I’m keen to add to energy_py.  One example is the idea of Prioritized Experience Replay – where the batch is not taken randomly from the replay memory but instead prioritizes some samples over others.

It’s unlikely that I’ll ever catch up to the state of the art in reinforcement learning – what I hope to find is that we don’t need state of the art techniques to get superhuman performance from energy systems!

My reinforcement learning journey

I’m a chemical engineer by training (B.Eng, MSc) and an energy engineer by profession. I’m really excited about the potential of machine learning in the energy industry – in fact that’s what this blog is about!

My understanding of reinforcement learning has come from a variety of resources. I’d like to give credit to all of the wonderful resources I’ve used to understand reinforcement learning.

Sutton & Barto – Reinforcement Learning: An Introduction – the bible of reinforcement learning and a classic machine learning text.

Playing Blackjack with Monte Carlo Methods – I built my first reinforcement learning model to operate a battery using this post as a guide. This post is part two of an excellent three part series. Many thanks to Brandon of Δ ℚuantitative √ourney.

RL Course by David Silver – over 15 hours of lectures from David Silver, who leads reinforcement learning research at Google DeepMind. An amazing resource from a brilliant mind and brilliant teacher.

Deep Q-Learning with Keras and gym – a great blog post that showcases code for a reinforcement learning agent to control an OpenAI Gym environment. Useful both for the gym integration and for using Keras to build a non-linear value function approximation. Many thanks to Keon Kim – check out his blog here.

Artificial Intelligence and the Future – Demis Hassabis is the co-founder and CEO of Google DeepMind.  In this talk he gives some great insight into the AlphaGo project.

Mnih et al. (2013) Playing Atari with Deep Reinforcement Learning – to give you an idea of the importance of this paper, Google purchased DeepMind after it was published.  DeepMind was a company with no revenue, no customers and no product – valued by Google at $500M!  This is a landmark paper in reinforcement learning.

Mnih et al. (2015) Human-level control through deep reinforcement learning – an update to the 2013 paper, published in Nature.

I would also like to thank Data Science Retreat.  I’m just finishing up the three month immersive program – energy_py is my project for the course.  Data Science Retreat has been a fantastic experience and I would highly recommend it.  The course is a great way to invest in yourself, develop professionally and meet amazing people.

That’s it from me – thanks for reading!

Tuning Model Structure – Number of Layers & Number of Nodes

Imbalance Price Forecasting is a series applying machine learning to forecasting the UK Imbalance Price.  

Last post I introduced a new version of the neural network I am building.  This new version is a feedforward fully connected neural network written in Python built using Keras.

I’m now working on tuning model hyperparameters and structure. Previously I setup two experiments looking at:

  1. Activation functions
    • concluded rectified linear (relu) is superior to tanh, sigmoid & linear.
  2. Data sources
    • concluded more data the better.

In this post I detail two more experiments:

  1. Number of layers
  2. Number of nodes per layer

Python improvements

I’ve made two improvements to my implementation of Keras.  An updated script is available on my GitHub.

I often saw during training that the model trained on the last epoch was not necessarily the best model. I have made use of a ModelCheckpoint that saves the weights of the best model trained.

The second change I have made is to include dropout layers after the input layer and each hidden layer.  This is a better implementation of dropout!
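A sketch of both changes in Keras – the callback keeps the best weights seen during training rather than the final epoch, and dropout layers sit under the input and each hidden layer (the layer sizes and feature count here are illustrative, not the exact project configuration):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import ModelCheckpoint

model = Sequential()
model.add(Dense(1000, input_dim=144, activation='relu'))  # input layer (144 features is illustrative)
model.add(Dropout(0.3))                                   # dropout under the input layer
model.add(Dense(1000, activation='relu'))                 # hidden layer
model.add(Dropout(0.3))                                   # dropout under each hidden layer
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

# save the weights of the best model seen during training, not just the last epoch
checkpoint = ModelCheckpoint('best_weights.hdf5', monitor='val_loss',
                             save_best_only=True, save_weights_only=True)
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[checkpoint])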

Experiment one – number of layers

Model parameters were:

  • 15,000 epochs. Trained in three batches. 10 folds cross-validation.
  • 2016 Imbalance Price & Imbalance Volume data scraped from Elexon.
  • Feature matrix of lag 48 of Price & Volume & Sparse matrix of Settlement Period, Day of the week and Month.
  • Feed-forward neural network:
    • Input layer with 1000 nodes, fully connected.
    • 0-5 hidden layers with 1000 nodes, fully connected.
    • 1-6 dropout layers. One under input & each hidden layer.  30% dropout.
    • Output layer: a single output node fed by the final 1000-node layer.
    • Loss function = mean squared error.
    • Optimizer = adam (default parameters).

Results of the experiments are shown below in Fig. 1 – 3.

Figure 1 – number of layers vs final training loss
Figure 2 – number of layers vs MASE

Figure 1 shows two layers with the smallest training loss.

Figure 2 shows that two layers also gives the lowest CV MASE (although a high training MASE).

Figure 3 – number of layers vs overfit. Absolute overfit = Test-Training. Relative = Absolute / Test.

In terms of overfitting two layers shows reasonable absolute & relative overfit.  The low relative overfit is due to a high training MASE (which minimizes the overfit for a constant CV MASE).

My conclusion from this set of experiments is to go forward with a model of two layers.  Increasing the number of layers beyond this doesn’t seem to improve performance.

It is possible that training for more epochs may improve the performance of the more complex networks which will be harder to train.  For the scope of this project I am happy to settle on two layers.

Experiment two – number of nodes

For the second set of experiments all model parameters were all as above except for:

  • 2 hidden layers with 50-1000 nodes.
  • 5 fold cross validation.
Figure 4 – number of nodes vs final training loss
Figure 5 – number of nodes vs MASE
Figure 6 – number of nodes vs overfit.  Absolute overfit = Test-Training.  Relative = Absolute / Test

My conclusion from looking at the number of nodes is that 500 nodes per layer is the optimum result.

Conclusions

Both parameters can be further optimized using the same parametric optimization.  For the scope of this work I am happy to work with the results of these experiments.

I trained a final model using the optimal parameters.  A two layer & 500 node network achieved a test MASE of 0.455 (versus the previous best of 0.477).

Table 1 – results of the final model fitted (two layers, 500 nodes per layer)

The next post in this series will look at controlling overfitting via dropout.

Monte Carlo Q-Learning to Operate a Battery

I have a vision for using machine learning for optimal control of energy systems.  If a neural network can play a video game, hopefully it can understand how to operate a power plant.

In my previous role at ENGIE I built Mixed Integer Linear Programming models to optimize CHP plants.  Linear Programming is effective in optimizing CHP plants but it has limitations.

I’ll detail these limitations in future post – this post is about Reinforcement Learning (RL).  RL is a tool that can solve some of the limitations inherent in Linear Programming.

In this post I introduce the first stage of my own RL learning process. I’ve built a simple model to charge/discharge a battery using Monte Carlo Q-Learning. The script is available on GitHub.

I made use of two excellent blog posts to develop this – both give a good introduction to RL.

Features of the script
 

As I don’t have access to a battery system I’ve built a simple model within Python.  The battery model takes as inputs the state at time t, the action selected by the agent and returns a reward and the new state.  The reward is the cost/value of electricity charged/discharged.

def battery(state, action):  # the technical model
    # battery can choose to :
    #    discharge 10 MWh (action = 0)
    #    charge 10 MWh or (action = 1)
    #    do nothing (action = 2)

    charge = state[0]  # our charge level
    SP = state[1]  # the current settlement period
    action = action  # our selected action
    prices = getprices()
    price = prices[SP - 1]  # the price in this settlement period

    if action == 0:  # discharging
        new_charge = charge - 10
        new_charge = max(0, new_charge)  
        charge_delta = charge - new_charge
        reward = charge_delta * price
    if action == 1:  # charging
        new_charge = charge + 10
        new_charge = min(100, new_charge)
        charge_delta = charge - new_charge
        reward = charge_delta * price
    if action == 2:  # nothing
        charge_delta = 0
        reward = 0

    new_charge = charge - charge_delta
    new_SP = SP + 1
    state = (new_charge, new_SP)
    return state, reward, charge_delta

The price of electricity varies throughout the day.
The model is not fed this data explicitly – instead it learns through interaction with the environment.
 
One ‘episode’ is equal to one day (48 settlement periods).  The model runs through thousands of iterations of episodes and learns the value of taking a certain action in each state.  
 
Learning occurs by apportioning the reward for the entire episode to every state/action that occurred during that episode. While this method works, more advanced methods do this in better ways.
def updateQtable(av_table, av_count, returns):
    # updating our Q (aka action-value) table
    # incremental average of the returns seen for each state/action pair
    for key in returns:
        av_table[key] = av_table[key] + (1 / av_count[key]) * (returns[key] - av_table[key])
    return av_table
The model uses an epsilon-greedy method for action selection.  Epsilon is decayed as the number of episodes increases.
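A minimal sketch of epsilon-greedy selection with decay (the decay rate is illustrative and the action-value table is assumed to be keyed on (state, action) pairs):

import random

def select_action(av_table, state, actions, epsilon):
    # explore with probability epsilon, otherwise exploit the action-value table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda action: av_table[(state, action)])

epsilon, decay = 1.0, 0.999  # illustrative starting value & decay rate
for episode in range(5000):
    # ... run one episode using select_action ...
    epsilon *= decay  # explore less as the number of episodes increases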
Results
 
Figure 1 below shows the optimal dispatch for the battery model after training for 5,000 episodes.
Figure 1 – Electricity prices [£/MWh] and the optimal battery dispatch profile [%]
I’m happy the model is learning well. Charging occurs during periods of low electricity prices. It is also fully draining the battery at the end of the day – which is logical behavior to maximise the reward per episode.  
 

Figure 2 below shows the learning progress of the model.

Figure 2 – Model learning progress
Next steps
 
Monte Carlo Q-learning is a good first start for RL. It’s helped me to start to understand some of the key concepts.
 
Next steps will be developing more advanced Q-learning methods using neural networks.

UK Imbalance Price Forecasting using Keras & TensorFlow

Previously I introduced a feedforward neural network model built using scikit-learn in Python. This post details an upgraded version which utilizes Keras, TensorFlow and Plotly.

Keras is a more flexible package for building neural networks. It has the ability to train models on GPU by using Google’s Tensorflow as the backend. I still use useful scikit-learn functions for splitting data into cross-validation and test/train sets.

As I’m becoming more comfortable with Python the quality of these scripts is improving. I’m using list comprehensions and lambda functions to make my code faster and cleaner. It’s a great feeling to rewrite parts of the script using new knowledge.

The full code is available on my GitHub repository for this project – I will also include a copy in this post.  The script philosophy is about running experiments on model structure and hyperparameters.

Features of this script

I needed to use a few workarounds to get TensorFlow to train models on the GPU.  First I had to install the tensorflow-gpu package (not tensorflow) through pip.  I then needed to access the TensorFlow session through the back end:

import keras.backend.tensorflow_backend as KTF

We also need to wrap our Keras model with a statement that sets the TensorFlow device as the GPU:

with KTF.tf.device('gpu:0'):

Finally I modify the Tensorflow session:

KTF.set_session(KTF.tf.Session(config=KTF.tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)))

These three things together got my Keras model training on my GPU.

Another workaround I had to include was a way to reset Keras models. I use K-fold cross validation to tune model hyperparameters. This leaves the test set free for use only in determining the generalization error of the model. It is bad machine learning practice to use the test set to tune hyperparameters.

K-fold cross-validation uses a loop to iterate through the training data K times. I found that the model weights were not being reset after each loop – even when I used a function to build & compile the model outside of the loop. The solution I used involves storing the initial weights then randomly permuting them each time we call the function.
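A sketch of the pattern – capture the freshly initialized weights once, then shuffle them back in at the start of every fold so no fold benefits from the previous fold's training (`build_model` is an assumed helper, `shuffle_weights` is the function included in the full script at the end of this post, and the data is assumed to be numpy arrays):

from sklearn.model_selection import KFold

model = build_model()                   # assumed helper returning a compiled Keras model
initial_weights = model.get_weights()   # capture the starting point once

cv_scores = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(X_train):
    shuffle_weights(model, initial_weights)  # reset the network before each fold
    model.fit(X_train[train_idx], y_train[train_idx], verbose=0)
    cv_scores.append(model.evaluate(X_train[val_idx], y_train[val_idx], verbose=0))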

The previous scikit-learn script used L2-regularization to control overfitting. In this new version I make use of a Keras dropout layer. Dropout randomly ignores selected neurons during training.

This means that their contribution to the activation of downstream neurons is removed on the forward pass. Any weight updates will not be applied to the neuron on the backwards pass. This should result in the network being less sensitive to the specific weights of neurons.

I’ve also made this script create new folders for each experiment and each individual model run. For each run this script saves the features and a plot of the forecast versus the actual value.

For the experiment it saves the results (MASE, MAE, training loss etc) for all runs and a plot of the training history for each run.

The last major step I have taken is starting to use Plotly. Plotly allows saving of interactive graph objects as HTML documents and uploading of graphs to their server. This has been a major step up from matplotlib as the graphs created are interactive.

Model Results

The next few posts in this series will be using this script to optimize the structure and hyperparameters.

The first experiment I ran was on activation functions.  The following parameters were used:

  • 2016 Imbalance Price & Imbalance Volume data scraped from Elexon
  • Feature matrix of lag 48 of Price & Volume & Sparse matrix of Settlement Period, Day of the week and Month
  • 8 layer neural network:
    • Input layer with 1000 nodes, fully connected
    • Dropout layer after input layer with 30 % dropout
    • Five hidden layers with 1000 nodes, fully connected
    • Output layer: a single output node fed by the final 1000-node hidden layer
    • Loss function = mean squared error
    • Optimizer = adam

I ran four experiments with different activation functions (the same function in each layer).

Table 1 – Experiment on activation functions (MASE = Mean Absolute Scaled Error)

Activation Function | Training MASE | Test MASE | Training Loss
relu                | 0.17          | 0.47      | 86
tanh                | 0.36          | 0.49      | 1,030
sigmoid             | 0.71          | 0.72      | 2,183
linear              | 0.60          | 0.62      | 2,269

Table 1 clearly shows that the rectifier (relu) activation function is superior for this task.
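For reference, MASE scales the model's mean absolute error by the error of a naive forecast, so values below 1.0 beat the naive benchmark. A sketch, assuming the naive forecast is simply the previous value (a seasonal naive at lag 48 would be an equally reasonable choice – the exact scaling used in this project may differ):

import numpy as np

def mase(actual, forecast, naive_lag=1):
    # mean absolute error of the model, scaled by that of a naive lag forecast
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    mae_model = np.mean(np.abs(actual - forecast))
    mae_naive = np.mean(np.abs(actual[naive_lag:] - actual[:-naive_lag]))
    return mae_model / mae_naive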

I ran a second experiment changing the data included in the feature matrix.  The sparse features are boolean variables for the Settlement Period, the day (Monday – Sunday) and the month.  These are included so that the model can capture trend and seasonality.

The models I ran in this experiment use the same parameters as Experiment One – except the data included in the feature matrix is changed:

  1. Lagged Imbalance Price only
  2. Lagged Imbalance Price & Imbalance Volume
  3. Lagged Imbalance Price & Sparse Features
  4. Lagged Imbalance Price, Imbalance Volume & Sparse features
Table 2 – Experiment on data sources (MASE = Mean Absolute Scaled Error)

Features                | Training MASE | CV MASE | Test MASE | CV MASE improvement
Price only              | 0.25          | 0.50    | 0.46      | 0.0%
Price & volume          | 0.21          | 0.49    | 0.47      | 3.5%
Price & sparse          | 0.17          | 0.50    | 0.49      | 0.7%
Price, volume & sparse  | 0.15          | 0.47    | 0.45      | 6.1%

As expected increasing the amount of data for the model to learn from improves performance.  It’s very interesting how the improvement for Run 4 is more than the sum of the improvements for Runs 2 & 3!

It’s also a validation that the feature data has been processed correctly.  If adding data had made the results worse, it would suggest a mistake was made during processing.

Table 2 shows a large difference between the Training & Test MASE.  This indicates overfitting – something we will try to fix in future experiments by working with model structure and dropout.

So conclusions from our first two experiments – use the relu activation function and put in as many useful features as possible.

A quick look at the forecast for Run 4 of Experiment Two for the last few days in December 2016:

Figure 1 – The forecast versus actual Imbalance Price for the end of 2016

Click here to interact with the full 2016 forecast.

Next Steps

I’ll be doing more experiments on model structure and hyperparameters.  Experiments I have lined up include:

  • Model structure – number of layers & nodes per layer
  • Dropout fraction
  • Size of lagged data set

I’ll also keep working on improving the quality of code!

The Script

"""
ADG Efficiency
2017-01-31

This model runs experiments on model parameters on a neural network.
The neural network is built in Keras and uses TensorFlow as the backend.

To setup a model run you need to look at
- mdls_D
- mdls_L
- machine learning parameters
- pred_net_D

This script is setup to run an experiment on the sources
of data used as model features.
"""

import os
import pandas as pd
import numpy as np
import sqlite3
import pickle
import datetime

# importing some useful scikit-learn stuff
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# importing Keras & Tensorflow
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import adam, RMSprop, SGD
from keras.metrics import mean_absolute_error
import keras.backend.tensorflow_backend as KTF # https://groups.google.com/forum/#!topic/keras-users/MFUEY9P1sc8

import tensorflow as tf

# importing plotly
import plotly as py
import plotly.graph_objs as go 
import plotly.tools as tls 
tls.set_credentials_file(username="YOUR USERNAME", api_key="YOUR API KEY")

# imported from another script 
from metrics import MASE 

# workaround found on stack overflow
# http://stackoverflow.com/questions/40046619/keras-tensorflow-gives-the-error-no-attribute-control-flow-ops
tf.python.control_flow_ops = tf 

# grabbing the TF session from Keras backend
sess = KTF.get_session() 

# function that inputs the database path 
def get_data(db_path,table):
 conn = sqlite3.connect(db_path)
 data = pd.read_sql(sql='SELECT * from ' + str(table),con = conn)
 conn.close()
 print('got sql data')
 return data # returns a df with all table data

# setting base directory
base_path = os.path.dirname(os.path.abspath(__file__))

# location of SQL database 
sql_path = os.path.join(base_path,'data','ELEXON DATA.sqlite')
 
# saving run time for use in creating folder for this run of experiments
run_name = str(datetime.datetime.now())
run_name = run_name.replace(':','-')
run_path = str('Experiment Results//' + run_name)
results_csv_path = os.path.join(base_path, run_path, 'results.csv')
history_csv_path = os.path.join(base_path, run_path, 'histories.csv')

# dictionary containing all of the infomation for each report
data_price_D = {'Imbalance Price':{'Report name':'B1770','Data name':'imbalancePriceAmountGBP'}}

data_price_vol_D = {'Imbalance Price':{'Report name':'B1770','Data name':'imbalancePriceAmountGBP'},
 'Imbalance Volume':{'Report name':'B1780','Data name':'imbalanceQuantityMAW'}}

# list of model names - done manually to control order & which models run
mdls_L = ['price, vol & sparse'] # 'price only', 'price & volume', 'price & sparse', 

# dictionary of mdls with parameters
# FL = first lag. LL = last lag. SL = step size, 
# Sparse? = to include sparse data for trend/seasonality or not, 
# Data dict = which dictionary of data to use for the lagged time series
mdls_D = {'price only':{'date_start':'01/01/2016 00:00','date_end':'31/12/2016 00:00','FL':48,'LL':48*2,'SL':48,'Sparse?':False,'Data dict' : data_price_D},
'price & volume':{'date_start':'01/01/2016 00:00','date_end':'31/12/2016 00:00','FL':48,'LL':48*2,'SL':48,'Sparse?':False,'Data dict' : data_price_vol_D},
'price & sparse':{'date_start':'01/01/2016 00:00','date_end':'31/12/2016 00:00','FL':48,'LL':48*2,'SL':48,'Sparse?':True,'Data dict' : data_price_D},
'price, vol & sparse':{'date_start':'01/01/2016 00:00','date_end':'31/12/2016 00:00','FL':48,'LL':48*2,'SL':48,'Sparse?':True,'Data dict' : data_price_vol_D}}

# machine learning parameters
n_folds = 10 # number of folds used in K-fold cross validation
epochs = 200 # number of epochs used in training 
random_state_split = 5
random_state_CV = 3

# dataframes for use in tracking model results
results_cols = ['Model name','CV MASE','Training MASE','Test MASE','MASE',
 'Training MAE','Test MAE','MAE','CV Loss','Training Loss',
 'Training History','Error Time Series']
results_DF = pd.DataFrame(columns = results_cols)
results_all_DF = results_DF 
history_all_DF = pd.DataFrame()

# for loop to iterate through models
for mdl_index, mdl in enumerate(mdls_L):
 mdl_params = mdls_D[mdl]
 
 # resetting dataframes for storing results from this model run
 results_DF = pd.DataFrame(index = [mdl], columns = results_cols)
 history_DF = pd.DataFrame()

 # creating folder for this run
 exp_path = str(run_path + '/' + str(mdl))
 os.makedirs(exp_path)

 # setting model parameters
 date_start = mdl_params['date_start']
 date_end = mdl_params['date_end']
 first_lag = mdl_params['FL']
 last_lag = mdl_params['LL']
 step_lag = mdl_params['SL']
 include_sparse = mdl_params['Sparse?']
 data_D = mdl_params['Data dict']
 
 # unpacking dictionary of data to be used 
 data_sources = [key for key in data_D] # ie Imbalance Price, Imbalance Volume
 
 # getting the imbalance price first
 if data_sources[0] != 'Imbalance Price':
     price_loc = data_sources.index('Imbalance Price')
     first_source = data_sources[0]
     data_sources[price_loc] = first_source
     data_sources[0] = 'Imbalance Price'
 print(data_sources)

 table_names = [data_D[item]['Report name'] for item in data_sources] 
 data_col_names = [data_D[item]['Data name'] for item in data_sources]
 
 # getting data from SQL
 data_L = [get_data(db_path = sql_path, table = data_D[data_source]['Report name']) for data_source in data_sources]

 # list of the index objects
 indexes_L = [pd.to_datetime(raw_data['index']) for raw_data in data_L] 
 
 # list of the settlement periods
 SP_L = [raw_data['settlementPeriod'] for raw_data in data_L]

 # list of the actual data objects
 data_objs_L = [raw_data[data_col_names[index]].astype(float) for index, raw_data in enumerate(data_L)]
 
 # indexing these data objects
 for i, series in enumerate(data_objs_L):
     df = pd.DataFrame(data=series.values, index=indexes_L[i],columns=[series.name])
     data_objs_L[i] = df

 # creating feature dataframe - gets reset every model run
 data_DF = pd.DataFrame()
 for data_index, data_obj in enumerate(data_objs_L):
     # creating lagged data frame (make this a function)
     for i in range(first_lag,last_lag,step_lag):
         lag_start, lag_end, lag_step = 1, i, 1
         # creating the lagged dataframe
         data_name = data_obj.columns.values[0]
         lagged_DF = pd.DataFrame(data_obj)

         for lag in range(lag_start,lag_end+1,lag_step):
             lagged_DF[str(data_name) + ' lag_'+str(lag)] = lagged_DF[data_name].shift(lag)
         print(lagged_DF.columns)
         lagged_DF = lagged_DF[lag_end:] # slicing off the dataframe
         index = lagged_DF.index # saving the index

         data_DF = pd.concat([data_DF,lagged_DF],axis=1) # creating df with data
 
 SP = SP_L[0] 
 SP = SP[(len(SP)-len(data_DF)):].astype(float) # slicing our settlement periods 
 
 # creating our sparse matricies for seasonality & trend
 date = index
 # creating label binarizer objects 
 encoder_SP = LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
 encoder_days = LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
 encoder_months = LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
 # creating sparse settelment period feature object
 encoder_SP.fit(SP)
 encoded_SP = encoder_SP.transform(SP)
 SP_features = pd.DataFrame(encoded_SP, index, columns = list(range(1,51)))
 # creating sparse day of the week feature object
 days = list(map(lambda x: x.weekday(), date))
 encoder_days.fit(days)
 encoded_days = encoder_days.transform(days)
 days_features = pd.DataFrame(encoded_days, index = index, 
 columns = ['Mo','Tu','We','Th','Fr','Sa','Su'])
 # creating sparse month feature object
 months = list(map(lambda x: x.month, date))
 encoder_months.fit(months)
 encoded_months = encoder_months.transform(months)
 months_features = pd.DataFrame(encoded_months, index = index, 
 columns = ['Ja','Feb','Mar','Ap','Ma','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
 
 sparse_features = pd.concat([SP_features,days_features, months_features],axis=1) 
 
 print('incl sparse is ' + str(include_sparse))
 if include_sparse == True:
     print('including sparse')
     data_DF = pd.concat([data_DF,sparse_features],axis=1)
 
 # saving our feature matrix to a csv for checking
 features_path = os.path.join(base_path, exp_path, 'features.csv')
 data_DF.to_csv(features_path) 
 
 # creating our target matrix (the imbalance price)
 y = data_DF['imbalancePriceAmountGBP']
 y.reshape(1,-1)
 
 # dropping out the actual values from our data
 for data_col_name in data_col_names:
     data_DF = data_DF.drop(data_col_name,1)
 
 # setting our feature matrix 
 X = data_DF
 
 # splitting into test & train
 # keeping the split the same for different model runs
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = random_state_split)
 split_D = {'X_train' : len(X_train), 'X_test' : len(X_test)}
 
 # standardizing our data 
 X_scaler = StandardScaler()
 X_train = X_scaler.fit_transform(X_train)
 X_test = X_scaler.transform(X_test)
 X_all = X_scaler.transform(X)
 
 # reshaping
 X_train = np.asarray(X_train)
 y_train = np.asarray(y_train).flatten()
 
 X_test = np.asarray(X_test)
 y_test = np.asarray(y_test).flatten()
 
 X = np.asarray(X) # not sure if I use this anywhere - also this data is not standardized 
 y = np.asarray(y).flatten()
 
 # saving our scaler objects for use later
 pickle.dump(X_scaler, open(os.path.join(base_path,'pickle', 'X_scaler - ' + mdl + '.pkl'), 'wb'), protocol=4) 
 
 # length of the inputs
 training_samples = X_train.shape[0]
 input_length = X_train.shape[1] # should probably rename this to width....
 batch_size = int(training_samples/4)

 pred_net_D = {'price only':{'Layer 1' : Dense(1000, input_dim = input_length, activation = 'relu'), 'Layer 2' : Dense(1000, activation='relu'), 'Layer 3' : Dense(1000, activation='relu'), 'Layer 4' : Dense(output_dim = 1, activation='linear'),'Dropout fraction':0.3},
 'price & volume':{'Layer 1' : Dense(1000, input_dim = input_length, activation = 'relu'), 'Layer 2' : Dense(1000, activation='relu'), 'Layer 3' : Dense(1000, activation='relu'), 'Layer 4' : Dense(output_dim = 1, activation='linear'),'Dropout fraction':0.3},
 'price & sparse':{'Layer 1' : Dense(1000, input_dim = input_length, activation = 'relu'), 'Layer 2' : Dense(1000, activation='relu'), 'Layer 3' : Dense(1000, activation='relu'), 'Layer 4' : Dense(output_dim = 1, activation='linear'),'Dropout fraction':0.3}, 
 'price, vol & sparse':{'Layer 1' : Dense(1000, input_dim = input_length, activation = 'relu'), 'Layer 2' : Dense(1000, activation='relu'), 'Layer 3' : Dense(1000, activation='relu'), 'Layer 4' : Dense(output_dim = 1, activation='linear'),'Dropout fraction':0.3} 
 }
 
 # defining layers for prediction network
 # each hidden layer is its own Dense instance - reusing a single layer object
 # in a Sequential model would share weights between positions in the network
 pred_net_params = pred_net_D[mdl]
 layer1 = pred_net_params['Layer 1']
 layer2 = pred_net_params['Layer 2']
 layer3 = pred_net_params['Layer 3']
 layer4 = Dense(1000, activation='relu')
 layer5 = Dense(1000, activation='relu')
 layer6 = Dense(1000, activation='relu')
 layer10 = pred_net_params['Layer 4']  # linear output layer
 dropout_fraction = pred_net_params['Dropout fraction']

 with KTF.tf.device('gpu:0'):  # force tensorflow to train on GPU
     KTF.set_session(KTF.tf.Session(config=KTF.tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)))

 def get_model():  # use this function so I can recreate new network within CV loop
     network = Sequential()
     network.add(layer1)
     network.add(Dropout(dropout_fraction))
     network.add(layer2)
     network.add(layer3)
     network.add(layer4)
     network.add(layer5)
     network.add(layer6)
     network.add(layer10)
     return network
 
 # https://gist.github.com/jkleint/eb6dc49c861a1c21b612b568dd188668
 def shuffle_weights(model, weights=None):
     """Randomly permute the weights in `model`, or the given `weights`.
     This is a fast approximation of re-initializing the weights of a model.
     Assumes weights are distributed independently of the dimensions of the weight tensors
     (i.e., the weights have the same distribution along each dimension).
     :param Model model: Modify the weights of the given model.
     :param list(ndarray) weights: The model's weights will be replaced by a random permutation of these weights.
         If `None`, permute the model's current weights.
     """
     if weights is None:
         weights = model.get_weights()
     weights = [np.random.permutation(w.flat).reshape(w.shape) for w in weights]
     # Faster, but less random: only permutes along the first dimension
     # weights = [np.random.permutation(w) for w in weights]
     model.set_weights(weights)
 
 optimizer = adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

 # cross validation
 print('Starting Cross Validation')
 CV_network = get_model()
 CV_network.compile(loss='mean_squared_error', optimizer=optimizer)
 initial_weights_CV = CV_network.get_weights()
 CV = KFold(n_splits=n_folds, random_state = random_state_CV)
 MASE_CV_L = []
 loss_CV_L = []

 for train, test in CV.split(X_train, y_train):
     shuffle_weights(CV_network, initial_weights_CV)
     mdl_CV = CV_network.fit(X_train[train], y_train[train], nb_epoch=epochs, batch_size=batch_size)

     y_CV_pred = CV_network.predict(X_train[test], batch_size=batch_size, verbose=0).flatten()
     MASE_CV = MASE(y_train[test], y_CV_pred, 48)
     MASE_CV_L.append(MASE_CV)
     loss_CV = mdl_CV.history['loss'][-1]
     loss_CV_L.append(loss_CV)
 
 MASE_CV = np.average(MASE_CV_L)
 loss_CV = np.average(loss_CV_L)
 
 # training network on all training data 
 print('Training prediction network')
 network = get_model()
 network.compile(loss='mean_squared_error', optimizer=optimizer) 
 initial_weights_net = network.get_weights()
 
 shuffle_weights(network, initial_weights_net)
 mdl_net = network.fit(X_train, y_train, nb_epoch = epochs, batch_size = batch_size)
 
 y_pred_train = network.predict(X_train,batch_size = batch_size, verbose=0).flatten()
 y_pred_test = network.predict(X_test,batch_size = batch_size, verbose=0).flatten()
 y_pred = network.predict(X_all,batch_size = batch_size, verbose=0).flatten()
 
 error_train = y_pred_train - y_train
 error_test = y_pred_test - y_test 
 error = y - y_pred
 abs_error = abs(error) 
 
 MASE_train = MASE(y_train, y_pred_train, 48)
 MASE_test = MASE(y_test, y_pred_test, 48)
 MASE_all = MASE(y, y_pred, 48)
 
 MAE_train = mean_absolute_error(y_train, y_pred_train).eval(session=sess)
 MAE_test = mean_absolute_error(y_test, y_pred_test).eval(session=sess)
 MAE_all = mean_absolute_error(y, y_pred).eval(session=sess)
 
 results_DF.loc[mdl,'Model name'] = mdl
 results_DF.loc[mdl,'CV MASE'] = MASE_CV
 results_DF.loc[mdl,'Training MASE'] = MASE_train
 results_DF.loc[mdl,'Test MASE'] = MASE_test
 results_DF.loc[mdl,'MASE'] = MASE_all
 
 results_DF.loc[mdl,'Training MAE'] = MAE_train
 results_DF.loc[mdl,'Test MAE'] = MAE_test
 results_DF.loc[mdl,'MAE'] = MAE_all
 
 results_DF.loc[mdl,'CV Loss'] = loss_CV
 results_DF.loc[mdl,'Training Loss'] = mdl_net.history['loss'][-1]

 results_DF.loc[mdl,'Error Time Series'] = error
 results_DF.loc[mdl,'Training History'] = mdl_net.history['loss']

 history_DF = pd.DataFrame(data=list(mdl_net.history.values())[0], index = list(range(1,epochs+1)), columns=[mdl])
 
 # figure 1 - plotting the actual versus prediction 
 actual_imba_price_G = go.Scatter(x=index,y=y,name='Actual',line=dict(width=2)) 
 predicted_imba_price_G = go.Scatter(x=index,y=y_pred,name='Predicted',line=dict(width=2, dash = 'dash'))
 fig1_data = [actual_imba_price_G, predicted_imba_price_G]
 fig1_layout = go.Layout(title='Forecast',yaxis=dict(title='Imbalance Price [£/MWh]'))
 fig1 = go.Figure(data=fig1_data,layout=fig1_layout) 
 fig1_name = os.path.join(exp_path,'Figure 1.html')
 py.offline.plot(fig1,filename = fig1_name, auto_open = False) # creating offline graph
 # py.plotly.plot(fig1, filename='Forecast', sharing='public') # creating online graph
 
 # saving results
 network_params_DF = pd.DataFrame(pred_net_params, index=[mdl])
 mdl_params_DF = pd.DataFrame(mdl_params, index=[mdl])
 results_DF = pd.concat([results_DF, network_params_DF, mdl_params_DF], axis = 1, join = 'inner')
 results_all_DF = pd.concat([results_all_DF, results_DF], axis = 0)
 history_all_DF = history_all_DF.join(history_DF, how='outer')

print(results_all_DF) 

results_all_DF.to_csv(results_csv_path) 
history_all_DF.to_csv(history_csv_path) 

# figure 2 - comparing training history of models
fig2_histories = [history_all_DF[col] for col in history_all_DF]
fig2_data = [go.Scatter(x = data.index, y = data, name=data.name) for data in fig2_histories]
fig2_layout = go.Layout(title='Training History',yaxis=dict(title='Loss'),xaxis=dict(title='Epochs'))
fig2 = go.Figure(data=fig2_data,layout=fig2_layout)
fig2_name = os.path.join(run_path,'Figure 2.html')
py.offline.plot(fig2,filename = fig2_name, auto_open = False) # creating offline graph

 

 

Elexon API Web Scraping using Python – Updated

This post is part of a series applying machine learning techniques to an energy problem.  The goal of this series is to develop models to forecast the UK Imbalance Price. 

This post is an update to an earlier post in this series showing how to use Python to access data from Elexon through their API.  As with the previous script, you can grab data for any report Elexon offers by iterating through a dictionary of reports and their keyword arguments.
This script solves two problems that occur with the Elexon data – duplicate and missing Settlement Periods.  You can view the script on my GitHub here.

Improvements

Duplicate Settlement Periods are removed using the drop_duplicates DataFrame method in pandas.
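For example, a minimal sketch of the idea (not the exact code from the repository – the column names are assumptions):

import pandas as pd

# hypothetical raw data returned from the Elexon API
raw = pd.DataFrame({
    'settlementDate': ['2016-01-01', '2016-01-01', '2016-01-01'],
    'settlementPeriod': [1, 1, 2],  # period 1 appears twice
    'imbalancePriceAmountGBP': [20.0, 20.0, 25.0]})

# keep only the first occurrence of each (date, period) pair
clean = raw.drop_duplicates(subset=['settlementDate', 'settlementPeriod'], keep='first')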

Missing Settlement Periods are dealt with by first creating an index object of the correct length (SP_DF).  Note that this takes into account daylight saving days (where the correct length of the index is either 46 or 50) using the transition times feature of the pytz module.  The transition times identify the dates on which the daylight saving time changes occur – very helpful!
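A rough sketch of how the expected number of Settlement Periods can be worked out – the helper name is mine and the transition-times attribute is a pytz implementation detail, so treat this as illustrative rather than the exact repo code:

import pytz

LONDON = pytz.timezone('Europe/London')
# dates on which the UK clocks change, taken from pytz's transition times
TRANSITION_MONTHS = {t.date(): t.month for t in LONDON._utc_transition_times[1:]}

def expected_periods(day):
    """Number of Settlement Periods expected on a given date."""
    month = TRANSITION_MONTHS.get(day)
    if month == 3:
        return 46  # clocks go forward - short day
    if month == 10:
        return 50  # clocks go back - long day
    return 48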

The SP_DF is then joined (using an ‘outer join’) with the data returned from Elexon – meaning that any missing Settlement Periods are identified.  Any missing values are filled in with the average value for that day.
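The join and fill step might then look something like this, reusing the expected_periods helper sketched above (the column names and the per-day grouping are assumptions):

import pandas as pd

def fill_missing_periods(day, day_data):
    """Outer join a full Settlement Period index with the Elexon data for one day,
    then fill any gaps with that day's average value."""
    SP_DF = pd.DataFrame(index=pd.Index(range(1, expected_periods(day) + 1),
                                        name='settlementPeriod'))
    joined = SP_DF.join(day_data.set_index('settlementPeriod'), how='outer')
    return joined.fillna(joined.mean())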

GitHub

I’m going to start using GitHub as a way to manage this project – you can find the script to scrape Elexon data as well as a SQL database with 2016 data for the Imbalance Price and Imbalance Volume on my UK Imbalance Price Forecasting repository.

I’ve also put up another small script, database checking.py, which I use to check the quality of the SQL database.  This script should probably be built into the scraping script!  However I’ve decided to spend more time analyzing and building models 🙂

The next posts in this series will detail some dramatically improved models for predicting the Imbalance Price – making use of Keras, TensorFlow and Plotly.

Tuning regularization strength

This post is the fifth in a series applying machine learning techniques to an energy problem.  The goal of this series is to teach myself machine learning by developing models to forecast the UK Imbalance Price.

I see huge promise in what machine learning can do in the energy industry.  This series details my initial efforts in gaining an understanding of machine learning.  

Part One – What is the UK Imbalance Price?
Part Two – Elexon API Web Scraping using Python
Part Three – Imbalance Price Visualization
Part Four – Multilayer Perceptron Neural Network

In the previous post in this series we introduced a Multilayer Perceptron neural network to predict the UK Imbalance Price.  This post digs a bit deeper into controlling the degree of overfitting of our model by tuning the strength of regularization.

What is regularization

Regularization is a tool used to combat the problem of overfitting a model.  Overfitting occurs when a model starts to fit the training data too well – meaning that performance on unseen data is poor.

To prevent overfitting to the training data we can try to keep the model parameters small using regularization.  If we include a regularization term in the cost function the model minimizes we can encourage the model to use smaller parameters.

The equation below shows the loss function minimized during model training.  The first term is the square of the error.  The second term is the regularization term – with lambda shown as the parameter to control regularization.  In order to be consistent with scikit-learn, we will refer to this parameter as alpha.
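Written out with the usual 1/(2m) scaling (the exact constants are a matter of convention and an assumption on my part), the regularized loss for a model with parameters theta is:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

where m is the number of training samples, h_theta(x) is the model prediction and lambda (alpha in scikit-learn) controls the strength of the penalty.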

Regularization penalizes large values of the model parameters (theta) based on the size of the regularization parameter.  Regularization comes in two flavours – L1 and L2.  The MLP Regressor model in scikit-learn uses L2 regularization.

Setting alpha too large will result in underfitting (also known as a high bias problem).  Setting alpha too small may lead to overfitting (a high variance problem).

Setting alpha in the UK Imbalance Price model

Here we will optimize alpha by iterating through a number of different values.

We can then evaluate the degree of overfitting by looking at how alpha affects the loss function and the Mean Absolute Scaled Error (MASE).  The loss function is the cost function the model minimizes during training.  The MASE is the metric we used to judge model performance.
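As a concrete reference, one possible implementation of the MASE – assuming the naive benchmark is the value from the same Settlement Period one day (48 half-hourly periods) earlier, which may differ from the exact benchmark used in this series:

import numpy as np

def MASE(y_true, y_pred, seasonal_period=48):
    """Mean Absolute Scaled Error against a seasonal naive forecast."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    model_error = np.mean(np.abs(y_true - y_pred))
    # naive forecast = the value one seasonal period (one day) earlier
    naive_error = np.mean(np.abs(y_true[seasonal_period:] - y_true[:-seasonal_period]))
    return model_error / naive_error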

We use K-fold cross validation to get a sense of the degree of overfitting – comparing cross validation performance with training performance shows how much the model is overfitting.  Using K-fold cross validation also leaves the test data free for evaluating model performance only.
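A sketch of the tuning loop is shown below – the alpha grid, the training arrays X_train and y_train, and the reuse of the MASE helper above are all assumptions rather than the exact script behind the figures:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold

alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]
results = {}

for alpha in alphas:
    cv_scores = []
    for train_idx, val_idx in KFold(n_splits=5).split(X_train):
        mlp = MLPRegressor(hidden_layer_sizes=(1344,) * 5, alpha=alpha, max_iter=500)
        mlp.fit(X_train[train_idx], y_train[train_idx])
        cv_scores.append(MASE(y_train[val_idx], mlp.predict(X_train[val_idx]), 48))
    results[alpha] = np.mean(cv_scores)

best_alpha = min(results, key=results.get)  # alpha with the lowest average CV MASE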

Figures 1 & 2 show the results of optimizing alpha for an MLP Regressor with five hidden layers of 1344 nodes each.  The input feature set is the previous one week of Imbalance Price data.

Figure 1 shows the effect of alpha on the loss function for the training and cross validation sets.  We would expect to see the training loss increase as alpha increases, as small values of alpha should allow the model to overfit.  We would also expect the loss for the training and CV sets to converge as alpha gets large.

Figure 1 – The effect of alpha on the loss function

Figure 1 shows the expected trend of the training loss increasing as alpha increases – except for alpha = 0.0001, which shows a high training loss.  This I don’t understand!  I was expecting the training loss to decrease with decreasing alpha.

Figure 2 shows the effect of alpha on the Mean Absolute Scaled Error for the training, cross validation and test sets.

Figure 2 – The effect of alpha on the MASE for the training, cross validation and test data

Figure 2 also shows a confusing result.  I was expecting the training MASE to be at its minimum for the smallest alpha and to increase as alpha increased, because small values of alpha should allow the model to overfit (and so improve training performance).  Instead we see that the best training MASE is at alpha = 0.01.

Figure 2 shows a minimum for the test MASE at alpha = 0.01 – this is also the minimum for the training data.

Going forward I will be using a value of 0.01 for alpha as this shows a good balance between minimizing the loss for the training and cross validation sets.

Table 1 shows the results for the model as it currently stands.

Table 1 – Model performance with alpha = 0.01

Training MASE           0.3345
Cross validation MASE   0.5890
Test MASE               0.5212

The next step in this project is looking at previously unseen data for December 2016 – stay tuned.