Category Archives: Reinforcement Learning

Using reinforcement learning to control battery storage

If you just want to skip to the code:
– the example in this post was generated from this notebook
– the DQN agent
– the parameterization of Q(s,a) using TensorFlow
– the battery environment

Using reinforcement learning to control energy systems is the application of machine learning I’m most excited about.

In this post I’ll show how an agent using the DQN algorithm can learn to control electric battery storage.


importance of battery storage

One of the key issues with wind and solar is intermittency. Solar only generates when it’s sunny, wind only generates when it’s windy, but we demand access to electricity 24/7.

One way to manage this intermittency is with flexible demand. This is what we are doing at Tempus Energy, where we are using machine learning to unlock the value of flexible electricity assets.

Another way is with energy storage. The ability to store excess generation for when the grid needs it allows intermittent generation to be matched against demand.

But energy storage is a double-edged sword – it can actually end up doing more harm than good. The 2016 Southern California Gas Company Advanced Energy Storage Impact report shows that commercial storage projects actually increased 2016 carbon emissions by 726 tonnes of carbon.

Part of this is driven by misaligned incentives. Even with aligned incentives, dispatching a battery is still challenging. Getting batteries to support the grid requires progress in multiple areas:
– decreasing total system installation costs
– aligning incentives with prices that reflect the value of the battery to the grid
– intelligent operation

This work supports operating battery storage with intelligence.

reinforcement learning in energy

Reinforcement learning can be used to intelligently operate energy systems. In this post I show how a reinforcement learning agent based on the Deep Q-Learning (DQN) algorithm can learn to control a battery.

It's a simplified problem. The agent is given a perfect forecast of the electricity price, and the electricity price itself is a repetitive profile. It's still very exciting to see the agent learn!

In reality an agent is unlikely to receive a perfect forecast. I expect learning a more realistic and complex problem would require a combination of:
– tuning hyper parameters
– a higher capacity or differently structured neural network for approximating value functions and policies
– more steps of experience
– a different algorithm (A3C, PPO, TRPO, C51 etc.)
– learning an environment model

The agent and environment I used to generate these results are part of an open source Python library. energy_py is a collection of reinforcement learning agents, energy environments and tools to run experiments.

I’ll go into a bit of detail about the agent and environment below. The notebook used to generate the results is here.

the agent – DQN

DeepMind's early work with Q-Learning and Atari games is foundational in modern reinforcement learning.  The use of a deep convolutional neural network allowed the agent to learn from raw pixels (known as end-to-end deep learning).  The use of experience replay and target networks improved learning stability, and produced agents that could generalize across a variety of different games.

The initial 2013 paper (Mnih et al. 2013) was so significant that in 2014 DeepMind was purchased by Google for around £400M.  This was for a company with no product, no revenue, no customers and only a handful of employees.

The DQN algorithm used in the second DeepMind Atari paper (Mnih et al. 2015) is shown below.

Figure 1 – DQN algorithm as given by Mnih et al. (2015) – I've added annotation in green.

In Q-Learning the agent learns to approximate the expected discounted return for each action. The optimal action is then selected by taking the argmax of Q(s,a) over all possible actions. This argmax operation allows Q-Learning to learn off-policy – to learn from experience generated by other policies.
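
As a small numpy sketch of both ideas – the numbers here are made up:

import numpy as np

# Q(s, a) estimates for one state across a discrete action space
q_values = np.array([1.2, -0.4, 3.1, 0.7])
greedy_action = np.argmax(q_values)  # the argmax selects the optimal action

# one-step Q-Learning target for a transition (s, a, r, s')
reward, gamma = 10.0, 0.99
q_values_next = np.array([0.5, 2.0, 1.1, -0.3])  # Q(s', a) for each action
target = reward + gamma * np.max(q_values_next)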

Experience replay makes the training data more independent and identically distributed by sampling randomly from the experience of previous policies. It is also possible to use human-generated experience with experience replay. Experience replay is possible because Q-Learning is an off-policy learning algorithm.
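
A minimal sketch of an experience replay memory, assuming experience is stored as (state, action, reward, next_state, done) tuples – not the energy_py implementation:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # the oldest experience is discarded first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)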

A target network improves learning stability by creating the Bellman targets used in training from an older copy of the online Q(s,a) network. The target network can either be updated by copying the weights over every n steps or by using a weighted average of previous parameters.
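
A sketch of the two update schemes, written over plain numpy weight arrays rather than TensorFlow variables:

import numpy as np

def hard_update(online_weights):
    # copy the online network weights into the target network every n steps
    return [np.copy(w) for w in online_weights]

def soft_update(online_weights, target_weights, tau=0.001):
    # weighted average of previous parameters (Polyak averaging)
    return [tau * online + (1 - tau) * target
            for online, target in zip(online_weights, target_weights)]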

One of the issues with Q-Learning is the requirement for a discrete action space. In this example I discretize the action space into 100 actions. The trade-off with discretization is:
– too few discrete actions = coarse control
– too many discrete actions = computational expense
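
As a sketch, discretizing a single continuous action dimension might look like this – the bounds and bin count are illustrative, not the energy_py settings:

import numpy as np

low, high = 0.0, 2.0  # e.g. a charge rate between 0 and 2 MW
num_actions = 20  # more bins = finer control, but more outputs for the network to learn

discrete_actions = np.linspace(low, high, num_actions)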

I use a neural network to approximate Q(s,a), with TensorFlow providing the machinery for building and improving this simple two layer network. Even though I'm using the DQN algorithm, I'm not using a particularly deep neural network.

I use ReLUs between the layers and no batch normalization. The network inputs are preprocessed by removing the mean and scaling by the standard deviation, and the targets by min-max normalization, using energy_py Processor objects. I use the Adam optimizer with a learning rate of 0.0025.
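
As a rough sketch of a network like this in tf.keras – not the energy_py TensorFlow code, and the layer sizes are illustrative:

import tensorflow as tf

def build_q_network(obs_dim, num_actions, hidden=(64, 64)):
    # two hidden layers with ReLU activations, one output node per discrete action
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden[0], activation='relu', input_shape=(obs_dim,)),
        tf.keras.layers.Dense(hidden[1], activation='relu'),
        tf.keras.layers.Dense(num_actions, activation='linear')
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0025),
                  loss='mse')
    return model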

The network has one output node per discrete action – with 5 discrete actions per action dimension, there are 10 discrete actions and 10 output nodes in total.

There are a number of other hyperparameters to tune such as the rate of decay of epsilon for exploration and how frequently to update the target network to keep learning stable. I set these using similar ratios to the 2015 DeepMind Atari paper (adjusting the ratios for the total number of steps I train for each experiment).

Figure 2 – DQN hyperparameters from Mnih et al. (2015).

the environment – battery storage

The battery storage environment I've built covers the application of storing cheap electricity and discharging it when it's expensive (price arbitrage). This isn't the only application of battery storage – Tesla's 100 MW, 129 MWh battery in South Australia is being used for fast frequency response with impressive results.

I've tried to make the environment as Markov as possible – given a perfect forecast enough steps ahead, I think the battery storage problem is close to Markov. The challenge in practice comes from having to use imperfect price forecasts.

The state space for the environment is the true price of electricity and the charge of the battery at the start of the step. The electricity price follows a fixed profile defined in state.csv.

The observation space is a perfect forecast of the electricity price five steps ahead. The number of steps ahead required for the Markov property will depend on the profile and the discount rate.
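
As a sketch, the observation at a given step might be built by slicing the price profile – the prices below are illustrative, not the profile in state.csv:

import numpy as np

# illustrative half-hourly price profile (£/MWh)
prices = np.array([30.0, 28.0, 25.0, 60.0, 95.0, 80.0, 40.0, 35.0])

step, horizon = 2, 5
observation = prices[step:step + horizon]  # perfect price forecast five steps ahead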

The action space is a one dimensional array with two elements – the first being the rate of charge and the second the rate of discharge. The net effect of the action on the battery is the difference between the two.

The reward is the net rate of charge or discharge multiplied by the current price of electricity. The rate is net of an efficiency penalty applied to charging electricity. At a 90% efficiency a charge rate of 1 MW for one hour would result in only 0.9 MWh of electricity stored in the battery.
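
To make the reward arithmetic concrete, here is a hypothetical helper showing one reading of the description above – not the energy_py environment code:

def step_reward(charge_mw, discharge_mw, price, efficiency=0.9, hours=1.0):
    # charging incurs the efficiency penalty - 1 MW for one hour at 90% stores 0.9 MWh
    stored_mwh = charge_mw * hours * efficiency

    # the net rate of charge/discharge is valued at the current electricity price
    net_mw = discharge_mw - charge_mw
    reward = net_mw * hours * price
    return reward, stored_mwh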


The optimal operating strategy for energy storage is very application dependent. Given the large number of potential applications of storage this means a large number of optimal operating patterns are likely to exist.

The great thing about using reinforcement learning to learn these patterns is that we can use the same algorithm to learn any pattern. Building virtual environments for all these different applications is the first step in proving this.

Below is an HTML copy of the Jupyter Notebook used to run the experiment. You can see the same notebook on GitHub here.

further work

Building the energy_py library is the most rewarding project in my career so far. I’ve been working on it for around one year, taking inspiration from other open source reinforcement learning libraries and improving my Python & reinforcement learning understanding along the way. My TODO list for energy_py is massive!

There is lots of work to do to make the DQN code run faster – no doubt I'm making silly mistakes! I'm using two separate Python classes for the online and target networks – it might be more efficient to have both networks as part of the same object. I also need to think about combining graph operations. Prioritized experience replay is another option to improve sample efficiency.

Less Markov and more realistic state and observation spaces – giving the agent imperfect forecasts. Running multiple experiments across different random seeds.

Testing the ability to generalize to unseen profiles. This is the most important one – the current agent may simply be memorizing what to do, rather than understanding the dynamics of the MDP.

I've also just finished building a Deterministic Policy Gradient agent, which I'm looking forward to playing around with.

Thanks for reading!

A Glance at Reinforcement Learning

One of my professional highlights of 2017 has been teaching an introductory reinforcement learning course – A Glance at Reinforcement Learning. You can find the course materials on GitHub.

This one day course is aimed at data scientists with a grasp of supervised machine learning but no prior understanding of reinforcement learning.

Course scope
– introduction to the fundamental concepts of reinforcement learning
– value function methods
dynamic programming, Monte Carlo, temporal difference, Q-Learning, DQN
– policy gradient methods
score function, REINFORCE, advantage actor-critic, A3C
– AlphaGo
– practical concerns
reward scaling, mistakes I’ve made, advice from Vlad Mnih & John Schulman
– literature highlights
distributional perspective, auxiliary loss functions, inverse RL

I’ve given this course to three batches at Data Science Retreat in Berlin and once to a group of startups from Entrepreneur First in London.  Each time I’ve had great questions, kind feedback and improved my own understanding.

I also meet great people – it’s the kind of high-quality networking that is making a difference in my career. I struggle with ‘cold networking’ (i.e. drinks after a Meetup). Teaching and blogging are much better at creating meaningful professional connections.

I’m not an expert in reinforcement learning – I’ve only been studying the topic for a year. I try to use this to my advantage – I can remember what I struggled to understand, which helps design the course to get others up to speed quicker.

If you are looking to develop your understanding of reinforcement learning, the two best places to start are Reinforcement Learning: An Introduction (Sutton & Barto) and David Silver’s lecture series on YouTube.

The course complements the development of energy_py – an energy-focused reinforcement learning library.

I’d like to thank Jose Quesada and Chris Armbruster for the opportunity to teach at Data Science Retreat.  I’d also like to thank Alex Appelbe and Bashir Beikzadeh of Metis Labs for the opportunity to teach at Entrepreneur First.

energy_py update – July 2017

energy_py is a collection of reinforcement learning agents and environments for energy systems. You can read the introductory blog post for the project and check out the repo on GitHub.

Saving of memory and value function after each episode

This quality of life improvement has a major impact on the effectiveness of training agents using energy_py. It means an agent can keep learning from experience that occurred during a different training session.

As I train models on my local machine, I can often only dedicate enough time for 10 episodes of training. Saving the memory and value functions allows an agent to learn from hundreds of episodes without having to train on all of them in a single run.

Running each episode on a different time series

Agents are now trained on randomly selected weeks of the year. It's much more useful for an agent to experience two different weeks of CHP operation than to experience the same week over and over again. It should also help the agent generalize to operating data sets it hasn't seen before.

Double Q-Learning

Building another agent has been on the energy_py to-do list for a long time. I've built a Double Q-Learner, based on the algorithm given in Sutton & Barto. The key extension in Double Q-Learning is to maintain two value functions.

The policy is generated using the average of the estimates from both Q networks. One network is then randomly selected for training, using a target created by the other network.

The thinking behind Double Q-Learning is that it avoids the maximization bias of Q-Learning. A positive bias is caused by using a maximization operation to estimate the value of the next state – the max leads to over-optimistic estimates of state-action values.
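
A sketch of the tabular Double Q-Learning update along the lines of Sutton & Barto, with defaultdicts standing in for the two value functions – not the energy_py agent:

import random
from collections import defaultdict

Q1, Q2 = defaultdict(float), defaultdict(float)

def double_q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # randomly pick one estimate to update, valuing the next state with the other
    A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
    best_next = max(actions, key=lambda act: A[(s_next, act)])
    target = r + gamma * B[(s_next, best_next)]
    A[(s, a)] += alpha * (target - A[(s, a)])

def greedy_action(s, actions):
    # the policy acts on the average of the two estimates
    return max(actions, key=lambda act: 0.5 * (Q1[(s, act)] + Q2[(s, act)]))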

Next major tasks are:
1 – build a policy gradient method – most likely a Monte Carlo policy gradient algorithm,
2 – build a demand side response environment.

Thanks for reading!

A Glance at Q-Learning

‘A Glance at Q-Learning’ is a talk I recently gave at the Data Science Festival in London. The talk was one I also gave in Berlin at the Berlin Machine Learning group.

Q-Learning is a reinforcement learning algorithm that DeepMind used to play Atari games – work which some call the first step towards a general artificial intelligence. The original 2013 paper is available here (I cover this paper in the talk).

It was a wonderful experience being able to present – I recommend checking out more of the talks on the Data Science Festival YouTube – all of which are higher quality, more interesting and better presented than mine!

You can download a copy of my slides here – A Glance at Q-Learning slides.

Thanks for reading!

energy_py – reinforcement learning for energy systems

If you just want to skip to the code, the energy_py library is here.

energy_py is reinforcement learning for energy systems.  

Using reinforcement learning agents to control virtual energy environments is the first step towards using reinforcement learning to optimize real-world energy systems – a professional mission of mine.

energy_py supports this goal by providing a collection of reinforcement learning agents, energy environments and tools to run experiments.

What is reinforcement learning

supervised vs unsupervised vs reinforcement

Reinforcement learning is the branch of machine learning where an agent learns to interact with an environment.  Reinforcement learning can give us generalizable tools to operate our energy systems at superhuman levels of performance.

It’s quite different from supervised learning. In supervised learning we start out with a big data set of features and our target. We train a model to replicate this target from patterns in the data.

In reinforcement learning we start out with no data. The agent generates data (sequences of experience) by interacting with the environment, and uses its experience to learn how to interact with it. In reinforcement learning we not only learn patterns from data, we also generate our own data.
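
A sketch of the loop that generates this experience, using a gym-style interface – agent and env here are placeholders, not energy_py objects:

def run_episode(env, agent):
    # the agent generates its own data by acting in the environment
    observation = env.reset()
    done, episode_reward = False, 0.0
    while not done:
        action = agent.act(observation)
        next_observation, reward, done, info = env.step(action)
        agent.remember(observation, action, reward, next_observation, done)
        observation = next_observation
        episode_reward += reward
    return episode_reward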

This makes reinforcement learning more democratic than supervised learning. The reliance on massive amounts of labelled training data gives companies with unique datasets an advantage. In reinforcement learning all that is needed is an environment (real or virtual) and an agent.

If you are interested in reading more about reinforcement learning, the course notes from a one-day introductory course I teach are hosted here.

Why do we need reinforcement learning in energy systems

Optimal operation of energy assets is already very challenging. Our current energy transition makes this difficult problem even harder.

The rise of intermittent generation is introducing uncertainty on both the generation and demand sides. The rise of distributed generation is increasing the number of actions available to operators.

For a wide range of problems, machine learning results are both state of the art and better than human experts. Reinforcement learning can bring this level of performance to our energy systems.

Today many operators dispatch assets using rules or abstract models. A set of rules cannot guarantee optimal operation in many energy systems.

Optimal operating strategies can be developed from abstract models, yet abstract models (such as linear programming) are often constrained and limited to approximations of the actual plant. They also require a significant amount of bespoke effort from an engineer to set up and validate. Reinforcement learners, by contrast, are able to learn directly from their experience of the actual plant.

With reinforcement learning we can make use of the ability of a single agent to generalize across a number of different environments. This means we can use the same agent both to learn how to control a battery and to dispatch flexible demand. It's a much more scalable solution than developing heuristics site by site or building an abstract model for each site.


There are challenges to be overcome. The first and most important is safety. Safety is the number one concern in any engineering discipline.

I believe reinforcement learning should first be applied at as high a level of the control system as possible. This limits the number of actions available to the agent and allows existing lower-level safety & control systems to remain in place. The agent is limited to making only the high-level decisions that operators make today.

There is also the possibility of designing the reward function to incentivize safety. A well-designed reinforcement learner could actually reduce hazards to operators, while also freeing up more of their time for maintenance.

A final challenge worth addressing is the impact such a learner could have on employment. Machine learning is not a replacement for human operators – a reinforcement learner would not need a reduction in headcount to be a good investment.

The value of using a reinforcement learner is to let operations teams do their jobs better. It allows them to spend more time on their remaining responsibilities, such as maintaining the plant, and to do them better. The value created is a better-maintained plant and a happier workforce – in a plant operating at superhuman levels of economic and environmental performance.

Any machine requires downtime – a reinforcement learner is no different. There will still be time periods where the plant will operate in manual or semi-automatic modes with human guidance.

energy_py is one step on a long journey of getting reinforcement learners helping us in the energy industry. The fight against climate change is the greatest that humanity faces. Reinforcement learning will be a key ally in fighting it. You can checkout the repository on GitHub here.


The best place to take a look at the library is the example of using Q-Learning to control a battery. The example is well documented in this Jupyter Notebook and this blog post.

My reinforcement learning journey

I’m a chemical engineer by training (B.Eng, MSc) and an energy engineer by profession. I’m really excited about the potential of machine learning in the energy industry – in fact that’s what this blog is about!

My understanding of reinforcement learning has come from a variety of resources. I’d like to give credit to all of the wonderful resources I’ve used to understand reinforcement learning.

Sutton & Barto – Reinforcement Learning: An Introduction – the bible of reinforcement learning and a classic machine learning text.

Playing Blackjack with Monte Carlo Methods – I built my first reinforcement learning model to operate a battery using this post as a guide. This post is part two of an excellent three part series. Many thanks to Brandon of Δ ℚuantitative √ourney.

RL Course by David Silver – over 15 hours of lectures from Google DeepMind's David Silver. An amazing resource from a brilliant mind and a brilliant teacher.

Deep Q-Learning with Keras and gym – a great blog post that showcases code for a reinforcement learning agent controlling an OpenAI Gym environment. Useful both for the gym integration and for using Keras to build a non-linear value function approximation. Many thanks to Keon Kim – check out his blog here.

Artificial Intelligence and the Future – Demis Hassabis is the co-founder and CEO of Google DeepMind.  In this talk he gives some great insight into the AlphaGo project.

Mnih et al. (2013) Playing Atari with Deep Reinforcement Learning – to give you an idea of the importance of this paper, Google purchased DeepMind after it was published.  DeepMind was a company with no revenue, no customers and no product – valued by Google at $500M!  This is a landmark paper in reinforcement learning.

Mnih et al. (2015) Human-level control through deep reinforcement learning – an update to the 2013 paper, published in Nature.

I would also like to thank Data Science Retreat.  I’m just finishing up the three month immersive program – energy_py is my project for the course.  Data Science Retreat has been a fantastic experience and I would highly recommend it.  The course is a great way to invest in yourself, develop professionally and meet amazing people.

Monte Carlo Q-Learning to Operate a Battery

I have a vision for using machine learning for optimal control of energy systems.  If a neural network can play a video game, hopefully it can understand how to operate a power plant.

In my previous role at ENGIE I built Mixed Integer Linear Programming models to optimize CHP plants.  Linear Programming is effective in optimizing CHP plants but it has limitations.

I'll detail these limitations in a future post – this post is about Reinforcement Learning (RL), a tool that can address some of the limitations inherent in Linear Programming.

In this post I introduce the first stage of my own RL learning process. I’ve built a simple model to charge/discharge a battery using Monte Carlo Q-Learning. The script is available on GitHub.

I made use of two excellent blog posts to develop this.  Both of these posts give a good introduction to RL:

Features of the script

As I don’t have access to a battery system I’ve built a simple model within Python.  The battery model takes as inputs the state at time t, the action selected by the agent and returns a reward and the new state.  The reward is the cost/value of electricity charged/discharged.

def battery(state, action):  # the technical model
    # the battery can choose to:
    #    discharge 10 MWh (action = 0)
    #    charge 10 MWh (action = 1)
    #    do nothing (action = 2)

    charge = state[0]  # our current charge level
    SP = state[1]  # the current settlement period
    prices = getprices()  # defined elsewhere in the script - prices indexed by settlement period
    price = prices[SP - 1]  # the price in this settlement period

    if action == 0:  # discharging
        new_charge = max(0, charge - 10)
        charge_delta = charge - new_charge  # positive when discharging
        reward = charge_delta * price  # revenue from discharging
    elif action == 1:  # charging
        new_charge = min(100, charge + 10)
        charge_delta = charge - new_charge  # negative when charging
        reward = charge_delta * price  # negative reward = cost of charging
    else:  # do nothing
        charge_delta = 0
        reward = 0

    new_charge = charge - charge_delta
    new_SP = SP + 1
    state = (new_charge, new_SP)
    return state, reward, charge_delta
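
A minimal sketch of stepping this model through a few settlement periods with a random policy – purely illustrative, and it assumes getprices() is defined as in the full script:

import random

state = (0, 1)  # start with an empty battery in settlement period 1
for _ in range(5):
    action = random.choice([0, 1, 2])  # a random policy, for illustration only
    state, reward, charge_delta = battery(state, action)
    print(state, reward)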

The price of electricity varies throughout the day.
The model is not fed this data explicitly – instead it learns through interaction with the environment.
One 'episode' is equal to one day (48 settlement periods).  The model runs through thousands of episodes and learns the value of taking each action in each state.
Learning occurs by apportioning the total reward for the episode to every state/action pair that occurred during that episode. While this method works, more advanced methods do it in better ways.

def updateQtable(av_table, av_count, returns):
    # update the action-value (Q) table using an incremental average
    for key in returns:
        # move the estimate towards the return observed for this state/action
        av_table[key] = av_table[key] + (1 / av_count[key]) * (returns[key] - av_table[key])
    return av_table

The model uses an epsilon-greedy method for action selection.  Epsilon is decayed as the number of episodes increases.
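
A sketch of what that might look like, assuming the Q table is keyed by (state, action) tuples – the decay constants are illustrative, not the values used in the script:

import random

def select_action(av_table, state, actions, epsilon):
    # explore with probability epsilon, otherwise act greedily on the Q table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: av_table[(state, a)])

# decay epsilon as the number of episodes increases
epsilon, decay, min_epsilon = 1.0, 0.999, 0.05
for episode in range(5000):
    epsilon = max(min_epsilon, epsilon * decay)
    # ... run the episode, selecting actions with select_action
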
Figure 1 below shows the optimal dispatch for the battery model after training for 5,000 episodes.
Figure 1 – Electricity prices [£/MWh] and the optimal battery dispatch profile [%]
I'm happy the model is learning well. Charging occurs during periods of low electricity prices, and the battery is fully drained at the end of the day – logical behavior to maximise the reward per episode.

Figure 2 below shows the learning progress of the model.

Figure 2 – Model learning progress
Next steps
Monte Carlo Q-Learning is a good first step into RL. It's helped me to start to understand some of the key concepts.
Next steps will be developing more advanced Q-learning methods using neural networks.