*If you just want to skip to the code*

– the example in this post was generated from this notebook

– the DQN agent

– the parameterization of *Q(s,a)* using Tensorflow

– the battery environment

Using **reinforcement learning** to **control energy systems** is the application of machine learning I’m most excited about.

In this post I’ll show how an agent using the DQN algorithm can learn to control electric battery storage.

### importance of battery storage

One of the key issues with wind and solar is intermittency. Solar only generates when it’s sunny, wind only generates when it’s windy, but we demand access to electricity 24/7.

**One way to manage this intermittency is with flexible demand**. This is what we are doing at Tempus Energy, where we are using machine learning to unlock the value of flexible electricity assets.

**Another way is with energy storage**. The ability to store excess for when the grid needs it allows intermittent generation to be matched against demand.

But energy storage is a double-edged sword – it can actually end up doing more harm than good. The 2016 Southern California Gas company Advanced Energy Storage Impact report shows that commercial storage projects actually increased 2016 carbon emissions by 726 tonC.

Part of this is driven by misaligned incentives. Even with aligned incentives, dispatching a battery is still challenging. Getting batteries to support the grid requires progress in multiple areas:

– decreasing total system installation costs

– aligning incentives with prices that reflect the value of the battery to the grid

– intelligent operation

This work supports operating battery storage with intelligence.

### reinforcement learning in energy

Reinforcement learning can be used to intelligently operate energy systems. In this post I show how a reinforcement learning agent based on the Deep Q-Learning (DQN) algorithm can learn to control a battery.

It’s a simplified problem. The agent is given a perfect forecast of the electricity price, and the electricity price itself is a repetitive profile. It’s still very exciting to see the agent learn!

In reality an agent is unlikely to receive a perfect forecast. I expect learning a more realistic and complex problem would require a combination of:

– tuning hyper parameters

– a higher capacity or differently structured neural network to approximate value functions and policies

– more steps of experience

– a different algorithm (A3C, PPO, TRPO, C51 etc.)

– learning an environment model

The agent and environment I used to generate these results are part of an open source Python library. **energy_py is a collection of reinforcement learning agents, energy environments and tools to run experiments.**

I’ll go into a bit of detail about the agent and environment below. The notebook used to generate the results is here.

### the agent – DQN

DeepMind’s early work with Q-Learning and Atari games is foundational in modern reinforcement learning. The use of a deep convolutional neural network allowed the agent to learn from raw pixels (known as end to end deep learning). The use of experience replay and target networks improved learning stability, and produced agents that could generalize across a variety of different games.

The initial 2013 paper (Mnih et al. 2013) was so significant that in 2014 DeepMind were purchased by Google for around £400M. This is for a company with no product, no revenue, no customers and a few employees.

The DQN algorithm used in the second DeepMind Atari paper (Mnih et al. 2015) is shown below.

**Figure 1 – DQN algorithm as given by Mnih et al. (2015) – I’ve added annotation in green.**

In Q-Learning the agent learns to approximate the expected discounted return for each action. The optimal action is then selected by argmaxing over *Q(s,a)* for each possible action. This argmax operation allows Q-Learning to learn off-policy – to learn from experience generated by other policies.
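The argmax and the one-step target can be sketched in a few lines of plain Python. This is only an illustration: the function names, the discount rate and the example values are my assumptions here, not the energy_py implementation.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest estimated return."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def bellman_target(reward, next_q_values, discount=0.99, done=False):
    """One-step Q-Learning target: r + discount * max over a' of Q(s', a')."""
    if done:
        # no future return after a terminal transition
        return reward
    return reward + discount * max(next_q_values)


q_values = [0.1, 0.7, 0.3]        # hypothetical Q(s, a) for three actions
print(greedy_action(q_values))    # prints 1
```

The max over the next state's values is what makes the update off-policy: the target does not depend on which action the behaviour policy actually took next.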

Experience replay makes learning more independent and identically distributed by sampling randomly from the experience of previous policies. It is also possible to use human generated experience with experience replay. Experience replay can be used because Q-Learning is an off-policy learning algorithm.
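A minimal sketch of an experience replay memory, assuming uniform random sampling. The class and method names are hypothetical and chosen for illustration; they are not the energy_py API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # a deque with maxlen drops the oldest experience once it is full
        self.experiences = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.experiences.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks up the correlation between consecutive steps
        return random.sample(list(self.experiences), batch_size)

    def __len__(self):
        return len(self.experiences)
```

Human generated experience could be added through the same `remember` call, since the buffer does not care which policy produced a transition.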

A target network is used to improve learning stability by creating training Bellman targets from an older copy of the online *Q(s,a)* network. You can either copy the weights over every n steps or use a weighted average of previous parameters.
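Both ways of updating the target network can be sketched with weights held in plain lists. This is an assumption made to keep the example self-contained (a real implementation would operate on TensorFlow variables), and the value of `tau` is illustrative.

```python
def hard_update(online_weights):
    """Copy the online weights into the target network (done every n steps)."""
    return list(online_weights)


def soft_update(target_weights, online_weights, tau=0.001):
    """Move each target weight a small fraction tau towards the online weight."""
    return [
        (1.0 - tau) * target + tau * online
        for target, online in zip(target_weights, online_weights)
    ]


target = [0.0, 0.0]
online = [1.0, 2.0]
print(soft_update(target, online, tau=0.5))   # prints [0.5, 1.0]
```

A small `tau` makes the target network trail the online network slowly, which plays the same stabilising role as copying infrequently.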

One of the issues with Q-Learning is the requirement of a discrete action space. In this example I discretize the action space into 100 actions. The balance with discretization is:

– too low = control is coarse

– too high = computational expense
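As a rough sketch of the discretization step, a continuous range can be split into evenly spaced levels. The function name and the 2 MW limit below are assumptions for illustration only.

```python
def discretize(low, high, num_levels):
    """Return num_levels evenly spaced values from low to high inclusive."""
    if num_levels < 2:
        raise ValueError("need at least two levels")
    step = (high - low) / (num_levels - 1)
    return [low + i * step for i in range(num_levels)]


# a hypothetical battery that can charge at up to 2 MW
print(discretize(0.0, 2.0, 5))   # prints [0.0, 0.5, 1.0, 1.5, 2.0]
```

Increasing `num_levels` gives finer control at the cost of more output nodes for the network to learn.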

I use a neural network to approximate *Q(s,a)*. I’m using TensorFlow as the library to provide the machinery for using and improving this simple two-layer neural network. Even though I’m using the DQN algorithm I’m not using a particularly deep neural network.

I use ReLUs between the layers and no batch normalization. I preprocess the inputs (removing the mean and scaling by the standard deviation) and targets (min-max normalization) used with the neural network using energy_py Processor objects. I use the Adam optimizer with a learning rate of 0.0025.
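The two preprocessing steps are simple enough to sketch with the standard library. These helper functions are hypothetical stand-ins written for this post; they are not the energy_py Processor objects, and they assume the statistics are computed over a fixed batch of values.

```python
from statistics import mean, pstdev


def standardize(values):
    """Remove the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Scale values into the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


print(min_max([10.0, 20.0, 30.0]))   # prints [0.0, 0.5, 1.0]
```

Both helpers divide by zero if every value is identical, so a real implementation would need to guard against constant inputs.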

The network has one output node per action – since I choose to discretize the action space with 5 discrete actions for each action, there are 10 total discrete actions and 10 output nodes in the neural network.

There are a number of other hyperparameters to tune such as the rate of decay of epsilon for exploration and how frequently to update the target network to keep learning stable. I set these using similar ratios to the 2015 DeepMind Atari paper (adjusting the ratios for the total number of steps I train for each experiment).
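A linear decay of epsilon can be sketched as below. The defaults of 1.0 and 0.1 follow the annealing range used in the 2015 Atari paper; the function name and the choice to express the decay length in steps are my assumptions, not the exact values used in these experiments.

```python
def linear_epsilon(step, decay_steps, initial=1.0, final=0.1):
    """Linearly anneal epsilon from initial to final over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return initial + fraction * (final - initial)


# explore fully at the start, then hold at the final value once decayed
for step in (0, 5000, 10000, 50000):
    print(step, linear_epsilon(step, decay_steps=10000))
```

Setting `decay_steps` as a fixed fraction of the total training steps is one way to keep the ratio constant across experiments of different lengths.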

**Figure 2 – DQN hyperparameters, Mnih et al. (2015).**

### the environment – battery storage

The battery storage environment I’ve built models storing cheap electricity and discharging it when it’s expensive (price arbitrage). This isn’t the only application of battery storage – Tesla’s 100 MW, 129 MWh battery in South Australia is being used for fast frequency response with impressive results.

I’ve tried to make the environment as Markov as possible – given a perfect forecast enough steps ahead I think the battery storage problem is pretty Markov. The challenge of using this in practice comes from having to use imperfect price forecasts.

The state space for the environment is the true price of electricity and the charge of the battery at the start of the step. The electricity price follows a fixed profile defined in state.csv.

The observation space is a perfect forecast of the electricity price five steps ahead. The number of steps ahead required for the Markov property will depend on the profile and the discount rate.
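One way to picture the observation is as a slice of the price profile. This sketch rests on several assumptions that the text above does not settle: that the forecast starts at the current step, that the battery charge is appended to the observation, and that the repetitive profile wraps around at its end. The function name is hypothetical.

```python
def make_observation(prices, step, charge, horizon=5):
    """Perfect price forecast for the next `horizon` steps plus current charge."""
    # wrap around the end of the profile, assuming the profile repeats
    forecast = [prices[(step + i) % len(prices)] for i in range(horizon)]
    return forecast + [charge]


profile = [10, 20, 30, 40, 50, 60]   # hypothetical price profile
print(make_observation(profile, step=4, charge=0.5, horizon=3))
# prints [50, 60, 10, 0.5]
```

A longer `horizon` lets the agent see further ahead, at the cost of a larger input to the network.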

The action space is a one-dimensional array with two elements – the first element being the charge and the second the discharge. The net effect of the action on the battery is the difference between the two.

The reward is the net rate of charge or discharge multiplied by the current price of electricity. The rate is net of an efficiency penalty applied to charging electricity. At a 90% efficiency a charge rate of 1 MW for one hour would result in only 0.9 MWh of electricity stored in the battery.
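A single environment step can be sketched as follows. The sign convention (charging costs money, discharging earns it), the one hour step length and the function name are assumptions for illustration, and capacity limits are left out to keep the sketch short.

```python
def battery_step(charge_mw, discharge_mw, stored_mwh, price,
                 efficiency=0.9, hours=1.0):
    """Return (new stored energy in MWh, reward) for one step."""
    net_mw = charge_mw - discharge_mw
    if net_mw > 0:
        # only a fraction of the electricity bought ends up in the battery
        delta_mwh = net_mw * hours * efficiency
    else:
        delta_mwh = net_mw * hours
    # assumed convention: pay for net charging, get paid for net discharging
    reward = -net_mw * hours * price
    return stored_mwh + delta_mwh, reward


stored, reward = battery_step(1.0, 0.0, stored_mwh=0.0, price=50.0)
print(stored, reward)
```

Charging at 1 MW for one hour at 90% efficiency leaves 0.9 MWh in the battery, matching the example above, while the agent still pays for the full 1 MWh.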

### results

The optimal operating strategy for energy storage is very application dependent. Given the large number of potential applications of storage this means a large number of optimal operating patterns are likely to exist.

The great thing about using reinforcement learning to learn these patterns is that we can use the same algorithm to learn any pattern. Building virtual environments for all these different applications is the first step in proving this.

Below is an HTML copy of the Jupyter Notebook used to run the experiment. You can see the same notebook on GitHub here.

### further work

Building the energy_py library is the most rewarding project in my career so far. I’ve been working on it for around one year, taking inspiration from other open source reinforcement learning libraries and improving my Python & reinforcement learning understanding along the way. My TODO list for energy_py is massive!

Lots of work to do to make the DQN code run faster. No doubt I’m making silly mistakes! I’m using two separate Python classes for the online and target network – it might be more efficient to have both networks be part of the same object. I also need to think about combining graph operations to reduce the number of sess.run(). Prioritized experience replay is another option to improve sample efficiency.

Less Markov & more realistic state and observation spaces – giving the agent imperfect forecasts. Multiple experiments across different random seeds.

**Test ability to generalize to unseen profiles**. This is the most important one. The current agent has the ability to memorize what to do (rather than understand the dynamics of the MDP).

I’ve just finished building a Deterministic Policy Gradient which I’m looking forward to playing around with.

Thanks for reading!

Jacob

First: it seems to me that if the agent is a price-taker and being used only for price arbitrage, the action space can be consolidated to three discrete options (charge at full power, discharge at full power, do nothing). What’s the intuition behind allowing partial (dis-)charging?

Second: if operating at the transmission level, the globally optimal solution should come from the battery taking part in the system operator’s optimization, rather than responding to prices after the system optimization is performed. In this case the ML problem becomes how to construct the optimal bid/offer curve. Is something preventing this in current markets?

Adam Green (post author)

Partial discharging is allowed to give the agent finer control – it is possible it might want to only charge at the rate of 1 MW. In reality, the size of the discrete action space should be tuned like any other hyperparameter.

The other impact here is the use of two actions (one for charging, one for discharging). This increases the discrete space size as well (one array for each combination of charge & discharge rates). This is an issue with Q-Learning – adding a single element to the action array exponentially increases the number of state-action combinations that need to be argmaxed over.

The action-space formulation could definitely be improved and it’s something I will think about.

The operation here assumes that prices reflect what’s actually happening on the grid. In many markets there is a disconnect – for example customer demand charges which are based on the customer’s load shape versus demand charges that reflect the global cost of the grid.