Category Archives: Reinforcement Learning

energy_py update – July 2017

energy_py is a collection of reinforcement learning agents and environments for energy systems. You can read the introductory blog post for the project and check out the repo on GitHub.

Saving of memory and value function after each episode

This quality of life improvement has a major impact on the effectiveness of training agents using energy_py. It means an agent can keep learning from experience that occurred during a different training session.

As I train models on my local machine, I can often only dedicate enough time for 10 episodes of training per session. Saving the memory & value functions allows an agent to learn from hundreds of episodes of experience without having to generate them all in a single run.
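
A minimal sketch of how this persistence could work, assuming the agent holds its experience in a memory list and its value function in a Keras model called q_network (both attribute names are illustrative, not the actual energy_py internals):

import os
import pickle

def save_agent(agent, path='agent_brain'):
    # persist the experience replay memory and the value function
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, 'memory.pkl'), 'wb') as f:
        pickle.dump(agent.memory, f)
    agent.q_network.save(os.path.join(path, 'q_network.h5'))  # Keras model

def load_agent(agent, path='agent_brain'):
    # restore memory & value function so training can continue across sessions
    from keras.models import load_model
    with open(os.path.join(path, 'memory.pkl'), 'rb') as f:
        agent.memory = pickle.load(f)
    agent.q_network = load_model(os.path.join(path, 'q_network.h5'))
    return agent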

Running each episode on a different time series

Each episode is now run on a randomly selected week of the year. It’s much more useful for an agent to experience two different weeks of CHP operation than to experience the same week over and over again. It should also help the agent generalize to operating data sets it hasn’t seen before.
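
A sketch of what sampling a random week could look like, assuming the raw data is a pandas DataFrame of half-hourly observations (the DataFrame name and episode length here are assumptions):

import numpy as np
import pandas as pd

def sample_week(data: pd.DataFrame, episode_length: int = 336) -> pd.DataFrame:
    # pick a random start point so that a full week fits within the data
    start = np.random.randint(0, len(data) - episode_length)
    return data.iloc[start:start + episode_length]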

Double Q-Learning

Building another agent has been on the energy_py todo list for a long time. I’ve built a Double Q-Learner, based on the algorithm given in Sutton & Barto. The key extension in Double Q-Learning is to maintain two value functions.

The policy is generated using the average of the estimates from both Q networks. One network is then randomly selected for training, using a target created by the other network.

The thinking behind Double Q-Learning is that it avoids the maximization bias of Q-Learning. A positive bias is introduced by using a maximization operation to estimate the value of the next state, which leads to overoptimistic estimates of the value of state-action pairs.
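
A tabular sketch of the idea, following the Sutton & Barto version of the algorithm (the dictionaries and action list are assumed; the energy_py agent uses neural networks rather than tables):

import random

def double_q_update(Q1, Q2, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    # randomly choose which value function to update this step
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    # the updated function selects the greedy next action...
    best_next = max(actions, key=lambda a: Q1[(next_state, a)])
    # ...while the other function evaluates it, avoiding the maximization bias
    target = reward + gamma * Q2[(next_state, best_next)]
    Q1[(state, action)] += alpha * (target - Q1[(state, action)])

def greedy_action(Q1, Q2, state, actions):
    # the policy is generated from the average of both estimates
    return max(actions, key=lambda a: (Q1[(state, a)] + Q2[(state, a)]) / 2)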

Next major tasks are:
1 – build a policy gradient method – most likely a Monte Carlo policy gradient algorithm,
2 – build a demand side response environment.

Thanks for reading!

A Glance at Q-Learning

‘A Glance at Q-Learning’ is a talk I recently gave at the Data Science Festival in London. The talk was one I also gave in Berlin at the Berlin Machine Learning group.

Q-Learning is a reinforcement learning algorithm that DeepMind used to play Atari games – work which some call the first step towards a general artificial intelligence. The original 2013 paper is available here (I cover this paper in the talk).

It was a wonderful experience being able to present – I recommend checking out more of the talks on the Data Science Festival YouTube – all of which are higher quality, more interesting and better presented than mine!

You can download a copy of my slides here – A Glance at Q-Learning slides.

Thanks for reading!

energy_py – reinforcement learning in energy systems

energy_py is reinforcement learning in energy systems – a collection of reinforcement learning agents and environments built in Python.

I have a vision of using reinforcement learners to optimally operate energy systems.  energy_py is a step towards this vision.  I’ve built this because I’m so excited about the potential of reinforcement learning in the energy industry.

Reinforcement learning in energy systems requires first proving the concepts in a virtual environment.  This project demonstrates the ability of reinforcement learning to control a virtual energy environment.

What is reinforcement learning

supervised vs unsupervised vs reinforcement

Reinforcement learning is the branch of machine learning where learning occurs through action.  Reinforcement learning will give us the tools to operate our energy systems at superhuman levels of performance.

It’s quite different from supervised learning. In supervised learning we start out with a big data set of features and our target. We train a model to replicate this target from patterns in the data.

In reinforcement learning we start out with no data. The agent generates data by interacting with the environment. The agent then learns patterns in this data. These patterns help the agent to choose actions that maximize total reward.
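
A minimal sketch of this interaction loop, assuming a gym-style interface (the method names are illustrative rather than the exact energy_py API):

def run_episode(env, agent):
    # one episode of the agent-environment loop described above
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                    # agent chooses an action
        next_state, reward, done = env.step(action)  # environment responds
        agent.remember(state, action, reward, next_state, done)
        agent.learn()                                # learn from the generated data
        state = next_state
        total_reward += reward
    return total_reward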

Why do we need reinforcement learning in energy systems

Optimal operation of energy assets is already very challenging. Our current energy transition is making this difficult problem even harder. The rise of intermittent and distributed generation is introducing volatility and increasing the number of actions available to operators.

For a wide range of problems machine learning results are both state of the art and better than human experts. We can get this level of performance using reinforcement learning in our energy systems.

Today many operators use rules or abstract models to dispatch assets. A set of rules is not able to guarantee optimal operation in many energy systems.

Optimal operating strategies can be developed from abstract models, yet abstract models (such as linear programming) are constrained to approximations of the actual plant.  Reinforcement learners are able to learn directly from their experience of the actual plant.

Reinforcement learning can also deal with non-linearity. Most energy systems exhibit non-linear behavior (in fact an energy balance is bi-linear!). Reinforcement learning can model non-linearity using neural networks. It is also able to deal with the non-stationary and partially hidden environments found in many energy systems.


There are challenges to be overcome. The first and most important is safety – the number one concern in any engineering discipline. What is important to understand is that we limit the actions available to the agent; all lower-level control systems would remain exactly the same.

There is also the possibility to design the reward function to incentivize safety. A well-designed reinforcement learner could actually reduce hazards to operators.

A final challenge worth addressing is the impact such a learner could have on employment. Machine learning is not a replacement for human operators. A reinforcement learner would not need a reduction in employees to be a good investment.

The value of using a reinforcement learner is to let operations teams do their jobs better. It frees them to spend more time on, and improve performance in, their remaining responsibilities such as maintaining the plant.  The value created here is a better-maintained plant and a happier workforce – in a plant that is operating with superhuman levels of economic and environmental performance.

Any machine requires downtime – a reinforcement learner is no different. There will still be time periods where the plant will operate in manual or semi-automatic modes with human guidance.

energy_py is one step on a long journey of getting reinforcement learners helping us in the energy industry. The fight against climate change is the greatest that humanity faces. Reinforcement learning will be a key ally in fighting it.

 

energy_py is built using Python.  You can check out the repository on GitHub here.

Initial Results

I’m really looking forward to getting to know energy_py.  There are a number of parameters to tune. For example the structure of the environment or the design of the reward function can be modified to make the reinforcement learning problem more challenging.

One key design choice is the number of assets the agent has to control.  The more choices available to the agent the more complex the shape of the value function becomes.  To approximate a more complex value function we may need a more complex neural network.

There is also a computational cost incurred with increasing the number of actions.  More actions means more [state, action] pairs to consider during action selection and training (both of which require value function predictions).
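
A sketch of why this cost grows with the action space – greedy action selection needs one value function prediction per possible action (the predict call and the [state, action] encoding are assumptions):

import numpy as np

def greedy_action_selection(value_function, state, action_space):
    # one forward pass for every [state, action] pair
    q_values = [value_function.predict(np.hstack([state, action]).reshape(1, -1))
                for action in action_space]
    return action_space[int(np.argmax(q_values))]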

So far I’ve been experimenting with an environment based on two assets – a 7.5 MWe gas turbine and a 7.5 MWe gas engine.  The episode length is set to 336 steps (one week).  I run a single naive episode, 30 ε-greedy episodes and a single greedy episode.
Figure 1 below shows the total reward per episode increasing as the agent improves its estimate of the value function and spends less time exploring.
Figure 1 – Epsilon decay and total reward per episode

Figure 2 shows the Q-test and the network training history.  Q-test is the average value of three random [state, action] pairs evaluated by the value function.  It shows how the value function approximation changes over time.

Figure 2 – Q-Test and the network training history

Figure 3 shows some energy engineering outputs.  I’m pretty happy with this operating regime – the model roughly follows both the electricity price and the heat demand, which is the expected behaviour.

Figure 3 – Energy engineering outputs for the final greedy run

 One interesting decision to make is how often to improve the approximation of the value function.  David Silver makes the point that you should make use of the freshest data when training – hence training after each step through the environment.  He also makes the point that you don’t need to fully train the network – just train a ‘little bit’.

This makes sense, as the distribution of the data (the replay memory) changes as the learner trains its value function.  I train on a 64 sample batch of the replay memory and perform 100 passes over that data (i.e. 100 epochs).   Both values can be optimized.  It could make sense to train for more epochs in later episodes, as we want to fit the data more closely than in earlier episodes.
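
A sketch of how this training step might look, assuming the replay memory holds (state, action, reward, next_state, done) tuples and the value function is a Keras model mapping a [state, action] vector to a single Q-value (both are assumptions about the internals):

import random
import numpy as np

def train_value_function(model, memory, action_space,
                         batch_size=64, epochs=100, gamma=0.9):
    # sample a random batch from the replay memory
    batch = random.sample(memory, min(batch_size, len(memory)))
    inputs, targets = [], []
    for state, action, reward, next_state, done in batch:
        if done:
            target = reward
        else:
            # Bellman target: reward plus discounted value of the best next action
            next_qs = [model.predict(np.hstack([next_state, a]).reshape(1, -1))
                       for a in action_space]
            target = reward + gamma * float(np.max(next_qs))
        inputs.append(np.hstack([state, action]))
        targets.append(target)
    # train a 'little bit' - a number of epochs over the sampled batch
    model.fit(np.array(inputs), np.array(targets), epochs=epochs, verbose=0)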

Another challenge in energy_py is balancing exploration versus exploitation.  The Q_learner algorithm handles this dilemma using ε-greedy action selection.  I decay epsilon at a fixed rate – the optimal selection of this parameter is something I’ll take a look at in the future.
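
One common fixed-rate scheme is multiplicative decay with a floor – a small sketch (the numbers here are illustrative, not the values used in energy_py):

def decay_epsilon(epsilon, decay_rate=0.999, epsilon_min=0.1):
    # decay epsilon by a fixed factor each step, never dropping below the floor
    return max(epsilon_min, epsilon * decay_rate)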

There are many exciting innovations developed recently in reinforcement learning that I’m keen to add to energy_py.  One example is the idea of Prioritized Experience Replay – where the batch is not taken randomly from the replay memory but instead prioritizes some samples over others.

It’s unlikely that I’ll ever catch up to the state of the art in reinforcement learning – what I hope to find is that we don’t need state of the art techniques to get superhuman performance from energy systems!

My reinforcement learning journey

I’m a chemical engineer by training (B.Eng, MSc) and an energy engineer by profession. I’m really excited about the potential of machine learning in the energy industry – in fact that’s what this blog is about!

My understanding of reinforcement learning has come from a variety of resources. I’d like to give credit to all of the wonderful resources I’ve used to understand reinforcement learning.

Sutton & Barto – Reinforcement Learning: An Introduction – the bible of reinforcement learning and a classic machine learning text.

Playing Blackjack with Monte Carlo Methods – I built my first reinforcement learning model to operate a battery using this post as a guide. This post is part two of an excellent three part series. Many thanks to Brandon of Δ ℚuantitative √ourney.

RL Course by David Silver – over 15 hours of lectures from Google DeepMind’s David Silver. An amazing resource from a brilliant mind and a brilliant teacher.

Deep Q-Learning with Keras and gym – a great blog post that showcases code for a reinforcement learning agent to control an OpenAI Gym environment. Useful both for the gym integration and for using Keras to build a non-linear value function approximation. Many thanks to Keon Kim – check out his blog here.

Artificial Intelligence and the Future – Demis Hassabis is the co-founder and CEO of Google DeepMind.  In this talk he gives some great insight into the AlphaGo project.

Mnih et al. (2013) Playing Atari with Deep Reinforcement Learning – to give you an idea of the importance of this paper, Google purchased DeepMind after it was published.  DeepMind was a company with no revenue, no customers and no product – reportedly valued by Google at around $500M!  This is a landmark paper in reinforcement learning.

Mnih et al. (2015) Human-level control through deep reinforcement learning – an update to the 2013 paper, published in Nature.

I would also like to thank Data Science Retreat.  I’m just finishing up the three month immersive program – energy_py is my project for the course.  Data Science Retreat has been a fantastic experience and I would highly recommend it.  The course is a great way to invest in yourself, develop professionally and meet amazing people.

That’s it from me – thanks for reading!

Monte Carlo Q-Learning to Operate a Battery

I have a vision for using machine learning for optimal control of energy systems.  If a neural network can play a video game, hopefully it can understand how to operate a power plant.

In my previous role at ENGIE I built Mixed Integer Linear Programming models to optimize CHP plants.  Linear Programming is effective in optimizing CHP plants but it has limitations.

I’ll detail these limitations in a future post – this post is about Reinforcement Learning (RL).  RL is a tool that can overcome some of the limitations inherent in Linear Programming.

In this post I introduce the first stage of my own RL learning process. I’ve built a simple model to charge/discharge a battery using Monte Carlo Q-Learning. The script is available on GitHub.

I made use of two excellent blog posts to develop this, both of which give a good introduction to RL.

Features of the script
 

As I don’t have access to a battery system, I’ve built a simple model within Python.  The battery model takes as inputs the state at time t and the action selected by the agent, and returns the reward and the new state.  The reward is the cost/value of electricity charged/discharged.

def battery(state, action):
    """A simple technical model of the battery.

    The battery can choose to:
        discharge 10 MWh (action = 0)
        charge 10 MWh    (action = 1)
        do nothing       (action = 2)
    """
    charge = state[0]  # current charge level [MWh]
    SP = state[1]      # current settlement period
    prices = getprices()    # price profile, defined elsewhere in the script
    price = prices[SP - 1]  # the price in this settlement period [£/MWh]

    if action == 0:  # discharging
        new_charge = max(0, charge - 10)
        charge_delta = charge - new_charge
        reward = charge_delta * price   # value of electricity discharged
    elif action == 1:  # charging
        new_charge = min(100, charge + 10)
        charge_delta = charge - new_charge
        reward = charge_delta * price   # negative reward = cost of charging
    else:  # do nothing
        charge_delta = 0
        reward = 0

    new_charge = charge - charge_delta
    new_SP = SP + 1
    state = (new_charge, new_SP)
    return state, reward, charge_delta

The price of electricity varies throughout the day.
The model is not fed this data explicitly – instead it learns through interaction with the environment.
 
One ‘episode’ is equal to one day (48 settlement periods).  The model runs through thousands of episodes and learns the value of taking each action in each state.
 
Learning occurs by apportioning the reward for the entire episode to every state/action that occurred during that episode. While this simple approach works, more advanced methods assign credit to individual actions in smarter ways (a sketch of this credit assignment follows the update code below).
def updateQtable(av_table, av_count, returns):
    # update the action-value (Q) table with an incremental average
    # of the returns observed for each state/action pair
    for key in returns:
        av_table[key] += (1 / av_count[key]) * (returns[key] - av_table[key])
    return av_table
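
A sketch of how the returns and visit-count dictionaries fed into updateQtable might be built from a single episode, crediting the whole episode reward to every state/action visited (the exact bookkeeping in the script may differ):

def episode_returns(episode, av_count):
    # episode is a list of ((state, action), reward) tuples
    total_reward = sum(reward for _, reward in episode)
    returns = {}
    for key, _ in episode:
        returns[key] = total_reward               # every visited pair gets the episode return
        av_count[key] = av_count.get(key, 0) + 1  # visit counts across all episodes
    return returns, av_count
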
The model uses an epsilon-greedy method for action selection.  Epsilon is decayed as the number of episodes increases.
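
A sketch of what ε-greedy selection over the tabular action-value function could look like for this battery model (the av_table keys are assumed to be (state, action) pairs):

import random

def select_action(av_table, state, epsilon, actions=(0, 1, 2)):
    if random.random() < epsilon:
        return random.choice(actions)  # explore: random charge/discharge/nothing
    # exploit: the action with the highest estimated value for this state
    return max(actions, key=lambda a: av_table[(state, a)])
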
Results
 
Figure 1 below shows the optimal dispatch for the battery model after training for 5,000 episodes.
Figure 1 – Electricity prices [£/MWh] and the optimal battery dispatch profile [%]
I’m happy the model is learning well. Charging occurs during periods of low electricity prices. It is also fully draining the battery at the end of the day – which is logical behavior to maximise the reward per episode.  
 

Figure 2 below shows the learning progress of the model.

Figure 2 – Model learning progress
Next steps
 
Monte Carlo Q-Learning is a good place to start with RL. It’s helped me to begin to understand some of the key concepts.
 
Next steps will be to develop more advanced Q-Learning methods using neural networks.