Saving of memory and value function after each episode
This quality-of-life improvement has a major impact on the effectiveness of training agents using energy_py. It means an agent can keep learning from experience gathered in a previous training session.
As I train models on my local machine, I can often only dedicate enough time for 10 episodes of training. Saving the memory & value functions allows an agent to learn from hundreds of episodes without having to run them all in a single session.
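A minimal sketch of this kind of checkpointing, assuming a simple list-based experience memory (the function and file names here are hypothetical, not the actual energy_py API):

```python
import os
import pickle

# hypothetical sketch: persist an agent's experience memory to disk
# so a later training session can pick up where this one left off
def save_memory(memory, path):
    with open(path, 'wb') as f:
        pickle.dump(memory, f)

def load_memory(path):
    # start with an empty memory if no previous session saved one
    if not os.path.exists(path):
        return []
    with open(path, 'rb') as f:
        return pickle.load(f)

memory = load_memory('agent_memory.pkl')
memory.append(('state', 'action', 'reward', 'next_state'))
save_memory(memory, 'agent_memory.pkl')
```

The same idea applies to the value function weights, which can be saved and restored with the relevant framework's own checkpointing tools.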
Running each episode on a different time series
Agents are now trained on randomly selected weeks of the year. It's much more useful for an agent to experience two different weeks of CHP operation than to experience the same week over and over again. It should also help the agent generalize to operating data sets it hasn't seen before.
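Sampling a random week can be sketched as slicing a random contiguous window out of a year of data. This assumes half-hourly data (48 samples per day); the function name is illustrative, not the energy_py implementation:

```python
import random

SAMPLES_PER_WEEK = 48 * 7  # half-hourly data, 7 days

def sample_week(series, samples_per_week=SAMPLES_PER_WEEK):
    # choose a random start index that still leaves room for a full week
    start = random.randint(0, len(series) - samples_per_week)
    return series[start:start + samples_per_week]

year = list(range(48 * 365))   # stand-in for a year of operating data
episode_data = sample_week(year)
```

Each episode then runs over a different `episode_data` slice rather than the same fixed week.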
Building another agent has been a todo for energy_py for a long time. I've built a Double Q-Learner, based on the algorithm given in Sutton & Barto. The key extension in Double Q-Learning is to maintain two value functions.
The policy is generated using the average of the estimates from both Q networks. One network is then randomly selected for training, using a target created by the other network.
The thinking behind Double Q-Learning is that we can avoid the maximization bias of Q-Learning. The maximization operation used to estimate the value of the next state introduces a positive bias, leading to overoptimistic value estimates for state-action pairs.
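The scheme above can be sketched in tabular form (the Sutton & Barto version, with dictionaries standing in for the Q networks; state and action names are illustrative):

```python
import random
from collections import defaultdict

# two independent value tables; the policy acts on their average,
# and each update trains one table using the other's estimate
alpha, gamma = 0.1, 0.9
QA = defaultdict(float)
QB = defaultdict(float)
actions = [0, 1]

def policy(state):
    # greedy action over the sum (equivalently, average) of both estimates
    return max(actions, key=lambda a: QA[(state, a)] + QB[(state, a)])

def update(state, action, reward, next_state):
    if random.random() < 0.5:
        # QA selects the best next action, QB evaluates it
        best = max(actions, key=lambda a: QA[(next_state, a)])
        target = reward + gamma * QB[(next_state, best)]
        QA[(state, action)] += alpha * (target - QA[(state, action)])
    else:
        # roles reversed: QB selects, QA evaluates
        best = max(actions, key=lambda a: QB[(next_state, a)])
        target = reward + gamma * QA[(next_state, best)]
        QB[(state, action)] += alpha * (target - QB[(state, action)])
```

Decoupling action selection from action evaluation is what removes the positive bias: an action that one table overestimates is valued by the other table, which is unlikely to share the same error.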
Next major tasks are:
1 – build a policy gradient method – most likely a Monte Carlo policy gradient algorithm,
2 – build a demand side response environment.
Thanks for reading!