<h1>From Vim & QWERTY to Neovim & DVORAK</h1>
<p>Adam Green - 2023-11-21</p>
<blockquote>
<p>There are no veils, curtains, doors, walls or anything between what pours out of Bob’s hand onto the page and what is somehow available to the core of people who are believers in him.</p>
</blockquote>
<p><img src="/assets/dvorak/dylan.png" alt="" /></p>
<blockquote>
<p>There’s some people who’d say ‘You know, not interested’.</p>
<p>But if you’re interested, he goes way, way deep.</p>
<p>Joan Baez on Bob Dylan - No Direction Home</p>
</blockquote>
<p><strong>I’m a Neovim, DVORAK & split keyboard user</strong>.</p>
<p>This post details my transitions between these tools:</p>
<ul>
<li>Atom to Vim to Neovim,</li>
<li>QWERTY to DVORAK keyboard layout,</li>
<li>a traditional to split keyboard.</li>
</ul>
<p>At the end of this post is a table summarizing each transition:</p>
<ul>
<li>how long it took,</li>
<li>the productivity increase,</li>
<li>health improvements,</li>
<li>whether I would recommend it.</li>
</ul>
<h1 id="the-journey">The Journey</h1>
<p>I started my programming journey in early 2017 - on a Windows laptop using the now deceased Atom editor. I didn’t know any better!</p>
<p>Programming is both my profession and a hobby. <strong>I enjoy working on improving my tools and workflows</strong>, which have additional benefits of making me a more effective programmer and improving the health of my aging, tired body.</p>
<p><strong>Not all developers are like this</strong> - some of the best programmers I’ve worked with have no interest in changing keyboard shortcuts, let alone learning Lua to configure their text editor.</p>
<p>To each their own - <strong>but if you are interested, this goes way, way deep</strong>.</p>
<h1 id="vim">Vim</h1>
<p>I started to learn Vim in the Christmas holidays of 2017 - I cannot remember exactly why.</p>
<p><img src="/assets/dvorak/vim.png" alt="" /></p>
<p>The first few days were tough - it took around a week to feel comfortable with the basics of Vim such as <code class="language-plaintext highlighter-rouge">hjkl</code>, the different modes (Normal, Insert etc), moving between splits and moving to different places in a file.</p>
<p><strong>After two weeks I felt as productive as I was in Atom</strong> - beyond that my productivity has become more powerful than you could possibly imagine.</p>
<p>Over the years I added colorschemes, plugins, keybinds, macros & abbreviations - <a href="https://github.com/ADGEfficiency/dotfiles/blob/master/dotfiles/.vimrc">you can find my final <code class="language-plaintext highlighter-rouge">.vimrc</code> here</a>.</p>
<p>I do still use Vim when I’m working on remote servers - sometimes I’ll clone my <a href="https://github.com/ADGEfficiency/dotfiles">dotfiles</a> if I’ll be working there for a while and don’t want to install Neovim.</p>
<p>Alongside Vim I use Tmux and fzf. <strong>Tmux and fzf are as crucial for making Vim your main text editor as Vim itself</strong>. Without any one of the three, my terminal-based development style would not work. This is one of the places where people get stuck with Vim - you need more than Vim to make a productive Vim setup.</p>
<p>Tmux is a terminal multiplexer - it lets you open terminal panes alongside each other in the same window, or switch between separate windows.</p>
<p>I use fzf for finding and opening files, both from the terminal with <code class="language-plaintext highlighter-rouge">**&lt;TAB&gt;</code> and within Vim using <code class="language-plaintext highlighter-rouge">&lt;Space&gt;</code> to run fzf in the current directory via a keybinding.</p>
<p>A healthy set of shell aliases and custom functions is also important when working with a terminal editor like Vim.</p>
<p>I use a script <code class="language-plaintext highlighter-rouge">s</code> to quickly use fzf to search for files to open in my <code class="language-plaintext highlighter-rouge">$EDITOR</code> from the current directory (<a href="https://github.com/ADGEfficiency/dotfiles/blob/master/scripts/s">script is here</a>):</p>
<p><img src="%7B%7B%22/assets/dvorak/s.png%22%7D%7D" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env zsh
TERM_HEIGHT=$(tput lines)
MIN_HEIGHT=20

# prompt for files using fzf, with a preview pane if the terminal is tall enough
files=$(if [ "$TERM_HEIGHT" -ge "$MIN_HEIGHT" ]; then
  fzf --preview 'bat -p --color=always {}' --height 60% -m
else
  fzf --no-preview --height 40% -m
fi)

# only open $EDITOR if fzf exited cleanly (not interrupted by Ctrl-C)
if [ $? -eq 0 ]; then
  $EDITOR ${(f)files}
fi
</code></pre></div></div>
<h2 id="do-i-recommend-vim">Do I Recommend Vim?</h2>
<p>Vim is amazing, but outdated - use Neovim instead.</p>
<p>Vim itself still has a lot going for it - the ecosystem of plugins and customization remains fantastic.</p>
<p>The initial configuration for Vim can be challenging - I would budget 1-2 days to get a basic setup working.</p>
<p>Even if you love VS Code, learning to use Vim is useful. It’s almost always available on remote servers, and it’s a better editor than other commonly available editors like nano.</p>
<p>Vim keybindings are also everywhere - you can enable them in the shell with <code class="language-plaintext highlighter-rouge">$ set -o vi</code> (instead of the default Emacs bindings), and many programs (IDEs or browsers) have Vim plugins.</p>
<h1 id="neovim">Neovim</h1>
<p>I started my transition to Neovim in July 2022 - motivated by the Vim9 script schism that divided the Vim community.</p>
<p><img src="/assets/dvorak/nvim.png" alt="" /></p>
<p>Transitioning to Neovim after 3 years of Vim was quick - the in-editor experience is very similar.</p>
<p>It took around half a day to convert my <code class="language-plaintext highlighter-rouge">.vimrc</code> to a functional Lua based setup, followed by a week or two of tweaking my config and adding plugins.</p>
<p>I was able to bring along all of my Vimscript plugins, which is a huge selling point of Neovim. I do prefer Lua written plugins where possible, but still use many of the same plugins as with Vim - <a href="https://github.com/ADGEfficiency/dotfiles/blob/master/nvim/lua/adam/plugins.lua">you can find all my Neovim plugins here</a>.</p>
<h2 id="do-i-recommend-neovim">Do I Recommend Neovim?</h2>
<p><strong>I would strongly recommend Neovim to anyone who is starting out with Vim or to experienced Vim users - it’s great</strong>.</p>
<p>Neovim is an improvement over Vim, and has a bubbling, exciting ecosystem of plugins and users. I have found the language servers, linting, formatting and completion experience an improvement over Vim.</p>
<p>If you are a Vim user it’s not a big transition - all of your Vimscript plugins will work as expected.</p>
<p>It’s nice to use Lua for configuration - it’s more flexible and a more useful, transferable skill than Vimscript.</p>
<p>If you want to get started with Neovim, look at <a href="https://github.com/nvim-lua/kickstart.nvim">kickstart.nvim</a>.</p>
<h1 id="dvorak">DVORAK</h1>
<p>I started my transition to DVORAK in May 2019 - motivated by a desire to improve the health of my hands.</p>
<p><img src="/assets/dvorak/layout.png" alt="" /></p>
<p>After two and a half years of programming, I was suffering from muscular soreness & tiredness in my hands. My hands felt fatigued - like they were doing too much work.</p>
<p>I was aware that there were alternative keyboard layouts designed to be kinder to our hands - after a day or two of sporadic, repetitive searching on Google about the different options, I decided to give DVORAK a shot.</p>
<p>It took me around 2 weeks to get back to a somewhat reasonable level of productivity, but I was not back to my previous QWERTY level.</p>
<p><strong>My typing remained inaccurate for a long time</strong> - I only felt like I was back to where I was in August 2021 - making the transition over a year long.</p>
<p>Sometimes my typing is still less accurate than it was (particularly as I use a keyboard with blank keycaps) - but it’s manageable. I don’t feel like DVORAK led to a significant productivity improvement.</p>
<h2 id="do-i-recommend-dvorak">Do I Recommend DVORAK?</h2>
<p>I would not recommend the DVORAK layout - while I’m glad I have done it and wouldn’t switch back, it takes a long, long time to get used to.</p>
<p>I can still type QWERTY if needed - it’s keyboard & context dependent. I can still type QWERTY on my phone without even realizing it’s a different layout.</p>
<h1 id="split-keyboard">Split Keyboard</h1>
<p>I started using a split keyboard in July 2021 - motivated by a desire to improve the health of my back.</p>
<p><img src="/assets/dvorak/ergo.png" alt="" /></p>
<p>Previously I had used the Apple Keyboard, then moved to a <a href="https://vortexgear.store/products/race-3-micro-usb">Vortex Race 3</a>, which I still use today. The split keyboard I use today is the <a href="https://ergodox-ez.com">Ergodox EZ</a>.</p>
<p><strong>The main benefit of a split keyboard is that your hands rest further apart</strong>. This allows your chest to expand, and reduces the strain on your upper and middle back - in particular reducing pain between the shoulder blades.</p>
<p>It took around 1 week to get back to the same level of productivity as on a traditional keyboard. I have found a moderate productivity increase using a split keyboard.</p>
<p>The Ergodox EZ allows customization of the keyboard layout using ORYX - <a href="https://configure.zsa.io/ergodox-ez/layouts/vJLGQ/latest/0">you can find my layout here</a>.</p>
<h2 id="do-i-recommend-a-split-keyboard">Do I Recommend a Split Keyboard?</h2>
<p>I would recommend a split keyboard, especially if you have back pain between your shoulder blades - it’s a small time investment for a real health benefit.</p>
<p>I have no problem going back to a normal keyboard - unlike DVORAK, using a split keyboard will not impact your ability to use a normal keyboard.</p>
<h1 id="thoughts-on-dvorak-and-vim">Thoughts on DVORAK and Vim</h1>
<p>The combination of DVORAK and Vim is an interesting one - both are very opinionated about how you should use your keyboard.</p>
<p>I was already a proficient Vim user when I decided to switch to DVORAK.</p>
<p><strong>Foundational to any keyboard layout and Vim is remapping <code class="language-plaintext highlighter-rouge">&lt;CAPSLOCK&gt;</code> to <code class="language-plaintext highlighter-rouge">&lt;ESCAPE&gt;</code></strong>. In Vim you use the <code class="language-plaintext highlighter-rouge">&lt;ESCAPE&gt;</code> key to move from insert to normal mode - easy access to the escape key is essential.</p>
<h2 id="why-dvorak">Why DVORAK?</h2>
<p>Most computer keyboards are laid out in QWERTY - named after the first six keys of the top letter row.</p>
<p>The big idea in Dvorak is the importance of the middle row (also known as the home row).</p>
<p>Vim users know the importance of the home row from <code class="language-plaintext highlighter-rouge">hjkl</code> - the keys used for cursor movement in Vim. Dvorak puts all the vowels on the home row - the keys you access the most are closest to your fingers.</p>
<p>The other notable feature of Dvorak is the location of the punctuation characters <code class="language-plaintext highlighter-rouge">' , .</code> - these are located in a prime position.</p>
<p>An interesting thing about learning DVORAK was that the time to learn keys is long tailed. Some (such as <code class="language-plaintext highlighter-rouge">, .</code> and <code class="language-plaintext highlighter-rouge">aoeu</code>) come very easily, while others like <code class="language-plaintext highlighter-rouge">r y f g</code> took a while.</p>
<h2 id="losing-hjkl">Losing hjkl</h2>
<p>In Vim <code class="language-plaintext highlighter-rouge">hjkl</code> are used for cursor movement - they are the keys you use to move your cursor around a file in normal mode.</p>
<p>In DVORAK you lose the position and order of <code class="language-plaintext highlighter-rouge">hjkl</code>. Initially I considered remapping <code class="language-plaintext highlighter-rouge">hjkl</code> to the same position as QWERTY, but decided against it. It’s been fine.</p>
<h2 id="combinations-that-work-great">Combinations That Work Great</h2>
<p>There are some common Vim key combinations that feel great in Dvorak.</p>
<p><code class="language-plaintext highlighter-rouge">:</code> (Vim command mode) is easy to reach. <code class="language-plaintext highlighter-rouge">:w</code> and <code class="language-plaintext highlighter-rouge">:wq</code> feel great - you don’t need to move either hand.</p>
<p><code class="language-plaintext highlighter-rouge">"</code>, <code class="language-plaintext highlighter-rouge">,</code> and <code class="language-plaintext highlighter-rouge">.</code> are easy to reach. The keys of <code class="language-plaintext highlighter-rouge">.py</code> are all next to each other, and the two keys of <code class="language-plaintext highlighter-rouge">ls</code> sit side by side.</p>
<p><code class="language-plaintext highlighter-rouge">gcc</code> is easy to reach and <code class="language-plaintext highlighter-rouge">&lt;C-r&gt;</code> requires no hand movement.</p>
<h2 id="challenges">Challenges</h2>
<p>One challenge is anything <code class="language-plaintext highlighter-rouge">g</code> or <code class="language-plaintext highlighter-rouge">f</code> related. In Vim <code class="language-plaintext highlighter-rouge">gf</code> opens a file under the cursor - as these two keys are next to each other, it requires moving both hands from their natural position.</p>
<p>Another challenge are the <code class="language-plaintext highlighter-rouge">{}</code> and <code class="language-plaintext highlighter-rouge">[]</code> keys - on a DVORAK layout, these are hard to get at. A split keyboard helps this a lot, as you can put these on the thumb keys.</p>
<h1 id="summary">Summary</h1>
<p>Here is a summary of each of the tool and workflow transitions - years of human experience reduced to a Markdown table:</p>
<table>
<thead>
<tr>
<th>Transition</th>
<th>Time to Transition</th>
<th>Initial Setup Required</th>
<th>Productivity Increase</th>
<th>Health Improvement</th>
<th>Recommended</th>
</tr>
</thead>
<tbody>
<tr>
<td>Atom to Vim</td>
<td>2 weeks</td>
<td>1 day</td>
<td>High</td>
<td>None</td>
<td>✅</td>
</tr>
<tr>
<td>Vim to Neovim</td>
<td>1/2 day</td>
<td>1/2 day</td>
<td>Moderate</td>
<td>None</td>
<td>✅</td>
</tr>
<tr>
<td>QWERTY to DVORAK</td>
<td>>1 year</td>
<td>None</td>
<td>None</td>
<td>Less hand fatigue</td>
<td>❌</td>
</tr>
<tr>
<td>Split Keyboard</td>
<td>3 weeks</td>
<td>1/2 day</td>
<td>Moderate</td>
<td>Back pain relief</td>
<td>✅</td>
</tr>
</tbody>
</table>
<hr />
<p>Thanks for reading!</p>
<p>Take a look at my <a href="https://github.com/ADGEfficiency/dotfiles">dotfiles</a> if you’re interested in my setup - <a href="https://github.com/ADGEfficiency/dotfiles/tree/master/nvim">my Lua Neovim config is here</a>.</p>
<h1>Measuring Forecast Accuracy with Linear Programming</h1>
<p>Adam Green - 2023-02-23</p>
<p>This post introduces a methodology to measure the accuracy of an electricity price forecast using linear programming.</p>
<h2 id="predictive-accuracy-vs-business-value">Predictive Accuracy vs. Business Value</h2>
<p>The ideal forecast quality measurement directly aligns with a key business metric. Models can rarely be trained this way - instead they are trained with error measures that will look familiar to anyone who does gradient-based optimization, such as mean squared error.</p>
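<p>For contrast, a conventional statistical measure like mean squared error can be computed directly from prices, with no notion of cost. A quick sketch, using the first three intervals of the dataset that appears later in this post:</p>

```python
# mean squared error between the actual trading price and the AEMO
# predispatch forecast - first three intervals of the dataset below
actual = [177.11, 135.31, 143.21]
forecast = [97.58039, 133.10307, 138.59979]

mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
print(f"mse: {mse:.1f}")  # prints: mse: 2117.0
```

A number like this says nothing about what the forecast error costs - which is the gap the rest of this post fills.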
<p>This post uses linear programming to measure forecast quality in terms of a key business metric - cost.</p>
<p>A battery operating in price arbitrage is optimized using actual prices and forecast prices.</p>
<p>The forecast error can then be quantified by how much money dispatching the battery using the forecast leaves on the table versus dispatching with perfect foresight of prices.</p>
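<p>To make the methodology concrete, here is a self-contained toy version - not the energy-py-linear implementation, just a lossless 1 MW, 1 MWh battery dispatched optimally by dynamic programming, with the forecast-driven plan settled at actual prices. The price series are made up for illustration:</p>

```python
def plan(prices):
    """Optimal dispatch for a lossless 1 MW, 1 MWh battery via dynamic
    programming over state of charge - returns one action per interval
    (-1 charge, +1 discharge, 0 hold)."""
    n = len(prices)
    # value[t][s] = best profit from interval t onward in state s (0 empty, 1 full)
    value = [[0.0, 0.0] for _ in range(n + 1)]
    for t in range(n - 1, -1, -1):
        p = prices[t]
        value[t][0] = max(value[t + 1][0], -p + value[t + 1][1])  # hold or charge
        value[t][1] = max(value[t + 1][1], p + value[t + 1][0])   # hold or discharge
    actions, state = [], 0
    for t in range(n):
        p = prices[t]
        if state == 0 and -p + value[t + 1][1] > value[t + 1][0]:
            actions.append(-1)  # charge
            state = 1
        elif state == 1 and p + value[t + 1][0] > value[t + 1][1]:
            actions.append(1)  # discharge
            state = 0
        else:
            actions.append(0)  # hold
    return actions


def profit(actions, prices):
    """Revenue of a dispatch plan settled at a given set of prices."""
    return sum(a * p for a, p in zip(actions, prices))


actual = [50, 40, 90, 30, 80]
forecast = [45, 55, 85, 35, 60]

perfect = profit(plan(actual), actual)     # dispatch with perfect foresight
realised = profit(plan(forecast), actual)  # forecast plan, actual prices
print(perfect - realised)  # money left on the table by the forecast
```

With these made-up prices the perfect-foresight plan earns $100 while the forecast-driven plan earns $90 - a forecast error of $10.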
<h1 id="data">Data</h1>
<p>This work uses <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> for the battery linear program - you can find the code & data in <a href="https://github.com/ADGEfficiency/energy-py-linear/blob/main/examples/forecast-accuracy.py">examples/forecast-accuracy.py</a> - the full source code is also available at the bottom of this post.</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip install energypylinear
</code></pre></div></div>
<p>The dataset used is a single sample of the South Australian trading price and the AEMO predispatch price forecast.</p>
<p>Both the price and forecast are supplied by AEMO for the National Electricity Market (NEM) in Australia.</p>
<p>A simple plot of the price and forecast is shown below in Figure 1:</p>
<p><img src="/assets/linear-forecast/forecast.png" alt="" /></p>
<center>
<em>Figure 1 - South Australian trading price and predispatch forecast from July 2018.</em>
</center>
<h1 id="method">Method</h1>
<p>First we create an instance of the <code class="language-plaintext highlighter-rouge">Battery</code> class - a 2 MW, 4 MWh battery with a 90% round-trip efficiency, sized so it can chase the arbitrage opportunities in this dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
</code></pre></div></div>
<p>We then dispatch the battery using perfect foresight of prices:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">actuals</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'Trading Price [$/MWh]'</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Next we dispatch using the forecast:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forecasts</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'Predispatch Forecast [$/MWh]'</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
<p>We can then create <code class="language-plaintext highlighter-rouge">epl.Account</code> objects to represent the financials for these two simulations.</p>
<p>The trick is using the actuals interval data with the forecast simulation in <code class="language-plaintext highlighter-rouge">forecast_account</code> - this evaluates the economics with actual prices but dispatch optimized for forecasts:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate the variance between accounts
</span><span class="n">actual_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">actuals</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">forecast_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">forecasts</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">=</span> <span class="n">actual_account</span> <span class="o">-</span> <span class="n">forecast_account</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">forecast error: $ </span><span class="si">{</span><span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.2</span><span class="n">f</span><span class="si">}</span><span class="s"> pct: </span><span class="si">{</span><span class="mi">100</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span> <span class="o">/</span> <span class="n">actual_account</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.1</span><span class="n">f</span><span class="si">}</span><span class="s"> %"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">forecast error: $</span><span class="w"> </span>92.97 pct: 28.5 %
</code></pre></div></div>
<h1 id="discussion">Discussion</h1>
<h2 id="extend-to-different-domains">Extend to Different Domains</h2>
<p>The method above is specific to using batteries for wholesale price arbitrage.</p>
<p>The idea of using variance between two optimization runs with different inputs can be extended to many business problems.</p>
<p>If there is any error in the optimization (say convergence to a local minimum), then the final quality measurement combines error from both the forecast and from the optimization that used it.</p>
<p>A large capacity battery operating in price arbitrage does somewhat resemble arbitrage of stocks, so the error measurement might be useful for comparing forecasts. It’s less clear how useful this model would be for a temperature prediction.</p>
<h2 id="negative-value">Negative Value</h2>
<p>A challenge with using this measurement of forecast error is what happens when the net benefit of dispatching the battery to a forecast is negative - i.e. when the forecast quality is so bad that using it ends up losing money. Unlike other error measures such as mean squared error, it’s not appropriate to simply take the absolute value.</p>
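<p>A toy illustration of this failure mode, with made-up numbers and a lossless battery that fully charges or discharges in a single interval - a forecast that inverts the real price spread instructs the battery to buy high and sell low, pushing the percentage error above 100%:</p>

```python
actual = [100.0, 10.0, 80.0]    # prices fall, then recover
forecast = [10.0, 100.0, 20.0]  # forecast inverts the real spread

# plan optimized on the forecast: charge in interval 0, discharge in interval 1
plan = [-1, +1, 0]
realised = sum(a * p for a, p in zip(plan, actual))  # -100 + 10 = -90

perfect = 70.0  # with perfect foresight: charge at 10, discharge at 80
error = perfect - realised
print(error, 100 * error / perfect)  # more than 100% of the achievable value
```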
<h1 id="full-example">Full Example</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="c1"># price and forecast csv data
</span> <span class="n">data</span> <span class="o">=</span> <span class="s">"""
Timestamp,Trading Price [$/MWh],Predispatch Forecast [$/MWh]
2018-07-01 17:00:00,177.11,97.58039000000001
2018-07-01 17:30:00,135.31,133.10307
2018-07-01 18:00:00,143.21,138.59978999999998
2018-07-01 18:30:00,116.25,128.09559
2018-07-01 19:00:00,99.97,113.29413000000001
2018-07-01 19:30:00,99.71,113.95063
2018-07-01 20:00:00,97.81,105.5491
2018-07-01 20:30:00,96.1,102.99768
2018-07-01 21:00:00,98.55,106.34366000000001
2018-07-01 21:30:00,95.78,91.82700000000001
2018-07-01 22:00:00,98.46,87.45
2018-07-01 22:30:00,91.88,85.65775
2018-07-01 23:00:00,91.69,85.0
2018-07-01 23:30:00,101.2,85.0
2018-07-02 00:00:00,139.55,80.99999
2018-07-02 00:30:00,102.9,75.85762
2018-07-02 01:00:00,83.86,67.86758
2018-07-02 01:30:00,71.1,70.21946
2018-07-02 02:00:00,60.35,62.151
2018-07-02 02:30:00,56.01,62.271919999999994
2018-07-02 03:00:00,51.22,56.79063000000001
2018-07-02 03:30:00,48.55,53.8532
2018-07-02 04:00:00,55.17,53.52591999999999
2018-07-02 04:30:00,56.21,49.57504
2018-07-02 05:00:00,56.32,48.42244
2018-07-02 05:30:00,58.79,54.15495
2018-07-02 06:00:00,73.32,58.01054
2018-07-02 06:30:00,80.89,68.31508000000001
2018-07-02 07:00:00,88.43,85.0
2018-07-02 07:30:00,201.43,119.73926999999999
2018-07-02 08:00:00,120.33,308.88984
2018-07-02 08:30:00,113.26,162.32117
"""</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">StringIO</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="c1"># battery model
</span> <span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="c1"># optimize for actuals
</span> <span class="n">actuals</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"Trading Price [$/MWh]"</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># optimize for forecasts
</span> <span class="n">forecasts</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"Predispatch Forecast [$/MWh]"</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># calculate the variance between accounts
</span> <span class="n">actual_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">actuals</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">forecast_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">forecasts</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">=</span> <span class="n">actual_account</span> <span class="o">-</span> <span class="n">forecast_account</span>
<span class="k">print</span><span class="p">(</span>
<span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">forecast error: $ </span><span class="si">{</span><span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.2</span><span class="n">f</span><span class="si">}</span><span class="s"> pct: </span><span class="si">{</span><span class="mi">100</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span> <span class="o">/</span> <span class="n">actual_account</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.1</span><span class="n">f</span><span class="si">}</span><span class="s"> %"</span>
<span class="p">)</span>
<span class="s">"""
forecast error: $ 92.97 pct: 28.5 %
"""</span>
</code></pre></div></div>
<h1 id="summary">Summary</h1>
<p>This post introduces a method for measuring forecast accuracy using linear optimization of electric battery storage, by looking at the difference between two optimization runs given actual and forecast prices as input.</p>
<hr />
<p>Thanks for reading!</p>
<h1>Mistakes Data Scientists Make</h1>
<p>Adam Green - 2023-02-22</p>
<h1 id="introduction">Introduction</h1>
<p>Patterns exist in the mistakes data scientists make - this article lists some of the most common, made while learning the craft.</p>
<blockquote>
<p>An expert is a person who has made all the mistakes that can be made in a very narrow field.</p>
<p>Niels Bohr</p>
</blockquote>
<p>I’ve learnt from all these mistakes - I hope you can learn from them too.</p>
<h1 id="plot-the-target">Plot the Target</h1>
<p>Prediction separates the data scientist from the data analyst. The data analyst analyzes the past - the data scientist predicts the future.</p>
<p>Using features to predict a target is supervised learning. The target can be either a number (regression) or a category (classification).</p>
<p><strong>Understanding the distribution of the target is a must-do for any supervised learning project</strong>.</p>
<p>The distribution of the target will inform many decisions a data scientist makes, including:</p>
<ul>
<li>what models to consider using</li>
<li>whether scaling is required</li>
<li>if the target has outliers that should be removed</li>
<li>if the target is imbalanced</li>
</ul>
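<p>A quick way to check several of these at once is to summarize the target before any modelling. A minimal sketch - the function name is mine, not from any library:</p>

```python
import pandas as pd


def summarize_target(target: pd.Series) -> None:
    """Print quick diagnostics of a supervised learning target."""
    if pd.api.types.is_numeric_dtype(target):
        # regression - range, spread and shape of the distribution
        print(target.describe())
        print("skew:", target.skew())
    else:
        # classification - check for class imbalance
        print(target.value_counts(normalize=True))
```

Run it on the target column before choosing a model - the output flags outliers, scaling needs and imbalance in one place.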
<h2 id="regression">Regression</h2>
<p>In a regression problem, a data scientist wants to know the following about the target:</p>
<ul>
<li>the minimum & maximum</li>
<li>how normally distributed the target is</li>
<li>if the distribution is multi-modal</li>
<li>if there are outliers</li>
</ul>
<p><strong>A histogram will answer all of these - making it an excellent choice for visualizing the target in regression problems</strong>.</p>
<p>The code below generates a toy dataset - two normal distributions plus repeated outlier values - and plots a histogram:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10000</span><span class="p">),</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10000</span><span class="p">),</span>
<span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">).</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'hist'</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div></div>
<center><img src="/assets/mistakes-data-sci/reg.png" width="50%" /></center>
<p>The histogram shows the two normal distributions and the outliers at -20 and 20 that generated this dataset.</p>
<h2 id="classification">Classification</h2>
<p>In a classification problem, a data scientist wants to know the following about the target:</p>
<ul>
<li>how many classes there are</li>
<li>how balanced are the classes</li>
</ul>
<p><strong>We can answer these questions using a single bar chart</strong>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="s">'awake'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">+</span> <span class="p">[</span><span class="s">'asleep'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">500</span> <span class="o">+</span> <span class="p">[</span><span class="s">'dreaming'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">50</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">data</span><span class="p">).</span><span class="n">value_counts</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">)</span>
</code></pre></div></div>
<center><img src="/assets/mistakes-data-sci/class.png" width="50%" /></center>
<p>The bar chart shows we have three classes, and that the <code class="language-plaintext highlighter-rouge">dreaming</code> class is under-represented.</p>
<h1 id="dimensionality">Dimensionality</h1>
<p><strong>Dimensionality provides structure for understanding the world</strong>. An experienced data scientist learns to see the dimensions of data.</p>
<h2 id="the-value-of-low-dimensional-data">The Value of Low Dimensional Data</h2>
<p>In business, lower dimensional representations are more valuable than high dimensional representations. <strong>Business decisions are made in low dimensional spaces</strong>.</p>
<p>Notice that much of the work of a data scientist is using machine learning to reduce dimensionality:</p>
<ul>
<li>using pixels in an satellite image to predict solar power output,</li>
<li>using wind turbine performance data to estimate the probability of future breakdown,</li>
<li>using customer data to predict customer lifetime value.</li>
</ul>
<p><strong>Each of these outputs can be used by a business in ways the raw data can’t</strong>. Unlike their high dimensional raw data inputs, the lower dimensional outputs can be used to make decisions:</p>
<ul>
<li>solar power output can be used to guide energy trader actions,</li>
<li>a high wind turbine breakdown probability can lead to a maintenance team being sent out,</li>
<li>a low customer lifetime value estimate can lead to less money budgeted for marketing.</li>
</ul>
<p>The above are examples of the interaction between prediction and control. The better you are able to predict the world, the better you can control it.</p>
<p>This is also a working definition of a data scientist - <strong>making predictions that lead to action - actions that change how a business is run</strong>.</p>
<h2 id="the-challenges-of-high-dimensional-data">The Challenges of High Dimensional Data</h2>
<p>The difficulty of working in high dimensional spaces is known as the <strong>curse of dimensionality</strong>.</p>
<p>To understand the curse of dimensionality we need to reason about the <em>space</em> and <em>density</em> of data. We can imagine a dense dataset - a large number of diverse samples within a small space. We can also imagine a sparse dataset - a small number of samples in a large space.</p>
<p>What happens to the density of a dataset as we add dimensions? It becomes less dense, because the data is now more spread out.</p>
<p>However, the decrease of data density with increasing dimensionality is not linear - it’s exponential. <strong>The space becomes exponentially harder to understand as we increase dimensions</strong>.</p>
<p>Why is the increase exponential? Because the new dimension needs to be understood not only in terms of each other dimension (which would be linear) but in terms of the <strong>combination of every other dimension with every other dimension</strong> (which is exponential).</p>
<p>This is the curse of dimensionality - the exponential increase of space as we add dimensions. The code below shows this effect:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">itertools</span>
<span class="k">def</span> <span class="nf">calc_num_combinations</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="n">permutations</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))))</span>
<span class="k">def</span> <span class="nf">test_calc_num_combinations</span><span class="p">():</span>
<span class="s">"""To test it works :)"""</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="p">(((</span><span class="mi">0</span><span class="p">,</span> <span class="p">),</span> <span class="mi">1</span><span class="p">),</span> <span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="mi">2</span><span class="p">),</span> <span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">for</span> <span class="n">data</span><span class="p">,</span> <span class="n">length</span> <span class="ow">in</span> <span class="n">test_data</span><span class="p">:</span>
<span class="k">assert</span> <span class="n">length</span> <span class="o">==</span> <span class="n">calc_num_combinations</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">test_calc_num_combinations</span><span class="p">()</span>
<span class="k">print</span><span class="p">([(</span><span class="n">length</span><span class="p">,</span> <span class="n">calc_num_combinations</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">length</span><span class="p">)))</span> <span class="k">for</span> <span class="n">length</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">11</span><span class="p">)])</span>
<span class="s">"""
[(0, 1),
(1, 1),
(2, 2),
(3, 6),
(4, 24),
(5, 120),
(6, 720),
(7, 5040),
(8, 40320),
(9, 362880),
(10, 3628800)]
"""</span>
</code></pre></div></div>
<p>The larger the size of the space, the more work a machine learning model needs to do to understand it.</p>
<p><strong>This is why adding features with no signal is painful</strong>. Not only does the model need to learn that the feature is noise - it must do so while considering how that noise interacts with every combination of the other columns.</p>
<h2 id="applying-the-curse-of-dimensionality">Applying the Curse of Dimensionality</h2>
<p>Getting a theoretical understanding of dimensionality is step one. <strong>Next is applying it in the daily practice of data science</strong>. Below are a few practical cases where data scientists fail to apply the curse of dimensionality to their own workflow.</p>
<h3 id="too-many-hyperparameters">Too Many Hyperparameters</h3>
<p><strong>Data scientists can waste time doing excessive grid searching</strong> - expensive in both time and compute. The motivation for complex grid searches comes from a good place - the desire for good (or even <em>perfect</em>) hyperparameters.</p>
<p>Yet we now know that adding just one additional search means an exponential increase in models trained - because this new search parameter needs to be tested in combination with every other search parameter.</p>
<p><strong>Another mistake is narrow grid searches</strong> - searching over small ranges of hyperparameters. A logarithmic scale will be more informative than a small linear range:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sklearn.ensemble</span>
<span class="kn">import</span> <span class="nn">sklearn.model_selection</span>

<span class="c1"># this search isn't wide enough
</span><span class="n">useless_search</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">model_selection</span><span class="p">.</span><span class="n">GridSearchCV</span><span class="p">(</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">ensemble</span><span class="p">.</span><span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s">'n_estimators'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">20</span><span class="p">]</span><span class="p">}</span>
<span class="p">)</span>
<span class="c1"># this search is more informative
</span><span class="n">useful_search</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">model_selection</span><span class="p">.</span><span class="n">GridSearchCV</span><span class="p">(</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">ensemble</span><span class="p">.</span><span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s">'n_estimators'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">]</span><span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Different projects require different amounts of grid searching, over both models and their hyperparameters. I find that I often build two grid searching pipelines:</p>
<ul>
<li>one to compare different models (using the best hyperparameters found so far for each)</li>
<li>one to compare different hyperparameters for a single model</li>
</ul>
<p>I’ll start by comparing models in the first pipeline, then do further tuning on a single model in the second. Once a model is reasonably tuned, its best hyperparameters can be put back into the first pipeline.</p>
<p>The fine tuning of a single model often searches over one parameter at a time (two at most). This keeps the runtime short, and helps develop intuition about the effect each hyperparameter has on model performance.</p>
<h3 id="too-many-features">Too Many Features</h3>
<p>A misconception I had as a junior data scientist was that adding features had no cost. Put them all in and let the model figure it out! We can now easily see the naivety of this - more features have an exponential cost.</p>
<p><strong>This misconception came from a fundamental misunderstanding of deep learning</strong>.</p>
<p>Seeing the results in computer vision, where deep neural networks do all the work of feature engineering from raw pixels, I thought that the same would be true of using neural networks on other data. I was making two mistakes here:</p>
<ul>
<li>not appreciating the useful inductive bias of convolutional neural networks</li>
<li>not appreciating the curse of dimensionality</li>
</ul>
<p>We now know there is an exponential cost to adding more features. This should also change how you look at one-hot encoding, which dramatically increases the size of the space a model needs to understand while making the data less dense.</p>
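<p>A minimal sketch of this cost, using a hypothetical high-cardinality <code class="language-plaintext highlighter-rouge">city</code> column (the column name and sizes are made up):</p>

```python
import pandas as pd

# hypothetical data: one categorical column with 1,000 unique values
df = pd.DataFrame({"city": [f"city-{i}" for i in range(1000)]})

# one-hot encoding replaces the single column with 1,000 binary columns
encoded = pd.get_dummies(df, columns=["city"])
print(df.shape, encoded.shape)  # (1000, 1) (1000, 1000)
```

<p>One column becomes a thousand mostly-zero columns - a much larger, much less dense space.</p>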
<h3 id="too-many-metrics">Too Many Metrics</h3>
<p>In data science projects, performance is judged using metrics such as training or test performance.</p>
<p>In industry, a data scientist will choose metrics that align with the goals of the business. Different metrics have different trade-offs - part of a data scientist’s job is to select the metrics that correlate best with the objectives of the business.</p>
<p>However, it’s common for junior data scientists to report a range of different metrics. For example, on a regression problem they might report three metrics:</p>
<ul>
<li>mean absolute error</li>
<li>mean absolute percentage error</li>
<li>root mean squared error</li>
</ul>
<p>Combine this with reporting both test & train error (or test & train error per cross-validation fold), and the number of metrics becomes too many to glance at and make decisions with.</p>
<p><strong>Pick one metric that best aligns with your business goal and stick with it</strong>. Reduce the dimensionality of your metrics so you can take actions with them.</p>
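<p>As a sketch with made-up numbers - if mean absolute error best matches the business goal, report just that one number:</p>

```python
import numpy as np

# hypothetical actuals & predictions - in practice these come from your model
y_true = np.array([10.0, 12.0, 8.0])
y_pred = np.array([11.0, 10.0, 9.0])

# one metric, computed the same way on train & test
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```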
<h2 id="too-many-models">Too Many Models</h2>
<p>Data scientists are lucky to have access to many high quality implementations of models in open source packages such as <code class="language-plaintext highlighter-rouge">scikit-learn</code>.</p>
<p>This can become a problem when data scientists repeatedly train a suite of models without a deliberate reason why these models should be looked at in parallel. Linear models are trained over and over, without ever seeing the light outside a notebook.</p>
<p>Quite often I see a new data scientist train a linear model, an SVM and a random forest. An experienced data scientist will just train a tree based ensemble (a random forest or XGBoost), and focus on using the feature importances to either engineer or drop features.</p>
<p><strong>Why are tree based ensembles a good first model?</strong> A few reasons:</p>
<ul>
<li>they can be used for either regression or classification,</li>
<li>no scaling of target or features required,</li>
<li>training can be parallelized across CPU cores,</li>
<li>they perform well on tabular data,</li>
<li>feature importances are interpretable.</li>
</ul>
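<p>A minimal sketch of this workflow on synthetic data, where only the first of three features carries signal (the data and sizes are made up):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data - only the first feature predicts the target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# importances sum to 1.0 - the two noise features can be dropped
print(model.feature_importances_)
```

<p>The same model family handles classification via <code class="language-plaintext highlighter-rouge">RandomForestClassifier</code>, with no scaling of the features required.</p>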
<h1 id="learning-rate">Learning Rate</h1>
<p>If there is one hyperparameter worth searching over when training neural networks, it is the learning rate (the second is batch size). <strong>Setting the learning rate too high will make training of neural networks unstable</strong> - LSTMs especially. What the learning rate does is quite intuitive - a higher learning rate means faster training.</p>
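<p>The instability is easy to see on a toy problem. A sketch using plain gradient descent on f(w) = w**2, where any learning rate above 1.0 makes each step overshoot the minimum by more than the step before:</p>

```python
def fit(lr, steps=50):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# a moderate learning rate converges towards the minimum at w=0
stable = fit(lr=0.1)
# too high a learning rate overshoots more each step - training diverges
unstable = fit(lr=1.1)
print(stable, unstable)
```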
<p><strong>Batch size is less intuitive</strong> - a smaller batch size will mean high variance gradients, but some of the value of batches is using that variance to break out of local minima. In general, batch size should be as large as possible to improve gradient quality - often it is limited by GPU memory.</p>
<h1 id="where-error-comes-from">Where Error Comes From</h1>
<p>Three sources of error are:</p>
<ul>
<li>sampling error - using statistics estimated on a subset of a larger population,</li>
<li>sampling bias - samples having different probabilities than others,</li>
<li>measurement error - difference between measurement & true value.</li>
</ul>
<p>Actually quantifying these is challenging, often impossible. <strong>However there is still value in thinking qualitatively about the sampling error, sampling bias or measurement error in your data</strong>.</p>
<p>Another useful concept is independent & identically distributed (IID). IID is the assumption that data is:</p>
<ul>
<li>independently sampled (no sampling bias),</li>
<li>identically distributed (no sampling or measurement error).</li>
</ul>
<p>It’s an assumption made in statistical learning about the quality of the distribution and sampling of data - and it’s almost always broken.</p>
<p>Thinking about the differences in sampling & distribution between your training and test data can help improve the generalization of a machine learning model, before it fails to generalize in production.</p>
<h1 id="bias--variance">Bias & Variance</h1>
<p>Prediction error of a supervised learning model has three components - bias, variance and noise.</p>
<p><strong>Bias is a lack of signal</strong> - the model misses relationships that can be used to predict the target. This is underfitting. Bias can be reduced by increasing model capacity (either through more layers / trees, a different architecture or more features).</p>
<p><strong>Variance is confusing noise for signal</strong> - patterns in the training data that will not appear in the data at test time. This is overfitting. Variance can be reduced by adding training data.</p>
<p><strong>Noise is unmanageable</strong> - the best a model can do is avoid it.</p>
<p>The error of a machine learning model is usually due to a combination of all three. Often data scientists will be able to make changes that lead to a trade off between bias & variance. Three common levers a data scientist can pull are:</p>
<ul>
<li>adding model capacity,</li>
<li>reducing model capacity,</li>
<li>adding training data.</li>
</ul>
<h2 id="adding-model-capacity">Adding Model Capacity</h2>
<p>Increasing model capacity will reduce bias, but can increase variance (that additional capacity can be used to fit to noise).</p>
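<p>A sketch of this trade-off using polynomial fits on synthetic data - training error always falls as capacity (the polynomial degree) grows, but the high-degree model is spending that capacity fitting the noise:</p>

```python
import numpy as np

# noisy samples of a sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

# training error shrinks as polynomial degree (model capacity) grows
train_errors = {}
for degree in (1, 3, 15):
    coefs = np.polyfit(x, y, degree)
    train_errors[degree] = np.mean((np.polyval(coefs, x) - y) ** 2)
print(train_errors)
```

<p>On held-out data the degree 15 fit would typically do worse than the degree 3 fit - the extra capacity has been used to memorize noise.</p>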
<h2 id="reducing-model-capacity">Reducing Model Capacity</h2>
<p>Decreasing model capacity (through regularization, dropout or a smaller model) will reduce variance but can increase bias.</p>
<h2 id="adding-data">Adding Data</h2>
<p>More data will reduce variance, because the model has more examples to learn how to separate noise from signal.</p>
<p>More data will have no effect on bias. <strong>More data can even make bias worse</strong>, if the sampling of the additional data is biased (sampling bias).</p>
<p>Additional data sampled with bias will only give your model the chance to be more precise about being wrong - see Chris Fonnesbeck’s talk on <a href="https://www.youtube.com/watch?v=TGGGDpb04Yc">Statistical Thinking for Data Science</a> for more on the relationship between bias, sampling bias and data quantity.</p>
<h1 id="width--depth-of-neural-nets">Width & Depth of Neural Nets</h1>
<p>The reason why junior data scientists obsess over the architecture of fully connected neural networks comes from the process of building them. Constructing a neural network requires defining the architecture - surely it’s important?</p>
<p><strong>Yet when it comes to fully connected neural nets, the architecture isn’t really important</strong>.</p>
<p>As long as you give the model enough capacity and sensible hyperparameters, a fully connected neural network will be able to learn the same function with a variety of architectures. Let your gradients work with the capacity you give them.</p>
<p>Case in point is <em>Trust Region Policy Optimization</em>, which learns locomotion tasks from a flat input vector using a simple, fully connected feedforward network as the policy.</p>
<center><img width="80%" src="/assets/mistakes-data-sci/trpo.png" /></center>
<center><a href="https://arxiv.org/abs/1502.05477">Schulman et al. (2015) Trust Region Policy Optimization</a></center>
<p>A sensible default for a fully connected neural network is a depth of two or three layers, with the width set between 50 and 100 (or 64 to 128, if you want to fit in with the cool computer science folk). If your model has high bias, consider adding capacity through another layer or additional width.</p>
<p>One interesting improvement on the simple fully connected architecture is the wide & deep architecture, <strong>which mixes wide memorization feature interactions with deep unseen, learned feature combinations</strong>.</p>
<center><img src="/assets/mistakes-data-sci/wide-deep.png" /></center>
<center><a href="https://arxiv.org/abs/1606.07792">Cheng et al. (2016) Wide & Deep Learning for Recommender Systems</a></center>
<h1 id="pep-8">PEP 8</h1>
<blockquote>
<p>Programs must be written for people to read, and only incidentally for machines to execute.</p>
<p>Abelson & Sussman - Structure and Interpretation of Computer Programs</p>
</blockquote>
<p>Code style is important. I remember being confused at why more experienced programmers were so particular about code style.</p>
<p><strong>After programming for five years, I now know where they were coming from</strong>.</p>
<p>Code that is laid out in the expected way requires less effort to read & understand. Poor code style places an additional burden on the reader - they must first decode your unique style before they can think about the code itself.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># bad
</span><span class="n">Var</span><span class="o">=</span><span class="mi">1</span>
<span class="k">def</span> <span class="nf">adder</span> <span class="p">(</span> <span class="n">x</span> <span class="o">=</span><span class="mi">10</span> <span class="p">,</span><span class="n">y</span><span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span>
<span class="c1"># good
</span><span class="n">var</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">def</span> <span class="nf">adder</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
</code></pre></div></div>
<p>All good text editors will have a way to integrate in-line linting - highlighting mistakes as you write them. <strong>Automatic, in-line linting is the best way to learn code style</strong> - take advantage of it.</p>
<h1 id="drop-the-target">Drop the Target</h1>
<p>If you ever get a model with an impossibly perfect performance, it is likely that your target is a feature.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># bad
</span><span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'target'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># good
</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'target'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>We all do it once.</p>
<h1 id="scale-the-target-or-features">Scale the Target or Features</h1>
<p>This is the advice I’ve given most when debugging machine learning projects. Whenever I see a high loss (higher than, say, 2 or 3), it’s a clear sign that the target has not been scaled to a reasonable range.</p>
<p>Scale matters because <strong>unscaled targets lead to large prediction errors</strong>, which mean large gradients and unstable learning.</p>
<p>By scaling, I mean either standardization:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">standardized</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>Or normalization:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">normalized</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
</code></pre></div></div>
<p>Note that there is a lack of consistency between what these things are called - normalization is also often called min-max scaling, or even standardization!</p>
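<p>The same inconsistency shows up in scikit-learn, where standardization is <code class="language-plaintext highlighter-rouge">StandardScaler</code> and normalization (min-max scaling) is <code class="language-plaintext highlighter-rouge">MinMaxScaler</code>:</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[10.0], [20.0], [30.0]])

# standardization: zero mean, unit standard deviation
standardized = StandardScaler().fit_transform(data).ravel()
# normalization / min-max scaling: mapped onto [0, 1]
normalized = MinMaxScaler().fit_transform(data).ravel()
print(standardized, normalized)
```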
<p>Take the example below, where we are trying to predict how many people attend a talk, from the number of speakers and the start time. Our first pipeline doesn’t scale the features or targets, leading to a large error signal and large gradients:</p>
<center><img src="/assets/mistakes-data-sci/scale1.png" width="900" /></center>
<p>Our second pipeline takes the time to properly scale features & target, leading to an error signal with appropriately sized gradients:</p>
<center><img src="/assets/mistakes-data-sci/scale2.png" width="900" /></center>
<p>A similar logic holds for features - unscaled features can dominate and distort how information flows through a neural network.</p>
<h1 id="work-with-a-sample">Work with a Sample</h1>
<p>This is a small workflow improvement that leads to massive productivity gains.</p>
<p>Development is a continual cycle of running code, hitting errors and fixing them. Developing against a large dataset can cost you time - especially if you’re debugging something that happens at the end of the pipeline.</p>
<p><strong>During development, work on a small subset of the data</strong>. There are a few ways to handle this.</p>
<h2 id="creating-a-subset-of-the-data">Creating a Subset of the Data</h2>
<p>You can work on a sample of your data already in memory, using an integer index:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">]</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">pandas</code> allows you to load only a subset of the data at a time (avoiding pulling the entire dataset into memory):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="controlling-the-debugging">Controlling the Debugging</h2>
<p>A simple way to control this is a variable - this is what you would do in a Jupyter Notebook:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nrows</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="n">nrows</span><span class="p">)</span>
</code></pre></div></div>
<p>Or more cleanly with a command line argument:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># data.py
</span><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--nrows'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'?'</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">nrows</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'loaded </span><span class="si">{</span><span class="n">data</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s"> rows'</span><span class="p">)</span>
</code></pre></div></div>
<p>Which can be controlled when running the script <code class="language-plaintext highlighter-rouge">data.py</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python data.py <span class="nt">--nrows</span> 1000
</code></pre></div></div>
<h1 id="dont-write-over-raw-data">Don’t Write over Raw Data</h1>
<p>Raw data is holy - it should never be overwritten. The results of any data cleaning should be saved separately from the raw data.</p>
<h1 id="use-home">Use $HOME</h1>
<p>This pattern has dramatically simplified my life.</p>
<p><strong>Managing paths in Python can be tricky</strong>. A few things can change how paths resolve in Python:</p>
<ul>
<li>where the user clones source code,</li>
<li>where a virtual environment installs that source code,</li>
<li>which directory a user runs a script from.</li>
</ul>
<p>Some of the problems that occur are from these changes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">os.path.realpath</code> will change based on where the virtual environment installs your package,</li>
<li><code class="language-plaintext highlighter-rouge">os.getcwd</code> will change based on where the user runs the Python interpreter.</li>
</ul>
<p><strong>Putting data in a fixed, consistent place can avoid these issues</strong> - you don’t ever need to get the directory relative to anything except the user’s <code class="language-plaintext highlighter-rouge">$HOME</code> directory.</p>
<p>The solution is to create a folder in the user’s <code class="language-plaintext highlighter-rouge">$HOME</code> directory, and use it to store data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">home</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'HOME'</span><span class="p">]</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">home</span><span class="p">,</span> <span class="s">'adg'</span><span class="p">)</span>
<span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">'data.npy'</span><span class="p">),</span> <span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>This means your work is portable - both on your colleagues’ laptops and on remote machines in the cloud.</p>
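<p>A minimal sketch of the same pattern using <code class="language-plaintext highlighter-rouge">pathlib</code> from the standard library (the folder name <code class="language-plaintext highlighter-rouge">adg</code> follows the example above):</p>

```python
from pathlib import Path

# build a path anchored at the user's home directory
path = Path.home() / "adg"
# create the folder if it doesn't already exist
path.mkdir(parents=True, exist_ok=True)
```

<p><code class="language-plaintext highlighter-rouge">Path.home()</code> resolves the home directory on any operating system, which avoids reading the <code class="language-plaintext highlighter-rouge">$HOME</code> environment variable directly.</p>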
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comBadges of honour for the accomplished data scientist.Daniel C. Dennett’s Four Competences2023-02-21T00:00:00+00:002023-02-21T00:00:00+00:00https://adgefficiency.com/four-competences<p>In <a href="https://en.wikipedia.org/wiki/From_Bacteria_to_Bach_and_Back">From Bacteria to Bach and Back</a> Daniel C. Dennett introduces <strong>four grades of competence</strong>.</p>
<p>They describe four <strong>progressively competent intelligences</strong>. Each competence learns through iterative application of trial and error learning.</p>
<center>
<img src="/assets/world-models/bach-bacteria.jpg" width="50%" />
</center>
<p></p>
<p>The four competences are an invaluable idea for understanding computational control algorithms.</p>
<p>They organize computational control algorithms by asymptotic performance and sample efficiency - the least efficient algorithms have lower limits on performance.</p>
<h3 id="what-is-competence">What is Competence?</h3>
<p><strong>Competence is the ability to act well</strong>. It is the ability of an agent to interact with its environment to achieve goals.</p>
<p><strong>Competence can be contrasted with comprehension, which is the ability to understand</strong>. Together both form a useful decomposition of intelligence.</p>
<p>Competence allows an agent to do control - to interact with a system and produce a desired outcome.</p>
<h3 id="evolutionary-learning">Evolutionary Learning</h3>
<blockquote>
<p>Maybe it would be good for hackers to act more like painters, and regularly start over from scratch</p>
<p>Paul Graham</p>
</blockquote>
<p>Evolutionary learning is trial and error learning.</p>
<p><strong>It is iterative improvement using a generate, test, select loop</strong>:</p>
<ul>
<li><strong>generate</strong> a population, using information from previous steps</li>
<li><strong>test</strong> the population through interaction with the environment</li>
<li><strong>select</strong> population members of the current generation for the next step</li>
</ul>
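<p>The generate, test, select loop can be sketched in a few lines of Python - this is an illustration of the idea only, with a made-up fitness function and hyperparameters:</p>

```python
import random


def evolve(fitness, generations=100, population_size=20, sigma=0.5):
    # generate an initial population
    population = [random.uniform(-10, 10) for _ in range(population_size)]
    for _ in range(generations):
        # test the population through the fitness function
        ranked = sorted(population, key=fitness, reverse=True)
        # select the fittest quarter as parents of the next generation
        parents = ranked[: population_size // 4]
        population = [p + random.gauss(0, sigma) for p in parents for _ in range(4)]
    return max(population, key=fitness)


random.seed(42)
# the optimum of this fitness function is at x = 3
best = evolve(lambda x: -(x - 3) ** 2)
```

<p>The only information carried between generations is which population members survived selection - a weak but substrate independent learning signal.</p>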
<p>It is the driving force in our universe and is substrate independent. It occurs in biological evolution, business, training neural networks, and personal development.</p>
<p>There is much to learn from evolutionary learning:</p>
<ul>
<li>failure at a low level driving improvement at a higher level</li>
<li>the effectiveness of iterative improvement</li>
<li>the need for a dualistic (agent and environment) view for it to work, at odds with the truth of non-duality</li>
</ul>
<p>These are lessons to explore another time - for now, we are focused on the four grades of competence.</p>
<h3 id="comparing-competence">Comparing Competence</h3>
<p>There are several metrics we can use to compare our intelligent agents.</p>
<p><strong>Asymptotic performance measures how an agent performs given unlimited opportunity to sample experience</strong>. It is how good an agent can be in the limit and improves as our agent gains more complex competences.</p>
<p><strong>Sample efficiency measures how much experience an agent needs to achieve a level of performance</strong>. This also improves as our agents get more complex. The importance of sample efficiency depends on compute cost. If compute is cheap, you care less about sample efficiency.</p>
<p>Each of the four agents interacts with the same environment. Interacting with the environment allows an agent to generate data through experience. What the agent does with this data determines how much data it needs. The more an agent squeezes out of each interaction, the less data is required.</p>
<h2 id="the-four-competences">The Four Competences</h2>
<p>The four competences are successive applications of evolutionary learning - this means that each agent has all the abilities of the less competent agents before it.</p>
<center>
<img src="/assets/four-competences/compt.png" width="80%" />
</center>
<p></p>
<h3 id="1-darwinian-competence">1. Darwinian Competence</h3>
<p>The Darwinian agent has pre-designed and fixed competences - it doesn’t improve within its lifetime.</p>
<p>Improvement happens globally via selection that aggregates across the agent’s entire lifetime.</p>
<p>Biological examples include bacteria and viruses. Computational examples include <a href="https://en.wikipedia.org/wiki/Cross-entropy_method">CEM</a>, evolutionary algorithms such as <a href="https://en.wikipedia.org/wiki/CMA-ES">CMA-ES</a> or genetic algorithms.</p>
<h3 id="2-skinnerian-competence">2. Skinnerian Competence</h3>
<p>The Skinnerian agent improves its behaviour by learning to respond to reinforcement. It can improve within its lifetime by learning how to map states and actions to reward signals, such as food or dopamine.</p>
<p>Biological examples include neurons and dogs. Computational examples include model-free reinforcement learning, such as <a href="https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning">DQN</a> or <a href="https://arxiv.org/abs/1710.02298">Rainbow</a>. The GPT series of language models has Skinnerian competence.</p>
<h3 id="3-popperian-competence">3. Popperian Competence</h3>
<p>The Popperian agent learns models of its environment - improvement occurs by offline testing of plans with its environment model.</p>
<p>Biological examples include crows and primates. Computational examples include model-based reinforcement learning such as <a href="https://en.wikipedia.org/wiki/AlphaZero">AlphaZero</a> or <a href="https://worldmodels.github.io/">World Models</a>, and classical optimal control.</p>
<h3 id="4-gregorian-competence">4. Gregorian Competence</h3>
<p>The Gregorian agent builds thinking tools, such as arithmetic, constrained optimization, democracy, and computers. Improvement occurs via systematic exploration and higher-order control of mental searches.</p>
<p>The only biological example we have of a Gregorian intelligence is humans. I do not know of a computational method that builds its own thinking tools. Now that we have introduced our four agents, we can compare them.</p>
<h2 id="comparing-the-four-competences">Comparing the Four Competences</h2>
<p>Darwinian agents improve through selection determined by a single number. For biological evolution, this is how many times an animal has mated.</p>
<p>For computational evolution, this is a fitness, such as average reward per episode. These are both weak learning signals. This accounts for the poor sample efficiency of agents with Darwinian competences.</p>
<p>Compare this with the Skinnerian agent, which can improve both through selection and reinforcement. Being able to respond to reinforcement allows within lifetime learning. It has the ability to learn from the temporal structure of the environment. The Skinnerian agent uses this data to learn functions that predict future rewards.</p>
<p>The Popperian agent can further improve within its lifetime by learning models of its world. Generating data from these models can be used for planning, or to produce low dimensional representations of the environment.</p>
<h2 id="summary">Summary</h2>
<p>Daniel C. Dennett’s four grades of competence describe four progressively competent intelligences, that each learn through successive applications of trial and error learning.</p>
<p>It allows understanding of the asymptotic performance and sample efficiency of learning algorithms and highlights two useful dimensions of intelligent agents - what data they use and what they learn from this data.</p>
<p>Of the most competent of our agents, humans are the only biological examples. We have no computational examples.</p>
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comA useful idea to understand computational control algorithms.Space Between Money and the Planet2023-02-08T00:00:00+00:002023-02-08T00:00:00+00:00https://adgefficiency.com/space-between-money-and-the-planet<p>This study proposes the existence of a <strong>tradeoff between monetary gain and carbon emissions reduction</strong> in the dispatch of electric batteries for arbitrage.</p>
<p>Supporting materials for this work are in <a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">adgefficiency/space-between-money-and-the-planet</a>.</p>
<center>
<img src="/assets/space-between/hero.png" />
</center>
<p><br /></p>
<p>A focus on economic profit is demonstrated to not result in maximum carbon savings. <strong>A focus only on wholesale prices often removes the entire carbon benefit</strong> and leads to a carbon emissions increase.</p>
<p>A calculation of the breakeven carbon price necessary to remove the tradeoff between prices and carbon is performed. <strong>This carbon price represents the price needed to align the world where we optimize for monetary gain with the world where we prioritize carbon reduction</strong>.</p>
<p>The calculation of the breakeven carbon price provides an estimate of the market correction required to reconcile the conflicting objectives of financial and environmental performance in the dispatch of electric batteries for arbitrage.</p>
<h1 id="motivation">Motivation</h1>
<h2 id="the-importance-of-battery-storage">The importance of battery storage</h2>
<p>Battery storage is a key technology of the clean energy transition. Batteries enable low carbon, intermittent renewable generation to replace dirty electricity.</p>
<p><strong>Batteries pose a different set of control problems</strong> than other key energy transition technologies like solar or wind.</p>
<p>A battery makes decisions to charge or discharge based on an imperfect view of the world, with competing objectives and value streams.</p>
<p>Once a wind turbine or solar panel is built, operating that asset is straightforward - you generate as much as you can based on the amount of wind or sun available at that moment. There is no decision to make or opportunity cost to trade off - when the resource is available, you use as much as possible.</p>
<h3 id="arbitrage-of-money-and-carbon">Arbitrage of money and carbon</h3>
<p>A common battery operation strategy is arbitrage - the movement of electricity between periods of high and low value.</p>
<p>In the price arbitrage scenario, a battery wants to purchase cheap electricity and sell it at a higher price. A battery that does the opposite, that charges when electricity prices are high and discharges when they are low, will lose money.</p>
<p>A battery that charges with dirty electricity and discharges when electricity is clean increases carbon emissions. Charging increases the load on a dirtier generator, while discharging decreases the load on a cleaner generator.</p>
<h2 id="tradeoff-between-profit-maximization-and-emissions-minimization">Tradeoff between profit maximization and emissions minimization</h2>
<p>Operating a battery requires making decisions to achieve a goal. Two natural goals for a battery are to maximize profit or save carbon.</p>
<p>A central point of this work is that we cannot rely on optimization driven only by price signals to maximize carbon savings.</p>
<p>This view was shared in 2022 by <a href="https://www.economist.com/leaders/2022/02/12/the-truth-about-dirty-assets">The Economist</a>:</p>
<blockquote>
<p>Many funds claim that there is no trade-off between maximising profits and green investing, which seems unlikely for as long as the externalities created by polluting firms are legal and untaxed.</p>
</blockquote>
<h2 id="the-just-make-money-fallacy">The ‘just make money’ fallacy</h2>
<p>In my career I’ve personally held and often encountered the following perspective:</p>
<blockquote>
<p>Environmentally effective climate action must be economically effective - we need to make money in order to save the planet.</p>
</blockquote>
<p>It’s often backed up with the view that renewables are low variable cost generators, able to bid into electricity markets at lower prices than high variable cost generators (like gas and coal).</p>
<p>This viewpoint (and viewpoints similar to it) are convenient - just make money, ignore the carbon side and you are also saving the planet.</p>
<h1 id="methods">Methods</h1>
<p><a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">Experiment source code is here</a>.</p>
<h2 id="experiment-design">Experiment design</h2>
<ol>
<li>Join raw price and carbon intensity data.</li>
<li>Simulate battery with objectives of:
a. profit maximization,
b. carbon emissions minimization,</li>
<li>Compare the economic and carbon benefits of the two objectives.</li>
</ol>
<h3 id="re-run-the-experiment">Re-run the experiment</h3>
<p>Requires Python 3.10+ - the command <code class="language-plaintext highlighter-rouge">make results</code> will re-run the entire experiment including downloading & joining the raw data and running the simulations for price and carbon objectives:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/ADGEfficiency/space-between
<span class="nv">$ </span><span class="nb">cd </span>space-between
<span class="nv">$ </span>make results
</code></pre></div></div>
<h2 id="signals-and-worlds">Signals and worlds</h2>
<p>The key idea in the methodology is to take the difference between two worlds - a world where we optimize for money, and a world where we optimize for carbon.</p>
<p><strong>In an ideal world, we would be able to operate a battery to both make money and save carbon at the same time.</strong> If clean electricity is cheap and dirty electricity is expensive, we can operate our battery to make money, and know that we will also be saving carbon.</p>
<p><strong>In the opposite world, where dirty electricity is cheap and clean electricity is expensive, there is an opportunity cost to saving carbon</strong>. There would be situations where you would need to reduce the environmental benefit of operating your battery in order to make more money.</p>
<p>Below is a scenario where there is an opportunity cost to saving carbon. We can measure the delta between these two worlds in terms of the two things we care about - money and carbon.</p>
<p>Choosing to prioritize money over carbon means we make <code class="language-plaintext highlighter-rouge">$150</code> more than if we optimized for carbon, but we generate <code class="language-plaintext highlighter-rouge">10 tC</code> more than if we optimized for carbon:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Optimize for Money</th>
<th>Optimize for Carbon</th>
<th>Delta</th>
</tr>
</thead>
<tbody>
<tr>
<td>Money saved $</td>
<td>200</td>
<td>50</td>
<td>150</td>
</tr>
<tr>
<td>Carbon saved tC</td>
<td>10</td>
<td>20</td>
<td>10</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td><strong>Carbon Price $/tC</strong></td>
<td>15</td>
</tr>
</tbody>
</table>
<p>Looking at the delta between our two worlds allows us to calculate a carbon price of <code class="language-plaintext highlighter-rouge">15 $/tC</code>. This carbon price is the ratio of money gained by optimizing for money to the carbon saving gained by optimizing for carbon.</p>
<p><strong>We would be giving the market <code class="language-plaintext highlighter-rouge">$150</code> to balance out what we lose when optimizing for carbon, and receive <code class="language-plaintext highlighter-rouge">10 tC</code> of carbon savings in return for our lost money.</strong></p>
<p>This carbon price would be applied in proportion to the carbon intensity of the electricity produced by each market participant.</p>
<p>This price estimates the level of support (via a revenue neutral carbon tax on electricity market participants - of course!) required to counteract the misalignment between the price and carbon signals and worlds.</p>
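<p>The arithmetic above can be sketched as a short function (the helper name is mine, not from the study’s codebase):</p>

```python
def breakeven_carbon_price(money_world, carbon_world):
    """Each world is a tuple of (money saved $, carbon saved tC)."""
    # money gained by optimizing for money instead of carbon
    money_delta = money_world[0] - carbon_world[0]
    # carbon saved by optimizing for carbon instead of money
    carbon_delta = carbon_world[1] - money_world[1]
    return money_delta / carbon_delta


# numbers from the table above
price = breakeven_carbon_price(money_world=(200, 10), carbon_world=(50, 20))
# -> 15.0 $/tC
```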
<h2 id="data">Data</h2>
<p>This study uses data from the Australian National Electricity Market (NEM) from 2014 to end of 2022.</p>
<p>This experiment uses two signals as input interval data - a price signal and a carbon signal.</p>
<p>The price signal is the 5 minute dispatch prices in South Australia. This is a slightly different dataset than the trading price. Dispatch prices were chosen so that the prices (before and after the transition from 30 to 5 minute trading price settlement) are on the same frequency (5 minutes per interval) as the carbon intensity data.</p>
<p>The carbon signal is the 5 minute NEMDE data and NEM generator carbon intensity in South Australia. The NEMDE dataset has data on the marginal carbon generators, which allows calculation of a marginal carbon intensity.</p>
<h2 id="dependencies">Dependencies</h2>
<p>The main third-party Python dependencies of this work are <code class="language-plaintext highlighter-rouge">pandas</code> for data processing, <code class="language-plaintext highlighter-rouge">matplotlib</code> for plotting and <code class="language-plaintext highlighter-rouge">pulp</code> for linear program solving.</p>
<p>This work depends on <a href="">nem-data</a> - a Python CLI for downloading Australian electricity market data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nemdata</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">nemdata</span><span class="p">.</span><span class="n">download</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">"2020-01"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">"2020-02"</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="s">"trading-price"</span><span class="p">)</span>
</code></pre></div></div>
<p>This work depends on <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> - a Python library for optimizing the dispatch of energy assets for profit maximization and carbon emissions reduction:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="c1"># 2.0 MW, 4.0 MWh battery
</span><span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="p">[</span><span class="mf">100.0</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">],</span> <span class="n">freq_mins</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div></div>
<h2 id="battery-model">Battery model</h2>
<p>The battery model is a mixed-integer linear program built in PuLP. It optimizes the charge and discharge of a battery with perfect foresight of future prices and marginal carbon intensities. The roundtrip efficiency of the battery is set at 100%.</p>
<p>The only value stream available to the battery is the arbitrage of electricity or carbon from one interval to another. The battery is optimized in monthly blocks with interval data on a 5 minute frequency.</p>
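<p>The study’s model is a mixed-integer linear program in PuLP. As an illustration only, a small dynamic program over a discretized state of charge captures the same perfect-foresight arbitrage idea - the prices here are made up, and this is not the code used in the experiment:</p>

```python
def optimal_arbitrage(prices, power_mw=2.0, capacity_mwh=4.0):
    """Maximum arbitrage profit with perfect foresight and 100% efficiency."""
    # map state of charge (MWh) to the best profit achievable so far
    best = {0.0: 0.0}
    for price in prices:
        nxt = {}
        for soc, profit in best.items():
            # discharge, hold or charge at full power for one hour
            for action in (-power_mw, 0.0, power_mw):
                new_soc = soc + action
                if 0.0 <= new_soc <= capacity_mwh:
                    # charging costs money, discharging earns it
                    candidate = profit - action * price
                    nxt[new_soc] = max(nxt.get(new_soc, float("-inf")), candidate)
        best = nxt
    return max(best.values())


profit = optimal_arbitrage([50.0, 10.0, 100.0, 20.0, 120.0])
# -> 380.0 (charge at $10 and $20, discharge at $100 and $120)
```

<p>The same idea scales to the carbon objective by swapping prices for marginal carbon intensities.</p>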
<h1 id="results">Results</h1>
<p>Download previously generated results with Python 3 using <code class="language-plaintext highlighter-rouge">make pulls3</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/ADGEfficiency/space-between
<span class="nv">$ </span><span class="nb">cd </span>space-between
<span class="nv">$ </span>make pulls3
</code></pre></div></div>
<p>This pulls previously generated results from S3 using the AWS CLI into <code class="language-plaintext highlighter-rouge">./data</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree <span class="nt">-L</span> 3 ./data
├── database.sqlite
├── dataset.parquet
└── results
├── 08cdcee2-a315-49d8-9207-820a5ad4a0de
│ ├── input-interval-data.parquet
│ ├── interval-data.parquet
│ ├── meta.json
│ └── simulation.parquet
├── 0bd87681-d422-491c-9ad2-3afc1503ab6f
│ ├── input-interval-data.parquet
│ ├── interval-data.parquet
│ ├── meta.json
│ └── simulation.parquet
...
└── fab33244-fc60-456a-8377-f5f73c2700d7
├── input-interval-data.parquet
├── interval-data.parquet
├── meta.json
└── simulation.parquet
</code></pre></div></div>
<h2 id="optimize-for-price-or-carbon">Optimize for price or carbon</h2>
<p>The battery model was optimized on one of two objectives - either price or carbon.</p>
<p>Optimizing for price means the battery will import electricity from the grid at low prices and export it during high prices, leading to an economic saving.</p>
<p>Optimizing for carbon means the battery will import electricity from the grid at low marginal carbon intensity and export it during high marginal carbon intensity, leading to a carbon saving.</p>
<p>Below we compare the optimization of battery for these two objectives - the left optimizes a battery for money, on the right optimizing a battery for carbon:</p>
<p><img src="/assets/space-between-2023/panel.png" alt="" /></p>
<center><figcaption>Comparing the optimization for price (left) and carbon (right).</figcaption></center>
<p><br /></p>
<p>We can observe the full use of the battery charge in both the price and carbon arbitrage simulations.</p>
<h2 id="monthly-profit-and-emissions-benefits">Monthly profit and emissions benefits</h2>
<p>We can look at how our simulations are performing across the entire experiment by grouping our simulations by month.</p>
<p>A negative benefit is a loss. Negative profit means losing money, negative carbon benefit means increasing carbon emissions.</p>
<p>The chart below shows the price & carbon benefit from optimizing our battery for price and carbon for each month:</p>
<p><img src="/assets/space-between-2023/monthly-benefit.png" alt="" /></p>
<center><figcaption>Monthly price & carbon benefits when optimizing for price (left) and carbon (right) from 2014 to end of 2022.</figcaption></center>
<p>The table below summarizes the data across the entire experiment:</p>
<table>
<thead>
<tr>
<th style="text-align: left">objective</th>
<th style="text-align: right">negative_profit</th>
<th style="text-align: right">negative_emissions_benefit</th>
<th style="text-align: right">months</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">carbon</td>
<td style="text-align: right">73.1481</td>
<td style="text-align: right">0</td>
<td style="text-align: right">108</td>
</tr>
<tr>
<td style="text-align: left">price</td>
<td style="text-align: right">0</td>
<td style="text-align: right">87.037</td>
<td style="text-align: right">108</td>
</tr>
</tbody>
</table>
<p>When we optimize for money, we have a negative effect on the environment <code class="language-plaintext highlighter-rouge">87%</code> of the time. When we optimize for carbon, we will lose money <code class="language-plaintext highlighter-rouge">73.1%</code> of the time.</p>
<p>These results are dramatic - changing our objective can often completely remove the benefit we see for the alternate objective.</p>
<h2 id="monthly-carbon-price">Monthly carbon price</h2>
<p>What we are interested in is how these two simulations change together - by taking the difference between the two simulations (one for money, the other for carbon), we can measure how far the space is between them.</p>
<p>The chart below shows the data grouped by month, but this time only shows the delta between our two worlds:</p>
<p><img src="/assets/space-between-2023/monthly.png" alt="" /></p>
<center><figcaption>Monthly deltas from 2014 to end of 2022.</figcaption></center>
<p>The three deltas shown above are:</p>
<ul>
<li><strong>price delta</strong> - the difference between the optimize for money and optimize for carbon worlds in thousands of Australian dollars per month,</li>
<li><strong>carbon delta</strong> - the difference between the optimize for money and optimize for carbon worlds in term of tons of carbon savings per month,</li>
<li><strong>monthly carbon price</strong> - the ratio of our price to carbon deltas.</li>
</ul>
<h2 id="annual-carbon-price">Annual carbon price</h2>
<p>The final chart shows the delta between worlds results grouped by year:</p>
<p><img src="/assets/space-between-2023/annual.png" alt="" /></p>
<center><figcaption>Annual deltas from 2014 to end of 2022.</figcaption></center>
<p>We can observe a few things from the chart above:</p>
<ul>
<li>a carbon price of below <code class="language-plaintext highlighter-rouge">80 $/tC</code> would fully correct for the misalignment between price and carbon signals in all years except 2022,</li>
<li>2022 is an outlier due to both an increased price delta (meaning the electricity market was more valuable for batteries) and a lower carbon delta (due to cleaner electricity).</li>
</ul>
<h1 id="discussion">Discussion</h1>
<h2 id="exploring-carbon-prices">Exploring carbon prices</h2>
<p>A key result of this work is the estimation of the breakeven carbon intensity between our two simulated worlds.</p>
<p>A system where our deltas are <code class="language-plaintext highlighter-rouge">$500</code> and <code class="language-plaintext highlighter-rouge">50 tC</code> results in a carbon price of <code class="language-plaintext highlighter-rouge">10 $/tC</code>.</p>
<p>This carbon price implies that if we adjust our market by collecting this <code class="language-plaintext highlighter-rouge">$500</code> through a carbon price applied to all generation, we could incentivize lower carbon generation to be more competitive at the margin.</p>
<p>This carbon price is a break-even carbon price for the battery - it is what we would have to pay the market to offset the lost revenue of <code class="language-plaintext highlighter-rouge">$500</code>.</p>
<h2 id="more-output-metrics">More Output Metrics</h2>
<p>This study stops at the calculation of a carbon delta, which is decreasing over time. This means that even if the carbon price was increasing, the total cost may be decreasing. The total cost is the carbon delta multiplied by the breakeven carbon price.</p>
<h2 id="effect-of-efficiency--forecast-error-on-carbon-price">Effect of efficiency & forecast error on carbon price</h2>
<p>The optimization done in this work is with perfect foresight. Optimizing with perfect foresight allows us to put an upper limit on both money and carbon savings. In reality, a battery will be operated with imperfect foresight of future prices.</p>
<p>Because we are interested in the ratio between carbon & economic savings, taking the ratio of maximum carbon to maximum economic savings is hopefully useful. The assumption is that the relative dispatch error (in % lost carbon or money) is the same for both objectives.</p>
<h2 id="data-1">Data</h2>
<p>This study uses the 5 minute South Australia dispatch price and the 5 minute NEMDE data for a carbon signal.</p>
<p>Using different price and carbon signals will change the results of this study - this isn’t a fatal criticism but it should reinforce that this study is heavily dependent on the choice of data.</p>
<p>We can add to this the generic but always relevant criticism of anything empirical - you can’t use the past to predict the future.</p>
<h2 id="marginal-versus-average-carbon-intensity">Marginal versus average carbon intensity</h2>
<p>The intensity from the NEMDE data is a marginal intensity, supplied by the NEMDE solver as the slack variable for increasing demand.
By using this signal we are assuming that any actions we took would not change how the market is dispatched - this will be true up to a point (the size of the marginal bid).</p>
<p>The marginal carbon intensity is different from the <a href="https://adgefficiency.com/energy-basics-average-vs-marginal-carbon-emissions/">more commonly reported average carbon intensity</a>. It would be interesting to compare these results with different carbon signals.</p>
<p>It does introduce the question of which intensity is relevant for the accounting.</p>
<h2 id="battery-model-1">Battery model</h2>
<p>The battery model applies a constant roundtrip efficiency onto battery export - in reality efficiency is a non-linear function of state of charge, battery age and temperature.</p>
<p>This study uses a battery configuration of 2 MW power rating with 4 MWh of capacity - other batteries have different ratios of power to energy.</p>
<h2 id="single-value-stream">Single value stream</h2>
<p>Batteries often have access to many value streams, such as network charge savings or grid frequency services. This experiment only considers the arbitrage of wholesale electricity.</p>
<p>Including other value streams will change the size of the delta between our two worlds.</p>
<hr />
<p><strong>Thanks for reading!</strong></p>
<p>If you enjoyed the content of this post, check out <a href="https://adgefficiency.com/energy-py-linear-forecast-quality/">Measuring Forecast Quality using Linear Programming</a>, which uses a linear programming battery model to measure the quality of a forecast.</p>
<p>If you enjoyed the style of this post, check out <a href="https://adgefficiency.com/typical-year-forecasting-electricity-prices/">Typical Year Forecasting of Electricity Prices</a>, which shows how to create low-variance forecasts and estimates of energy project performance.</p>
<p>Supporting materials for this work are in <a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">adgefficiency/space-between-money-and-the-planet</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{green2023spacebetween,
title = "The Space Between Money and the Planet",
author = "Green, Adam Derek",
journal = "adgefficiency.github.io",
year = "2023",
url = "https://adgefficiency.com/space-between-money-and-the-planet/"
}
</code></pre></div></div>Adam Greenadam.green@adgefficiency.comThe opportunity cost for using batteries to reduce carbon emissions.Introducing energy-py-linear2023-01-30T00:00:00+00:002023-01-30T00:00:00+00:00https://adgefficiency.com/intro-energy-py-linear<p>This post introduces <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> - a Python library for optimizing energy assets using mixed integer linear programming (MILP).</p>
<h2 id="why-linear-programming">Why Linear Programming?</h2>
<p>Linear programming is a popular choice for solving many energy industry problems - many energy systems can be modelled as linear, making them suitable for optimization with linear solvers.</p>
<p>Linear programs have the property that if an optimal solution exists, one exists at a vertex of the feasible region - a corner formed by the constraints. This makes solving linear programs fast in practice. The optimization itself is also deterministic - unlike stochastic gradient descent, it doesn’t rely on randomness.</p>
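<p>A toy illustration of this vertex property (pure numpy, not using a linear solver): for a linear objective over a box-shaped feasible region, no interior point can beat the best vertex:</p>

```python
import itertools

import numpy as np

# maximize c @ x over the box 0 <= x_i <= 1 - a tiny linear program
c = np.array([3.0, -1.0])

# the vertices of the feasible region - the corners of the unit square
vertices = np.array(list(itertools.product([0.0, 1.0], repeat=2)))

# the best vertex achieves the optimum of this linear program
best = vertices[np.argmax(vertices @ c)]
print(best)

# random points inside the box never beat the best vertex
interior = np.random.rand(1000, 2)
assert (interior @ c).max() <= best @ c
```

<p>A real solver exploits this property rather than enumerating corners - the simplex method walks from vertex to vertex, improving the objective at each step.</p>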
<h2 id="what-can-energypylinear-do">What can <code class="language-plaintext highlighter-rouge">energypylinear</code> do?</h2>
<ol>
<li>optimize the dispatch of electric batteries, electric vehicle charging and gas fired CHP generators,</li>
<li>optimize for either price or carbon,</li>
<li>calculate the variance between two simulations.</li>
</ol>
<p>You can find the source code for <code class="language-plaintext highlighter-rouge">energypylinear</code> at <a href="https://github.com/ADGEfficiency/energy-py-linear">ADGEfficiency/energy-py-linear</a>.</p>Adam Greenadam.green@adgefficiency.comA Python library for optimizing energy systems using mixed integer linear programming.A Guide to Deep Learning Layers2023-01-23T00:00:00+00:002023-01-23T00:00:00+00:00https://adgefficiency.com/guide-deep-learning<h1 id="summary">Summary</h1>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Intuition</th>
<th>Inductive Bias</th>
<th>When To Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fully connected</td>
<td>Allow all possible connections</td>
<td>None</td>
<td>Data without structure (tabular data)</td>
</tr>
<tr>
<td>2D convolution</td>
<td>Recognizing spatial patterns</td>
<td>Local, spatial patterns</td>
<td>Data with spatial structure (images)</td>
</tr>
<tr>
<td>LSTM</td>
<td>Database</td>
<td>Sequences & memory</td>
<td>Never - use Attention</td>
</tr>
<tr>
<td>Attention</td>
<td>Focus on similarities</td>
<td>Similarity & limit information flow</td>
<td>Data with sequential structure</td>
</tr>
</tbody>
</table>
<h1 id="introduction">Introduction</h1>
<p><strong>This post is about four fundamental neural network layer architectures</strong> - the building blocks that machine learning engineers use to construct deep learning models.</p>
<p>The four layers are:</p>
<ol>
<li>the fully connected layer,</li>
<li>the 2D convolutional layer,</li>
<li>the LSTM layer,</li>
<li>the attention layer.</li>
</ol>
<p>For each layer we will look at:</p>
<ul>
<li><strong>how each layer works</strong>,</li>
<li>the <strong>intuition</strong> behind each layer,</li>
<li>the <strong>inductive bias</strong> of each layer,</li>
<li>what the <strong>important hyperparameters</strong> are for each layer,</li>
<li><strong>when to use</strong> each layer,</li>
<li><strong>how to program</strong> each layer in TensorFlow 2.0.</li>
</ul>
<p>All code examples are built using <code class="language-plaintext highlighter-rouge">tensorflow==2.2.0</code> using the Keras Functional API.</p>
<h2 id="background---what-is-inductive-bias">Background - what is inductive bias?</h2>
<p>A key term in this article is <strong>inductive bias</strong> - a useful term to sound clever and impress your friends.</p>
<p><strong>Inductive bias is the hard-coding of assumptions into the structure of a learning algorithm</strong>. These assumptions make the method <strong>more special purpose, less flexible but more useful</strong>. By hard coding in assumptions about the structure of the data & task, we can learn functions that we otherwise can’t.</p>
<p>Examples of inductive bias in machine learning include margin maximization (classes should be separated by as large a boundary as possible - used in Support Vector Machines) and nearest neighbours (samples close together in feature space are in the same class - used in the k-nearest neighbours algorithm).</p>
<p><strong>A bit of bias is good</strong> - this is a common lesson in machine learning (bias can be traded off for variance). It also holds in reinforcement learning, where unbiased approximations of a high-variance Monte Carlo return perform worse than biased, bootstrapped temporal difference methods.</p>
<p><br /></p>
<h1 id="1-the-fully-connected-layer">1. The Fully Connected Layer</h1>
<p><strong>The fully connected layer is the most general purpose deep learning layer</strong>.</p>
<p>Also known as a dense or feed-forward layer, this layer imposes the <strong>least amount of structure</strong> of any of our layers. It is found in almost all neural networks - if only to control the size & shape of the output layer.</p>
<h2 id="how-does-the-fully-connected-layer-work">How does the fully connected layer work?</h2>
<p>At the heart of the fully connected layer is the artificial neuron - a distant descendant of McCulloch & Pitts’ <em>Threshold Logic Unit</em> of 1943.</p>
<p><strong>The artificial neuron is inspired by the biological neurons in our brains</strong> - however an artificial neuron is a shallow approximation of the complexity of a biological neuron.</p>
<p>The artificial neuron is composed of three sequential steps:</p>
<ol>
<li>weighted linear combination of inputs,</li>
<li>sum across weighted inputs,</li>
<li>activation function.</li>
</ol>
<h3 id="1-weighted-linear-combination-of-inputs">1. Weighted linear combination of inputs</h3>
<p>The strength of the connection between nodes in different layers is controlled by weights - the shape of these weights depends on the number of nodes in the layers on either side. Each node has an additional parameter known as a bias, which can be used to shift the output of the node independently of its input.</p>
<p>The weights and biases are learnt - in modern machine learning, backpropagation is commonly used to find good values of these weights - good values being those that lead to good predictive accuracy of the network on unseen data.</p>
<h3 id="2-sum-across-all-weighted-inputs">2. Sum across all weighted inputs</h3>
<p>After applying the weight and bias, all of the inputs into the neuron are summed together into a single number.</p>
<h3 id="3-activation-function">3. Activation function</h3>
<p>This is then passed through an activation function. The most important activation functions are:</p>
<ul>
<li><strong>linear</strong> - output unchanged,</li>
<li><strong>ReLu</strong> - $0$ if the input is negative, otherwise the input is unchanged,</li>
<li><strong>Sigmoid</strong> - squashes the input to the range $(0, 1)$,</li>
<li><strong>Tanh</strong> - squashes the input to the range $(-1, 1)$.</li>
</ul>
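<p>These four activations can be written in a few lines of numpy:</p>

```python
import numpy as np


def linear(x):
    # output unchanged
    return x


def relu(x):
    # clip negative inputs to zero
    return np.maximum(0.0, x)


def sigmoid(x):
    # squash into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    # squash into (-1, 1)
    return np.tanh(x)


x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```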
<p>The output of the activation function is input to all neurons (also known as nodes or units) in the next layer.</p>
<p><strong>This is where the fully connected layer gets its name - each node is fully connected to the nodes in the layers before & after it</strong>.</p>
<center><img align="center" src="/assets/four-dl-arch/neuron.png" /></center>
<p align="center"><i>A single neuron with a ReLu activation function</i></p>
<p>For the first layer, each node gets its input from the data being fed into the network (each data point is connected to each node). For the last layer, the output is the prediction of the network.</p>
<center><img align="center" src="/assets/four-dl-arch/dense.png" /></center>
<p align="center"><i>The fully connected layer</i></p>
<h2 id="what-is-the-intuition--inductive-bias-of-a-fully-connected-layer">What is the intuition & inductive bias of a fully connected layer?</h2>
<p>The intuition behind all the connections in a fully connected layer is to put <strong>no restriction on information flow</strong>. It’s the intuition of having no intuition.</p>
<p>The fully connected layer imposes no structure and makes no assumptions about the data or task the network will perform. <strong>A neural network built of fully connected layers can be thought of as a blank canvas</strong> - impose no structure and let the network figure everything out.</p>
<h3 id="universal-approximation-except-in-practice">Universal Approximation (Except in Practice)</h3>
<p><strong>This lack of structure is what gives neural networks of fully connected layers (of sufficient depth & width) the ability to approximate any function</strong> - known as the Universal Approximation Theorem.</p>
<p>The ability to learn any function at first sounds attractive. Why do we need any other architecture if a fully connected layer can learn anything?</p>
<p><strong>Being able to learn in theory does not mean we can learn in practice</strong>. Actually finding the correct weights, using the data and learning algorithms (such as backpropagation) we have available may be impractical and unreachable.</p>
<p>The solution to these practical challenges is to use less specialized layers - layers that have assumptions about the data & task they are expected to perform. <strong>This specialization is their inductive bias</strong>.</p>
<h2 id="when-should-i-use-a-fully-connected-layer">When should I use a fully connected layer?</h2>
<p>A fully connected layer is the most general deep learning architecture - it imposes no constraints on the connectivity of each layer.</p>
<p><strong>Use it when your data has no structure that you can take advantage of</strong> - if your data is a flat array (common in tabular data problems), then a fully connected layer is a good choice. Most neural networks will have fully connected layers somewhere.</p>
<p>Fully connected layers are common in reinforcement learning when learning from a flat environment observation.</p>
<p>For example, a network with a single fully connected layer is used in the Trust Region Policy Optimization (TRPO) paper from 2015:</p>
<center><img align="center" width="50%" src="/assets/mistakes-data-sci/trpo.png" /></center>
<p align="center"><i>A fully connected layer being used to power the reinforcement learning algorithm TRPO</i></p>
<p>Fully connected layers are also common as the penultimate & final layers of convolutional neural networks performing classification. The number of units in the fully connected output layer will be equal to the number of classes, with a softmax activation function used to create a distribution over classes.</p>
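<p>A numpy sketch of the softmax used in that output layer - it turns arbitrary scores into a distribution over classes:</p>

```python
import numpy as np


def softmax(logits):
    # subtract the max for numerical stability - doesn't change the result
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# scores for a 3 class problem
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # one probability per class
print(probs.sum())  # always sums to 1.0
```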
<h2 id="what-hyperparameters-are-important-for-a-fully-connected-layer">What hyperparameters are important for a fully connected layer?</h2>
<p>The two hyperparameters you’ll often set in a fully connected layer are the:</p>
<ol>
<li><strong>number of nodes</strong>,</li>
<li><strong>activation function</strong>.</li>
</ol>
<p>A fully connected layer is defined by a number of nodes (also known as units), each with an activation function. While you could have a layer with different activation functions on different nodes, most of the time each node in a layer has the same activation function.</p>
<p>For hidden layers, the <strong>most common choice of activation function is the rectified-linear unit (the ReLu)</strong>. For the output layer, the correct activation function depends on what the network is predicting:</p>
<ul>
<li>regression, target can be positive or negative -> linear (no activation),</li>
<li>regression, target can be positive only -> ReLu,</li>
<li>classification -> Softmax,</li>
<li>control action, bound between -1 & 1 -> Tanh.</li>
</ul>
<h2 id="using-fully-connected-layers-with-the-keras-functional-api">Using fully connected layers with the Keras Functional API</h2>
<p>Below is an example of how to use a fully connected layer with the Keras functional API.</p>
<p>We are using input data shaped like an image, to show the flexibility of the fully connected layer - this requires us to use a <code class="language-plaintext highlighter-rouge">Flatten</code> layer later in the network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="c1"># the least random of all random seeds
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 samples, 32x32 with 3 channels
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">Flatten</span><span class="p">()(</span><span class="n">hidden</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[ 0.23494382, -0.40392348],
[ 0.10658629, -0.31808627],
[ 0.42371386, -0.46299127],
[ 0.34416917, -0.11493915]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p><br /></p>
<h1 id="2-the-2d-convolutional-layer">2. The 2D Convolutional Layer</h1>
<p><strong>If you had to pick one architecture as the most important in deep learning, it’s hard to look past convolution</strong> (see what I did there?).</p>
<p>The winner of the 2012 ImageNet competition, AlexNet, is seen by many as the start of modern deep learning. AlexNet was a deep convolutional neural network, trained on GPU to classify images.</p>
<p>An earlier landmark use of convolution is LeNet-5 in 1998, a 7-layer convolutional neural network developed by Yann LeCun to classify handwritten digits.</p>
<p><strong>The convolutional neural network is the original workhorse of the modern deep learning revolution</strong> - it can be used with text, audio, video and images.</p>
<p>Convolutional neural networks can be used to classify the contents of the image, recognize faces and create captions for images. They are also easy to parallelize on GPU - making them fast to train.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-convolutional-layers">What is the intuition and inductive bias of convolutional layers?</h2>
<p>Convolution itself is a mathematical operation, commonly used in signal processing. The 2D convolutional layer is inspired by our own visual cortex.</p>
<p>The history of using convolution in artificial neural networks goes back decades to the neocognitron, an architecture introduced by Kunihiko Fukushima in 1980, inspired by the work of Hubel & Wiesel.</p>
<p>Work by the neurophysiologists Hubel & Wiesel in the 1950s showed that individual neurons in the visual cortexes of mammals are activated by small regions of vision.</p>
<center><img align="center" width="50%" src="/assets/four-dl-arch/hubel.jpg" /></center>
<p align="center"><i>Hubel & Wiesel</i></p>
<p><strong>A good mental model for convolution is the process of sliding a filter over a signal, at each point checking to see how well the filter matches the signal</strong>.</p>
<p>This checking process is pattern recognition, and is the intuition behind convolution - looking for small, spatial patterns anywhere in a larger space. <strong>The convolution layer has inductive bias for recognizing local, spatial patterns</strong>.</p>
<h2 id="how-does-a-2d-convolution-layer-work">How does a 2D convolution layer work?</h2>
<p>A 2D convolutional layer is defined by the interaction between two components:</p>
<ol>
<li>a 3D image, with shape <code class="language-plaintext highlighter-rouge">(height, width, color channels)</code>,</li>
<li>a 2D filter, with shape <code class="language-plaintext highlighter-rouge">(height, width)</code>.</li>
</ol>
<p>The intuition of convolution is looking for patterns in a larger space.</p>
<p><strong>In a 2D convolutional layer, the patterns we are looking for are filters, and the larger space is an image</strong>.</p>
<h3 id="filters">Filters</h3>
<p><strong>A convolutional layer is defined by its filters</strong>. These filters are learnt - they are equivalent to the weights of a fully connected layer.</p>
<p>Filters in the first layers of a convolutional neural network detect simple features such as lines or edges. Deeper in the network, filters can detect more complex features that help the network perform its task.</p>
<p>To further understand how these filters work, let’s work with a small image and two filters. The basic operation in a convolutional neural network is to use these filters to detect patterns in the image, by performing element-wise multiplication and summing the result:</p>
<center><img align="center" width="75%" src="/assets/four-dl-arch/filters.png" /></center>
<p align="center"><i>Applying different filters to a small image</i></p>
<p><strong>Reusing the same filters over the entire image allows features to be detected in any part of the image - a property known as translation invariance</strong>. This property is ideal for classification - you want to detect a cat no matter where it occurs in the image.</p>
<p>For full-size images (often <code class="language-plaintext highlighter-rouge">32x32</code> or larger), this same basic operation is performed, with the filter passed over the entire image. The output of this operation acts as feature detection for the filters the network has learnt, producing a 2D feature map.</p>
<center><img align="center" src="/assets/four-dl-arch/conv.png" /></center>
<p align="center"><i>A filter producing a filter map by convolving over an image</i></p>
<p>The feature maps produced by each filter are concatenated, resulting in a 3D volume (the length of the third dimension being the number of filters).</p>
<p>The next layer then performs convolution over this new volume, using a new set of learned filters.</p>
<center><img align="center" width="75%" src="/assets/four-dl-arch/map.png" /></center>
<p align="center"><i>The feature maps of multiple filters are concatenated to produce a volume, which is passed to the next layer.</i></p>
<h2 id="2d-convolutional-neural-network-built-using-the-keras-functional-api">2D convolutional neural network built using the Keras Functional API</h2>
<p>Below is an example of how to use a 2D convolution layer with the Keras functional API:</p>
<ul>
<li>the <code class="language-plaintext highlighter-rouge">Flatten</code> layer before the dense layer, to flatten our volume produced by the 2D convolutional layer,</li>
<li>the <code class="language-plaintext highlighter-rouge">Dense</code> layer size of <code class="language-plaintext highlighter-rouge">8</code> - this controls how many classes our network can predict.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Flatten</span><span class="p">,</span> <span class="n">Conv2D</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 images, 32x32 with 3 color channels
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">Conv2D</span><span class="p">(</span><span class="n">filters</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">Flatten</span><span class="p">()(</span><span class="n">conv</span><span class="p">)</span>
<span class="n">feature_map</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.39803684, -0.08939186],
[-0.48165476, -0.28876644],
[-0.32680377, -0.24380796],
[-0.45394567, -0.28233868]], dtype=float32)>
"""</span>
</code></pre></div></div>
<h2 id="what-hyperparameters-are-important-for-a-convolutional-layer">What hyperparameters are important for a convolutional layer?</h2>
<p>The important hyperparameters in a convolutional layer are:</p>
<ul>
<li>the number of filters,</li>
<li>filter size,</li>
<li>activation function,</li>
<li>strides,</li>
<li>padding,</li>
<li>dilation rate.</li>
</ul>
<p>The <strong>number of filters</strong> determines how many patterns each layer can learn. It’s common to have the number of filters increasing with the depth of the network. Filter size is commonly set to <code class="language-plaintext highlighter-rouge">(3, 3)</code>, with a ReLu as the activation function.</p>
<p><strong>Strides can be used to skip</strong> steps in the convolution, resulting in smaller feature maps. Padding allows pixels on the edge of the image to act as if they were in the middle of an image. Dilation allows the filters to operate over a larger area of the image, while still producing feature maps of the same size.</p>
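<p>The effect of these hyperparameters on feature map size can be checked with the standard convolution output-size formula (as used by the major deep learning frameworks):</p>

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding=0, dilation=1):
    # standard formula for the spatial size of a convolution's output
    effective_kernel = dilation * (kernel_size - 1) + 1
    return math.floor((input_size + 2 * padding - effective_kernel) / stride) + 1


print(conv_output_size(32, 3))                         # 30 - no padding shrinks the map
print(conv_output_size(32, 3, padding=1))              # 32 - padding preserves the size
print(conv_output_size(32, 3, stride=2))               # 15 - strides downsample
print(conv_output_size(32, 3, padding=2, dilation=2))  # 32 - dilation widens the view
```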
<h2 id="when-should-i-use-a-convolutional-layer">When should I use a convolutional layer?</h2>
<p>Convolution works when your data has a spatial structure - for example, images have spatial structure in height & width. You can also get this structure from a 1D signal using techniques such as Fourier Transforms, and then perform convolution in the frequency domain.</p>
<p><strong>If you are working with images, convolution is king</strong>. While there is work applying attention-based models to computer vision, convolution’s similarity with our own visual cortex means it is likely to remain relevant for many years to come.</p>
<p>An example of using convolution occurs in DeepMind’s 2015 DQN work. The agent learns to take decisions using pixels - making convolution a strong choice:</p>
<p><img src="/assets/ml_energy/conv.png" alt="" /></p>
<p align="center"><i>Deep convolutional neural network used in the 2015 DeepMind DQN Atari work</i></p>
<p>So what other kinds of structure can data have, other than spatial? Many types of data have a sequential structure - motivating our next two layer architectures.</p>
<p><br /></p>
<h1 id="3-lstm-layer">3. LSTM Layer</h1>
<p>The third of our layers is the LSTM, or Long Short-Term Memory layer. The LSTM is recurrent and <strong>processes data as a sequence</strong>.</p>
<p>Recurrence allows a network to experience the temporal structure of data, such as words in a sentence, or time of day.</p>
<p>A normal neural network receives a single input tensor $x$ and generates a single output tensor $y$. A recurrent architecture differs from a non-recurrent neural network in two ways:</p>
<ol>
<li>both the input $x$ & output $y$ data is <strong>processed as a sequence of timesteps</strong>,</li>
<li>the network has the <strong>ability to remember</strong> information and pass it to the next timestep.</li>
</ol>
<p>The memory of a recurrent architecture is known as the <strong>hidden state</strong> $h$. What the network chooses to pass forward in the hidden state is learnt by the network.</p>
<center><img align="center" src="/assets/four-dl-arch/recurr.png" /></center>
<p align="center"><i>A recurrent neural network</i></p>
<h3 id="entering-the-timestep-dimension">Entering the timestep dimension</h3>
<p>Working with recurrent architectures requires being comfortable with the idea of a <strong>timestep dimension</strong> - knowing how to shape your data correctly is half the battle of working with recurrence.</p>
<p>Imagine we have input data $x$, that is a sequence of integers <code class="language-plaintext highlighter-rouge">[0, 0] -> [2, 20] -> [4, 40]</code>. If we were using a fully connected layer, we could present this data to the network as a flat array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># array([[ 0, 0, 2, 20, 4, 40, 6, 60, 8, 80]])
</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (1, 10)
</span></code></pre></div></div>
<p>Although the sequence is obvious to us, it’s not obvious to a fully connected layer.</p>
<p><strong>All a fully connected layer would see is a list of numbers - the sequential structure would need to be learnt by the network</strong>.</p>
<p>We can restructure our data $x$ to explicitly model this sequential structure, by adding a timestep dimension. <strong>The values in our data do not change - only the shape changes</strong>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">20</span><span class="p">)]).</span><span class="n">T</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
array([[[ 0, 0],
[ 2, 20],
[ 4, 40],
[ 6, 60],
[ 8, 80]]])
"""</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (1, 5, 2)
</span></code></pre></div></div>
<p><strong>Our data $x$ is now structured with three dimensions</strong> - <code class="language-plaintext highlighter-rouge">(batch, timesteps, features)</code>. A recurrent neural network will process the features one timestep at a time, experiencing the sequential structure of the data.</p>
<p>Now that we understand how to structure data to be used with a recurrent neural network, we can take a high-level look at how the LSTM layer works.</p>
<h2 id="how-does-an-lstm-layer-work">How does an LSTM layer work?</h2>
<p>The LSTM was first introduced in 1997 and has formed the backbone of modern sequence-based deep learning models, excelling at challenging tasks such as machine translation. For years the state of the art in machine translation was the seq2seq model, which is powered by the LSTM.</p>
<p>The LSTM is a specific type of recurrent neural network. <strong>The LSTM addresses a challenge that vanilla recurrent neural networks struggled with - the ability to think long term</strong>.</p>
<p>In a recurrent neural network all information passed to the next time step has to fit in a single channel, the hidden state $h$.</p>
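<p>As a rough sketch (the weights and sizes here are made up for illustration, not taken from any library), a single vanilla RNN step shows how everything carried forward must squeeze through the single hidden state $h$:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# one timestep of a vanilla RNN: all memory flows through h
n_features, n_hidden = 2, 8
W = rng.normal(size=(n_hidden, n_features))  # input weights
U = rng.normal(size=(n_hidden, n_hidden))    # recurrent weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h):
    # the only channel carried to the next timestep is h
    return np.tanh(W @ x_t + U @ h + b)

h = np.zeros(n_hidden)
for x_t in np.array([[0, 0], [2, 20], [4, 40]]):
    h = rnn_step(x_t, h)

print(h.shape)
# (8,)
```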
<p><strong>The LSTM addresses the long term memory problem by using two hidden states</strong>, known as the hidden state $h$ and the cell state $c$. Having two channels allows the LSTM to remember over both long and short time horizons.</p>
<p>Internally the LSTM makes use of three gates to control the flow of information:</p>
<ol>
<li>forget gate to determine what information to delete,</li>
<li>input gate to determine what to remember,</li>
<li>output gate to determine what to predict.</li>
</ol>
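<p>A minimal NumPy sketch of one LSTM timestep makes the role of the three gates concrete - the weights and sizes here are illustrative only:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# one LSTM timestep, with illustrative sizes: 2 features, 8 units
n_in, n_units = 2, 8
Wf, Wi, Wo, Wc = (rng.normal(size=(n_units, n_in + n_units)) for _ in range(4))
bf = bi = bo = bc = np.zeros(n_units)

def lstm_step(x_t, h, c):
    z = np.concatenate([x_t, h])
    f = sigmoid(Wf @ z + bf)              # forget gate: what to delete from c
    i = sigmoid(Wi @ z + bi)              # input gate: what to remember
    o = sigmoid(Wo @ z + bo)              # output gate: what to expose as h
    c = f * c + i * np.tanh(Wc @ z + bc)  # long term cell state
    h = o * np.tanh(c)                    # short term hidden state
    return h, c

h, c = np.zeros(n_units), np.zeros(n_units)
h, c = lstm_step(np.array([2.0, 20.0]), h, c)
print(h.shape, c.shape)
# (8,) (8,)
```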
<p>One important architecture that uses LSTMs is seq2seq. The source sentence is fed through an encoder LSTM to generate a fixed length context vector. A second decoder LSTM takes this context vector and generates the target sentence.</p>
<center><img align="center" src="/assets/four-dl-arch/seq2seq.png" /></center>
<p align="center"><i>The seq2seq model</i></p>
<p>For a deeper look at the internals of the LSTM, take a look at the excellent <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a> from colah’s blog.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-an-lstm">What is the intuition and inductive bias of an LSTM?</h2>
<p>A good intuitive model for the LSTM layer is to think about it like a database. <strong>The output, input and forget gates allow the LSTM to work like a database</strong> - matching the <code class="language-plaintext highlighter-rouge">GET</code>, <code class="language-plaintext highlighter-rouge">POST</code> & <code class="language-plaintext highlighter-rouge">DELETE</code> of a REST API, or the create, read and delete operations of a CRUD application.</p>
<p>The forget gate acts like a <code class="language-plaintext highlighter-rouge">DELETE</code>, allowing the LSTM to remove information that isn’t useful. The input gate acts like a <code class="language-plaintext highlighter-rouge">POST</code>, where the LSTM can choose information to remember. The output gate acts like a <code class="language-plaintext highlighter-rouge">GET</code>, where the LSTM chooses what to send back to a user request for information.</p>
<p>A recurrent neural network has an inductive bias for processing data as a sequence, and for storing a memory. The LSTM adds to this a bias for maintaining separate long term and short term memory channels.</p>
<h2 id="using-an-lstm-layer-with-the-keras-functional-api">Using an LSTM layer with the Keras Functional API</h2>
<p>Below is an example of how to use an LSTM layer with the Keras functional API:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">LSTM</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 samples, 3 timesteps, 32 features
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.06428523, 0.3131591 ],
[-0.04120642, 0.3528567 ],
[-0.04273851, 0.37192333],
[ 0.03797218, 0.33612275]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p>You’ll notice we only get one output for each of our four samples - where are the other two timesteps? To get these, we need to use <code class="language-plaintext highlighter-rouge">return_sequences=True</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 3, 2), dtype=float32, numpy=
array([[[-0.08234972, 0.12292314],
[-0.05217044, 0.19100665],
[-0.06428523, 0.3131591 ]],
[[ 0.0381453 , 0.26402596],
[ 0.04725918, 0.34620702],
[-0.04120642, 0.3528567 ]],
[[-0.21114576, 0.08922277],
[-0.02972354, 0.24037611],
[-0.04273851, 0.37192333]],
[[-0.06888272, -0.01702049],
[ 0.0117887 , 0.10608622],
[ 0.03797218, 0.33612275]]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p>It’s also common to want to access the hidden states of the LSTM - this can be done using the argument <code class="language-plaintext highlighter-rouge">return_state=True</code>.</p>
<p>We now get back three tensors - the output of the network, the LSTM hidden state and the LSTM cell state. The shape of the hidden states is equal to the number of units in the LSTM:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">return_state</span><span class="o">=</span><span class="bp">True</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">out</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span><span class="p">])</span>
<span class="n">out</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">hstate</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (4, 8)
</span>
<span class="k">print</span><span class="p">(</span><span class="n">cstate</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (4, 8)
</span></code></pre></div></div>
<p>If you wanted to access the hidden states at each timestep, then you can combine these two and use both <code class="language-plaintext highlighter-rouge">return_sequences=True</code> and <code class="language-plaintext highlighter-rouge">return_state=True</code>.</p>
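<p>A minimal sketch of combining both flags, with input shapes mirroring the earlier examples - note that the final element of the returned sequence is equal to the final hidden state:</p>

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM

tf.random.set_seed(42)
x = np.random.rand(4, 3, 32)

inp = Input(shape=x.shape[1:])
# hidden state at every timestep, plus the final hidden & cell states
seq, hstate, cstate = LSTM(8, return_sequences=True, return_state=True)(inp)
mdl = Model(inputs=inp, outputs=[seq, hstate, cstate])
seq, hstate, cstate = mdl(x)

print(seq.shape)     # (4, 3, 8) - hidden state for each of the 3 timesteps
print(hstate.shape)  # (4, 8) - final hidden state, equal to seq[:, -1]
print(cstate.shape)  # (4, 8) - final cell state
```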
<h2 id="what-hyperparameters-are-important-for-an-lstm-layer">What hyperparameters are important for an LSTM layer?</h2>
<p>For an LSTM layer, the main hyperparameter is the number of units, which determines both the capacity of the layer and the size of the hidden state.</p>
<p>While not a hyperparameter, it can be useful to include gradient clipping when working with LSTMs, to deal with exploding gradients that can occur from the backpropagation through time. It is also common to use lower learning rates to help manage gradients.</p>
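<p>As an illustrative sketch (the values here are placeholders, not recommendations), gradient clipping can be enabled directly on a Keras optimizer via <code class="language-plaintext highlighter-rouge">clipnorm</code>, alongside a lower learning rate:</p>

```python
import tensorflow as tf

# clip the global gradient norm to 1.0, with a conservative learning rate
opt = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
```

The optimizer can then be passed to <code class="language-plaintext highlighter-rouge">model.compile</code> as usual.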
<h2 id="when-should-i-use-an-lstm-layer">When should I use an LSTM layer?</h2>
<p>In 2023, the answer to this is never. If you have the kind of sequential data that suits an LSTM, you should look at using attention instead.</p>
<p>When working with sequence data, an LSTM (or its close cousin the GRU) used to be the best choice. <strong>One major downside of the LSTM is that it is slow to train</strong>, as <strong>the error signal must be backpropagated through time</strong>. Backpropagating through an LSTM cannot be parallelized.</p>
<p>One useful feature of the LSTM is the learnt hidden state. This can be used by other models as a compressed representation of the future - such as in the <a href="https://adgefficiency.com/world-models/">2017 World Models paper</a>.</p>
<p><br /></p>
<h1 id="4-attention-layer">4. Attention Layer</h1>
<p>Attention is the youngest of our four layers.</p>
<p><strong>Since its introduction in 2015, attention has revolutionized natural language processing</strong>. Attention powers some of the most breathtaking achievements in deep learning, such as the GPT-X series of language models.</p>
<p>First used in combination with the LSTM based seq2seq model, attention powers the Transformer - a neural network architecture that forms the backbone of modern language models.</p>
<p><strong>Attention acts as a sequence model without recurrence</strong> - by avoiding the need to do backpropagation through time, attention can be parallelized on a GPU, which makes it fast to train.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-attention-layers">What is the intuition and inductive bias of attention layers?</h2>
<p>Attention is a simple and powerful idea - when processing a sequence, we should choose which parts of the sequence to take information from. The intuition is simple - <strong>some parts of a sequence are more important than others</strong>.</p>
<p>Take the example of machine translation, to translate the German sentence <code class="language-plaintext highlighter-rouge">Ich bin eine Maschine</code> into the English <code class="language-plaintext highlighter-rouge">I am a machine</code>.</p>
<p>When predicting the last word in the translation <code class="language-plaintext highlighter-rouge">machine</code>, all of our attention should be placed on the last word of the source sentence <code class="language-plaintext highlighter-rouge">Maschine</code>. There is no point looking at earlier words in the source sequence when translating this token.</p>
<p>If we take a more complex example of translating the German <code class="language-plaintext highlighter-rouge">Ich habe ein bisschen Deutsch gelernt</code> into the English <code class="language-plaintext highlighter-rouge">I have learnt a little German</code>. When predicting the third token of our English sentence (<code class="language-plaintext highlighter-rouge">learnt</code>), attention should be placed on the last token of the German sentence (<code class="language-plaintext highlighter-rouge">gelernt</code>).</p>
<center><img align="center" src="/assets/four-dl-arch/trans.png" /></center>
<p>So what inductive bias does our attention layer give us? <strong>One inductive bias of attention is alignment based on similarity</strong> - the attention layer chooses where to look based on how similar things are.</p>
<p><strong>Another inductive bias of attention is to limit & prioritize information flow</strong>. As we will see below, the use of a softmax forces an attention layer to make tradeoffs about information flow - more weight in one place means less in another.</p>
<p>There is no such restriction in a fully connected layer, where increasing one weight does not affect another. A fully connected layer allows information to flow between all nodes in subsequent layers, and could in theory learn the same patterns an attention layer does. We know by now, however, that what is possible in theory does not always occur in practice.</p>
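<p>This tradeoff is easy to demonstrate numerically - because softmax weights must sum to one, raising one alignment score necessarily lowers the other attention weights:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([1.0, 1.0, 1.0])
w_before = softmax(scores)  # equal scores -> equal attention weights

scores[0] += 2.0            # raise one alignment score...
w_after = softmax(scores)   # ...and the other weights must fall

print(w_before, w_after)
```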
<h2 id="how-does-an-attention-layer-work">How does an attention layer work?</h2>
<p>The attention layer receives <strong>three inputs</strong>:</p>
<ol>
<li><strong>query</strong> = what we are looking for,</li>
<li><strong>key</strong> = what we compare the query with,</li>
<li><strong>value</strong> = what we place attention over.</li>
</ol>
<p>The attention layer can be thought of as <strong>three mechanisms in sequence</strong>:</p>
<ol>
<li><strong>alignment</strong> (or similarity) of a query and keys</li>
<li><strong>softmax</strong> to convert the alignment into a probability distribution</li>
<li><strong>selecting keys</strong> based on the alignment</li>
</ol>
<center><img align="center" src="/assets/four-dl-arch/attention.png" /></center>
<p align="center"><i>The three steps in an attention layer - alignment, softmax & key selection</i></p>
<p>Different attention layers (such as Additive Attention or Dot-Product Attention) use different mechanisms in the alignment step. The softmax & key selection steps are common to all attention layers.</p>
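<p>The three steps can be sketched in a few lines of NumPy, using a dot-product for the alignment step - all the shapes here are illustrative:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(42)
query = rng.normal(size=(4,))     # what we are looking for
keys = rng.normal(size=(5, 4))    # what we compare the query with
values = rng.normal(size=(5, 3))  # what we place attention over

scores = keys @ query             # 1. alignment via similarity
weights = softmax(scores)         # 2. softmax -> probability distribution
output = weights @ values         # 3. weighted selection of values

print(output.shape)
# (3,)
```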
<h3 id="query-key-and-value">Query, key and value</h3>
<p>In the same way that understanding the time-step dimension is a key step in understanding recurrent neural networks, understanding what the query, key & value mean is foundational in attention.</p>
<p>A good analogy is with the Python dictionary. Let’s start with a simple example, where we:</p>
<ul>
<li><strong>look up a query</strong> of <code class="language-plaintext highlighter-rouge">dog</code></li>
<li>to <strong>match with keys</strong> of <code class="language-plaintext highlighter-rouge">dog</code> or <code class="language-plaintext highlighter-rouge">cat</code> with values of <code class="language-plaintext highlighter-rouge">1</code> or <code class="language-plaintext highlighter-rouge">2</code> respectively</li>
<li>and <strong>select the value</strong> of <code class="language-plaintext highlighter-rouge">2</code> based on this lookup of <code class="language-plaintext highlighter-rouge">dog</code></li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="s">'dog'</span>
<span class="c1"># keys = 'cat', 'dog', values = 1, 2
</span><span class="n">database</span> <span class="o">=</span> <span class="p">{</span><span class="s">'cat'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'dog'</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span>
<span class="n">database</span><span class="p">[</span><span class="n">query</span><span class="p">]</span>
<span class="c1"># 2
</span></code></pre></div></div>
<p>In the above example, we find an exact match for our query <code class="language-plaintext highlighter-rouge">'dog'</code>. However, in a neural network, <strong>we are not working with strings - we are working with tensors</strong>. Our query, keys and values are all tensors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">]</span>
<span class="c1"># keys = [0, 0], [0, 1] values = [0], [1]
</span><span class="n">database</span> <span class="o">=</span> <span class="p">{[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]:</span> <span class="p">[</span><span class="mi">1</span><span class="p">]}</span>
</code></pre></div></div>
<p>Now we don’t have an exact match for our query - <strong>instead of an exact match, we can calculate a similarity</strong> (i.e. an alignment) between our query and keys, and return the closest value:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">database</span><span class="p">.</span><span class="n">similarity</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="c1"># [1]
</span></code></pre></div></div>
<p>Small technicality - often the keys are set equal to the values. This simply means that the quantity we are doing the similarity comparison with is also the quantity we will place attention over.</p>
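<p>To make this concrete, here is a hypothetical NumPy sketch of the <code class="language-plaintext highlighter-rouge">database.similarity</code> idea - a dot-product measures similarity, and a softmax blends the values by that similarity:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

query = np.array([0.0, 0.9])
keys = np.array([[0.0, 0.0], [0.0, 1.0]])
values = np.array([[0.0], [1.0]])

weights = softmax(keys @ query)  # similarity of the query to each key
result = weights @ values        # values blended by similarity
print(weights, result)
```

Because the query <code class="language-plaintext highlighter-rouge">[0, 0.9]</code> is closer to the key <code class="language-plaintext highlighter-rouge">[0, 1]</code>, most of the weight falls on its value <code class="language-plaintext highlighter-rouge">[1]</code>.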
<h2 id="attention-mechanisms">Attention mechanisms</h2>
<p>By now we know that an attention layer involves three steps:</p>
<ol>
<li><strong>alignment</strong> based on similarity,</li>
<li><strong>softmax</strong> to create attention weights,</li>
<li><strong>choosing values</strong> based on attention.</li>
</ol>
<p>The second & third steps are common to all attention layers - <strong>the differences all occur in the first step - how the alignment (similarity) is computed</strong>.</p>
<p>We will briefly look at two popular mechanisms - Additive Attention and Dot-Product Attention. For a more detailed look at these mechanisms, have a look at the excellent <a href="https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html">Attention? Attention!</a> by Lilian Weng.</p>
<h3 id="additive-attention">Additive Attention</h3>
<p>The first use of attention (known as Bahdanau or Additive Attention) addressed one of the limitations of the seq2seq model - the use of a fixed length context vector.</p>
<p>As explained in the LSTM section, the basic process in a seq2seq model is to encode the source sentence into a fixed length context vector. The issue is that all of the information from the encoder must pass through this fixed length context vector - information from the entire source sequence is squeezed through it between the encoder & decoder.</p>
<p>In Bahdanau et al. 2015, Additive Attention is used to learn an alignment between all the encoder hidden states and the decoder hidden state. As the sequence is processed, the output of this alignment is used in the decoder to predict the next token.</p>
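<p>The additive alignment score is $v_a^T \tanh(W_a s + U_a h)$ for a decoder state $s$ and encoder state $h$ - below is a NumPy sketch with illustrative sizes:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# illustrative sizes: decoder state of 8, 5 encoder states of 16
d_dec, d_enc, d_att = 8, 16, 32
Wa = rng.normal(size=(d_att, d_dec))
Ua = rng.normal(size=(d_att, d_enc))
va = rng.normal(size=(d_att,))

s = rng.normal(size=(d_dec,))    # current decoder hidden state
h = rng.normal(size=(5, d_enc))  # all encoder hidden states

# additive (Bahdanau) alignment: score = va . tanh(Wa s + Ua h)
scores = np.tanh((Wa @ s) + h @ Ua.T) @ va
weights = softmax(scores)        # attention over the 5 encoder states
context = weights @ h            # context vector fed to the decoder
print(weights.shape, context.shape)
# (5,) (16,)
```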
<h2 id="dot-product-attention">Dot-Product Attention</h2>
<p>A second type of attention is Dot-Product Attention - the alignment mechanism used in the Transformer. Instead of addition, Dot-Product Attention uses the dot-product (computed via matrix multiplication) to measure similarity between the query and the keys.</p>
<p>The dot-product acts like a similarity measure between the query & the keys - below is a small program that plots both the dot-product and the cosine similarity for random data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.spatial.distance</span> <span class="kn">import</span> <span class="n">cosine</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="s">'cosine'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">cosine</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>  <span class="c1"># scipy's cosine() is a distance, so convert to similarity</span>
<span class="n">data</span><span class="p">[</span><span class="s">'dot'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'cosine'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s">'dot'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'cosine'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'dot-product'</span><span class="p">)</span>
</code></pre></div></div>
<center><img align="center" width="80%" src="/assets/four-dl-arch/cosine-dot-product.png" /></center>
<p align="center"><i>The relationship between the cosine similarity and the dot-product of random vectors</i></p>
<h2 id="implementing-a-single-attention-head-with-the-keras-functional-api">Implementing a Single Attention Head with the Keras Functional API</h2>
<p>Dot-Product Attention is important as it forms part of the Transformer. As you can see in the figure below, the Transformer uses multiple heads of Scaled Dot-Product Attention.</p>
<center><img align="center" width="40%" src="/assets/four-dl-arch/head.png" /></center>
<p align="center"><i>The multi-head attention layer used in the Transformer</i></p>
<p>The code below demonstrates the mechanics for a single head without scaling - see
<a href="https://www.tensorflow.org/tutorials/text/transformer">Transformer Model for Language Understanding</a> for a full implementation of a multi-head attention layer & Transformer in Tensorflow 2.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="n">qry</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">q_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">k_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">v_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">capacity</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">q_in</span><span class="p">)</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">k_in</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">v_in</span><span class="p">)</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">transpose_b</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">attention</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">attention</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">q_in</span><span class="p">,</span> <span class="n">k_in</span><span class="p">,</span> <span class="n">v_in</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">score</span><span class="p">,</span> <span class="n">attention</span><span class="p">,</span> <span class="n">output</span><span class="p">])</span>
<span class="n">sc</span><span class="p">,</span> <span class="n">attn</span><span class="p">,</span> <span class="n">out</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">([</span><span class="n">qry</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'query shape </span><span class="si">{</span><span class="n">qry</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'score shape </span><span class="si">{</span><span class="n">sc</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'attention shape </span><span class="si">{</span><span class="n">attn</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'output shape </span><span class="si">{</span><span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="s">"""
query shape (4, 16, 32)
score shape (4, 16, 1)
attention shape (4, 16, 1)
output shape (4, 16, 4)
"""</span>
</code></pre></div></div>
<p>This architecture also works with a different length query (now length <code class="language-plaintext highlighter-rouge">8</code> rather than <code class="language-plaintext highlighter-rouge">16</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">qry</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">sc</span><span class="p">,</span> <span class="n">attn</span><span class="p">,</span> <span class="n">out</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">([</span><span class="n">qry</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'query shape </span><span class="si">{</span><span class="n">qry</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'score shape </span><span class="si">{</span><span class="n">sc</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'attention shape </span><span class="si">{</span><span class="n">attn</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'output shape </span><span class="si">{</span><span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="s">"""
query shape (4, 8, 32)
score shape (4, 8, 1)
attention shape (4, 8, 1)
output shape (4, 8, 4)
"""</span>
</code></pre></div></div>
<h2 id="what-hyperparameters-are-important-in-an-attention-layer">What hyperparameters are important in an attention layer?</h2>
<p>When using attention heads as shown above, hyperparameters to consider are:</p>
<ul>
<li>size of the linear layers used to transform the query, values & keys</li>
<li>the type of attention mechanism (such as additive or dot-product)</li>
<li>how to scale the alignment scores before the softmax (commonly by the square root of the key dimension)</li>
</ul>
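<p>As a minimal numpy sketch (separate from the Keras model above) of dot-product attention with the square-root scaling applied before the softmax:</p>

```python
import numpy as np


def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    # scale the alignment scores by the square root of the key dimension
    d_k = k.shape[-1]
    score = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    attention = softmax(score, axis=-1)
    return attention @ v


# same shapes as the example above - batch of 4, query length 16
q = np.random.rand(4, 16, 32)
k = np.random.rand(4, 1, 32)
v = np.random.rand(4, 1, 32)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 16, 32)
```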
<h2 id="when-should-i-use-an-attention-layer">When should I use an attention layer?</h2>
<p>Attention layers should be considered for <strong>any sequence problem</strong>. Unlike recurrent neural networks, they can be easily parallelized, making training fast. Fast training means either cheaper training, or more training for the same amount of compute.</p>
<p>The Transformer is a sequence model without recurrence (it doesn’t use an LSTM), allowing it to be trained without backpropagation through time.</p>
<p>One additional benefit of an attention layer is being able to use the alignment scores for interpretability - similar to how we can use the hidden state in an LSTM as a representation of the sequence.</p>
<p><br /></p>
<h1 id="summary-1">Summary</h1>
<p>I hope you enjoyed this post and found it useful! Below is a short table summarizing the article:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Intuition</th>
<th>Inductive Bias</th>
<th>When To Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fully connected</td>
<td>Allow all possible connections</td>
<td>None</td>
<td>Data without structure (tabular data)</td>
</tr>
<tr>
<td>2D convolution</td>
<td>Recognizing spatial patterns</td>
<td>Local, spatial patterns</td>
<td>Data with spatial structure (images)</td>
</tr>
<tr>
<td>LSTM</td>
<td>Database</td>
<td>Sequences & memory</td>
<td>Never - use Attention</td>
</tr>
<tr>
<td>Attention</td>
<td>Focus on similarities</td>
<td>Similarity & limit information flow</td>
<td>Data with sequential structure</td>
</tr>
</tbody>
</table>
<hr />
<p><strong>Thanks for reading!</strong></p>
<p>If you enjoyed this post, check out <a href="https://adgefficiency.com/ai-ml-dl/">Artificial Intelligence, Machine Learning and Deep Learning</a>.</p>Adam Greenadam.green@adgefficiency.comExplaining the fully connected, convolutional, LSTM and attention deep learning layer architectures.A Hackers Guide to AEMO & NEM Data2022-12-10T00:00:00+00:002022-12-10T00:00:00+00:00https://adgefficiency.com/hackers-aemo<p>This is a short guide to the electricity grid & market data supplied by the Australian Energy Market Operator (AEMO) for the Australian National Electricity Market (NEM).</p>
<p>The NEM is Australia’s electricity grid covering Queensland, New South Wales, Victoria, South Australia, and Tasmania.</p>
<h1 id="participant-infomation--carbon-intensities">Participant Information & Carbon Intensities</h1>
<p>Market participant information in the NEM is given in the <a href="https://www.aemo.com.au/-/media/Files/Electricity/NEM/Participant_Information/NEM-Registration-and-Exemption-List.xls">NEM Registration and Exemption List</a>:</p>
<p><img src="/assets/hacker_aemo/nem-reg.png" alt="" /></p>
<p>The carbon intensities for generators are given in the <a href="http://www.nemweb.com.au/Reports/CURRENT/CDEII/CO2EII_AVAILABLE_GENERATORS.CSV">Available Generators CDEII file</a>:</p>
<p><img src="/assets/hacker_aemo/nem-carbon.png" alt="" /></p>
<p>Both of these files are linked by a Dispatchable Unit Identifier (DUID), which identifies each generating unit.</p>
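<p>A sketch of joining the two files on DUID with pandas - the rows and column names here are illustrative stand-ins, so check the actual file headers before relying on them:</p>

```python
import pandas as pd

# illustrative stand-ins for the two AEMO files - check the real headers
registration = pd.DataFrame({
    'DUID': ['BW03', 'BW04'],
    'Participant': ['AGL', 'AGL'],
})
carbon = pd.DataFrame({
    'DUID': ['BW03', 'BW04'],
    'CO2E_EMISSIONS_FACTOR': [0.93, 0.94],
})

# the DUID links participant information to carbon intensity
joined = registration.merge(carbon, on='DUID', how='inner')
print(joined)
```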
<h1 id="interval-data">Interval Data</h1>
<p>Interval data for the NEM is provided by two sources: the NEM Dispatch Engine (<a href="http://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/">NEMDE</a>) and the Market Management System Data Model (<a href="http://nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/">MMSDM</a>).</p>
<h2 id="nemde">NEMDE</h2>
<p>The NEMDE dataset provides information about how the grid is dispatched and prices are set (including information about the marginal generator) in the <code class="language-plaintext highlighter-rouge">NemPriceSetter</code> XML files.</p>
<p>Data for each day is provided in a single ZIP file (<a href="https://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/2022/NEMDE_2022_01/NEMDE_Market_Data/NEMDE_Files/NemPriceSetter_20220101_xml.zip">NemPriceSetter_20220101_xml.zip</a>), which contains many XML files:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># NemPriceSetter_20220101_xml/NEMPriceSetter_2022010100100.xml
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"LBBG1"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"6"</span> <span class="na">Increase=</span><span class="s">"1"</span> <span class="na">RRNBandPrice=</span><span class="s">"23.7"</span> <span class="na">BandCost=</span><span class="s">"23.7"</span> <span class="nt">/></span>
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"BW04"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"1"</span> <span class="na">Increase=</span><span class="s">"-0.47368"</span> <span class="na">RRNBandPrice=</span><span class="s">"1"</span> <span class="na">BandCost=</span><span class="s">"-0.473684"</span> <span class="nt">/></span>
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"BW03"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"1"</span> <span class="na">Increase=</span><span class="s">"-0.52632"</span> <span class="na">RRNBandPrice=</span><span class="s">"1"</span> <span class="na">BandCost=</span><span class="s">"-0.526316"</span> <span class="nt">/></span>
</code></pre></div></div>
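<p>These attributes can be pulled out with Python's standard library - a sketch using two of the elements above, wrapped here in a dummy root element (the real files have their own document structure):</p>

```python
import xml.etree.ElementTree as ET

# two PriceSetting elements from the file above, wrapped in a dummy root
xml = """<root>
<PriceSetting PeriodID="2022-01-01T04:05:00+10:00" RegionID="NSW1" Market="Energy" Price="87.69011" Unit="LBBG1" DispatchedMarket="R5RE" BandNo="6" Increase="1" RRNBandPrice="23.7" BandCost="23.7" />
<PriceSetting PeriodID="2022-01-01T04:05:00+10:00" RegionID="NSW1" Market="Energy" Price="87.69011" Unit="BW04" DispatchedMarket="R5RE" BandNo="1" Increase="-0.47368" RRNBandPrice="1" BandCost="-0.473684" />
</root>"""

# each element's attributes become one dict - easy to load into a DataFrame
rows = [el.attrib for el in ET.fromstring(xml).iter('PriceSetting')]
print(rows[0]['Unit'], rows[0]['Price'])  # LBBG1 87.69011
```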
<h2 id="mmsdm">MMSDM</h2>
<p>The MMSDM provides both actual data and forecasts for a range of variables - including prices, demand and electricity flows.</p>
<p>Data in the MMSDM is supplied from three different, overlapping sources:</p>
<ul>
<li><a href="http://www.nemweb.com.au/REPORTS/CURRENT/">CURRENT</a> - last 24 hours,</li>
<li><a href="http://www.nemweb.com.au/REPORTS/ARCHIVE/">ARCHIVE</a> - last 13 months,</li>
<li><a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/">MMSDM</a> - from 2009 until the end of last month.</li>
</ul>
<p>Some report names can be different across sources - for example <code class="language-plaintext highlighter-rouge">DISPATCH_SCADA</code> versus <code class="language-plaintext highlighter-rouge">UNIT_SCADA</code>.</p>
<h2 id="price-structure">Price Structure</h2>
<p>The settlement price in the NEM is known as the <strong>trading price</strong> - it is the price that matters for what generators get paid and what customers pay.</p>
<p>Historically (before October 2021) it was settled on a 30 minute basis, as the average of the six 5 minute <strong>dispatch prices</strong> in the same interval. Since October 2021 the NEM has settled on a 5 minute basis, with the trading price equal to the dispatch price.</p>
<h2 id="aemo-timestamping">AEMO Timestamping</h2>
<p><strong>AEMO timestamps intervals with the time at the end of the interval</strong>. This means that <code class="language-plaintext highlighter-rouge">01/01/2018 14:00</code> refers to the time period <code class="language-plaintext highlighter-rouge">01/01/2018 13:30 - 01/01/2018 14:00</code>. This is true for columns like <code class="language-plaintext highlighter-rouge">SETTLEMENTDATE</code>, which refer to an interval. Columns like <code class="language-plaintext highlighter-rouge">LASTCHANGED</code>, which refer to a single instant in time, are not affected.</p>
<p>I prefer shifting the AEMO time stamp backwards by one step of the index frequency (i.e. 5 minutes). This allows the following to be true:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dispatch_prices</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:30'</span><span class="p">:</span> <span class="s">'01/01/2018 14:00'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="n">trading_price</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:30'</span><span class="p">]</span>
</code></pre></div></div>
<p>The shifting also allows easier alignment with external data sources such as weather, which is usually stamped with the timestamp at the beginning of the interval.</p>
<p>If the AEMO timestamp is not shifted, then the following is true:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dispatch_prices</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:35'</span><span class="p">:</span> <span class="s">'01/01/2018 14:05'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="n">trading_price</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 14:00'</span><span class="p">]</span>
</code></pre></div></div>
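<p>A sketch of this shift in pandas, using made-up dispatch prices - after shifting backwards by 5 minutes, resampling to 30 minutes reproduces the historical trading price:</p>

```python
import pandas as pd

# six 5 minute dispatch prices, stamped by AEMO with the interval end
idx = pd.date_range('2018-01-01 13:35', periods=6, freq='5min')
dispatch_prices = pd.Series([100.0, 110, 90, 105, 95, 100], index=idx)

# shift backwards one step so each stamp marks the interval start
dispatch_prices.index = dispatch_prices.index - pd.Timedelta(minutes=5)

# the 30 minute trading price is the average of the six dispatch prices
trading_price = dispatch_prices.resample('30min').mean()
print(trading_price.loc['2018-01-01 13:30'])  # 100.0
```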
<h2 id="useful-mmsdm-reports">Useful MMSDM Reports</h2>
<p>All examples below are for <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/">MMSDM May 2018</a>:</p>
<p><img src="/assets/hacker_aemo/mmsdm.png" alt="" /></p>
<h3 id="actual-data">Actual Data</h3>
<ul>
<li>trading price (30 & 5 min electricity price) - TRADINGPRICE - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_TRADINGPRICE_201805010000.zip">PUBLIC_DVD_TRADINGPRICE_201805010000.zip</a>,</li>
<li>dispatch price (5 min electricity price) - DISPATCHPRICE - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHPRICE_201805010000.zip">PUBLIC_DVD_DISPATCHPRICE_201805010000.zip</a>,</li>
<li>generation of market participants - UNIT_SCADA - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_201805010000.zip">PUBLIC_DVD_DISPATCH_UNIT_SCADA_201805010000.zip</a>,</li>
<li>market participant bid volumes - BIDPEROFFER - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_BIDPEROFFER_201805010000.zip">PUBLIC_DVD_BIDPEROFFER_201805010000.zip</a>,</li>
<li>market participant bid prices - BIDDAYOFFER - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_BIDDAYOFFER_201805010000.zip">PUBLIC_DVD_BIDDAYOFFER_201805010000.zip</a>,</li>
<li>demand - DISPATCHREGIONSUM - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHREGIONSUM_201805010000.zip">PUBLIC_DVD_DISPATCHREGIONSUM_201805010000.zip</a>,</li>
<li>interconnectors - INTERCONNECTORRES - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHINTERCONNECTORRES_201805010000.zip">PUBLIC_DVD_DISPATCHINTERCONNECTORRES_201805010000.zip</a>.</li>
</ul>
<h3 id="forecasts">Forecasts</h3>
<ul>
<li>trading price forecast - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/PREDISP_ALL_DATA/PUBLIC_DVD_PREDISPATCHPRICE_201805010000.zip">PUBLIC_DVD_PREDISPATCHPRICE_201805010000.zip</a>,</li>
<li>dispatch price forecast - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_P5MIN_REGIONSOLUTION_201805010000.zip">PUBLIC_DVD_P5MIN_REGIONSOLUTION_201805010000.zip</a>.</li>
</ul>
<h1 id="ecosystem">Ecosystem</h1>
<p>A major benefit of the large & open dataset shared by AEMO is the ecosystem tools built on top of it.</p>
<h2 id="nem-data"><a href="https://github.com/ADGEfficiency/nem-data">nem-data</a></h2>
<p>A simple CLI for downloading NEMDE & MMSDM data - created & maintained by yours truly:</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip <span class="nb">install </span>nem-data
<span class="gp">$</span><span class="w"> </span>nemdata <span class="nt">--table</span> trading-price <span class="nt">--start</span> 2020-01 <span class="nt">--end</span> 2020-12
</code></pre></div></div>
<h2 id="nemosis"><a href="https://github.com/UNSW-CEEM/NEMOSIS">NEMOSIS</a></h2>
<p>A Python package for downloading historical data published by the Australian Energy Market Operator (AEMO):</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip <span class="nb">install </span>nemosis
</code></pre></div></div>
<p>Use in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nemosis</span> <span class="kn">import</span> <span class="n">dynamic_data_compiler</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="s">'2017/01/01 00:00:00'</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="s">'2017/01/01 00:05:00'</span>
<span class="n">table</span> <span class="o">=</span> <span class="s">'DISPATCHPRICE'</span>
<span class="n">raw_data_cache</span> <span class="o">=</span> <span class="s">'C:/Users/your_data_storage'</span>
<span class="n">price_data</span> <span class="o">=</span> <span class="n">dynamic_data_compiler</span><span class="p">(</span><span class="n">start_time</span><span class="p">,</span> <span class="n">end_time</span><span class="p">,</span> <span class="n">table</span><span class="p">,</span> <span class="n">raw_data_cache</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="aemo-dashboard---interactive-map"><a href="https://www.aemo.com.au/Electricity/National-Electricity-Market-NEM/Data-dashboard">AEMO Dashboard</a> - <a href="http://www.aemo.com.au/aemo/apps/visualisations/map.html">interactive map</a></h2>
<p><img src="/assets/hacker_aemo/aemo_dashboard.png" alt="" /></p>
<h2 id="electricity-map"><a href="https://www.electricitymap.org/">Electricity Map</a></h2>
<p><img src="/assets/hacker_aemo/elect_map.png" alt="" /></p>
<h2 id="aremi"><a href="https://nationalmap.gov.au/renewables/">AREMI</a></h2>
<p><img src="/assets/hacker_aemo/aremi.png" alt="" /></p>
<h2 id="nem-log"><a href="http://nemlog.com.au/">NEM Log</a></h2>
<p><img src="/assets/hacker_aemo/nemlog.png" alt="" /></p>
<h2 id="open-nem"><a href="https://opennem.org.au/#/all-regions">Open NEM</a></h2>
<p><img src="/assets/hacker_aemo/opennem.png" alt="" /></p>
<h2 id="nem-sight"><a href="http://analytics.com.au/energy-analysis/nemsight-trading-tool/">NEM Sight</a></h2>
<p><img src="/assets/hacker_aemo/nemsight.png" alt="" /></p>
<h2 id="gas--coal-watch"><a href="https://cdn.knightlab.com/libs/timeline3/latest/embed/index.html?source=1k0rmFKexrYUBbHSb2opLO2y-f3lGx2vOUsx8uIFygro&font=Default&lang=en&start_at_end=true&initial_zoom=2&height=650">Gas & Coal Watch</a></h2>
<p><img src="/assets/hacker_aemo/gas_coal_watch.png" alt="" /></p>
<h1 id="further-reading">Further Reading</h1>
<ul>
<li><a href="https://www.aemo.com.au/Electricity/National-Electricity-Market-NEM">NEM on the AEMO website</a>,</li>
<li><a href="https://energy.unimelb.edu.au/news-and-events/news/winds-of-change-an-analysis-of-recent-changes-in-the-south-australian-electricity-market">Winds of change: An analysis of recent changes in the South Australian electricity market - University of Melbourne</a>,</li>
<li><a href="https://eprints.qut.edu.au/98895/">Li, Zili (2016) Topics in deregulated electricity markets. PhD thesis, Queensland University of Technology</a>,</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3126673">Dungey et. al (2018) Strategic Bidding of Electric Power Generating Companies: Evidence from the Australian National Energy Market</a>.</li>
</ul>
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comA simple guide to data provided by AEMO for Australia's National Electricity Market (NEM).Typical Year Forecasting of Electricity Prices2022-12-04T00:00:00+00:002022-12-04T00:00:00+00:00https://adgefficiency.com/typical-year-forecasting-electricity-prices<p><strong>Energy prices are volatile</strong> - the price of gas, oil and electricity can all change significantly year on year. Yet the energy industry <strong>ignores this year on year volatility</strong> when modelling investment decisions in energy projects.</p>
<p>This exposes projects to a significant source of <strong>hidden error</strong> in the form of variance in financial model results, leading to the wrong projects being built.</p>
<p>This post introduces a simple solution to this problem in the form of a <strong>typical year forecast</strong>.</p>
<p>You can find supporting materials for this work at <a href="https://github.com/ADGEfficiency/typical-year-forecasting-electricity-prices">adgefficiency/typical-year-forecasting-electricity-prices</a>.</p>
<h1 id="what-is-a-typical-year-forecast">What is a Typical Year Forecast?</h1>
<p>A <strong>typical year forecast</strong> uses historical data to create a <strong>single, synthetic year of data</strong>.</p>
<p>This single year forecast is suitable for use in <strong>business case modelling of energy projects</strong> - it’s not suitable for short term dispatch of energy assets.</p>
<p>A typical year forecast has the following advantages:</p>
<ul>
<li><strong>simple to create</strong> - no machine learning, gradients or iterative calculations,</li>
<li><strong>interpretable</strong> - easy to understand why one sample is selected over others,</li>
<li><strong>realistic</strong> - the forecast is made from real historical data,</li>
<li><strong>domain flexible</strong> - can be used with any time series,</li>
<li><strong>statistically flexible</strong> - can use a range of statistics to define what typical means.</li>
</ul>
<p>A typical year forecast has the following disadvantages:</p>
<ul>
<li><strong>data quantity</strong> - requires at least 2 years of historical data,</li>
<li><strong>domain knowledge</strong> - requires selection & weighting of statistics.</li>
</ul>
<p>An example of a typical year forecast is a <strong>typical meteorological year</strong> (TMY) forecast, used to create a dataset of a typical year of weather. TMY forecasts are commonly used in modelling solar generation or building energy use.</p>
<p>The idea & inspiration for this post came from using the <a href="https://solcast.com/tmy">TMY forecast produced by Solcast</a> - thanks Solcast for the inspiration!</p>
<h1 id="the-problem-with-the-standard-industry-approach">The Problem with the Standard Industry Approach</h1>
<p>Estimating the economic performance (simple payback, IRR, NPV or rate of return on capital) of an investment in an energy project requires combining two models - <strong>a technical model and a financial model</strong>.</p>
<p>Commonly the technical model will model a single year in isolation, and is used as an input to the financial model.</p>
<p>The financial model covers <strong>multiple years</strong> (to capture economic return over time), using the technical results as the basis for the first year, with financial inputs (such as prices) forecast forward from that single year.</p>
<p>In the absence of forecasted energy prices across the future project lifetime, <strong>energy prices are often modelled in a similar way to the technical model</strong> - taking a single reference year of prices and forecasting them forward with assumptions of inflation.</p>
<p>A simple example of how a technical & financial model combine is given below:</p>
<ul>
<li>a technical model outputs annual savings of <code class="language-plaintext highlighter-rouge">150 MWh</code> of electricity,</li>
<li>we assume electricity prices at <code class="language-plaintext highlighter-rouge">100 $/MWh</code></li>
<li>capital investment is estimated at <code class="language-plaintext highlighter-rouge">$ 25,000</code>.</li>
</ul>
<p>The technical inputs & price assumptions are then forecast forward (here without inflation) to calculate cumulative savings:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: right">capex</th>
<th style="text-align: right">savings_mwh</th>
<th style="text-align: right">price</th>
<th style="text-align: right">savings_$</th>
<th style="text-align: right">cumulative_savings_$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">0</td>
<td style="text-align: right">25000</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">-10000</td>
</tr>
<tr>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">5000</td>
</tr>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">20000</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">35000</td>
</tr>
</tbody>
</table>
<p>It’s not common to see both the project capex and savings in the same year (usually you need to build something before it gives a saving) - for this simple example please forgive this!</p>
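<p>A minimal sketch of this financial model in Python, reproducing the cumulative savings column above:</p>

```python
from itertools import accumulate

# inputs from the example above
capex = 25_000
savings_mwh = 150
price = 100  # $/MWh, no inflation assumed

# capex is paid in year 0, alongside the first year of savings
annual_saving = savings_mwh * price
cashflows = [annual_saving - capex] + [annual_saving] * 3
cumulative = list(accumulate(cashflows))
print(cumulative)  # [-10000, 5000, 20000, 35000]
```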
<h2 id="why-using-the-most-recent-prices-is-wrong">Why Using The Most Recent Prices is Wrong</h2>
<p>Choosing the reference year for prices is commonly done by:</p>
<ul>
<li>taking the most recent prices,</li>
<li>taking the most recent full calendar year of prices,</li>
<li>taking the prices that align with the technical model.</li>
</ul>
<p>If we were setting up our model in November 2022 with a technical model based on 2019 data, the <strong>standard industry approach</strong> would likely be one of the following:</p>
<ul>
<li>the <strong>most recent prices</strong> - October 2021 to September 2022,</li>
<li>the <strong>most recent calendar year</strong> - January 2021 to December 2021,</li>
<li>prices that <strong>align with the technical data</strong> - January 2019 to December 2019.</li>
</ul>
<p>Below we will demonstrate why all of these commonly used methodologies <strong>introduce a large source of error</strong>.</p>
<h2 id="error-of-using-recent-prices">Error of Using Recent Prices</h2>
<p>In our example above, we assumed prices at <code class="language-plaintext highlighter-rouge">100 $/MWh</code>. The figure below uses the same financial model with the actual annual average electricity prices for South Australia:</p>
<p><img src="/assets/typical-year/f1.png" alt="Project savings versus annual average electricity prices." /></p>
<p><strong>Look at the variance of these results!</strong> Around half of our projects lose money, with the other half being profitable.</p>
<p>This is the variance error that the standard industry approaches hide - normally we only see a single estimate, without the spread across different years of price data.</p>
<p>This variance in project performance is only occurring based on <em>when we do our modelling</em> - not based on the fundamental, underlying economics of the project.</p>
<p><strong>We can do better!</strong></p>
<h1 id="creating-a-typical-year-forecast">Creating a Typical Year Forecast</h1>
<p>Creating a typical year forecast requires defining what typical means.</p>
<p>For these forecasts we will <strong>define typical as similarity</strong> - our typical year forecast will be made of <em>samples of data that are most similar to all the other data</em>.</p>
<p>We can <strong>quantify similarity by defining an error metric</strong> - the error between <strong>statistics measured across all our data and statistics measured across a candidate sample</strong>. The samples that minimize this error will be selected and used in our forecast.</p>
<p>For our first typical year forecast, we will create a forecast based on a single statistic - <strong>the average price within a month</strong>.</p>
<p>The basic idea is as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Creating a Typical Year Forecast based on the Mean with 5 Years of Historical Data.
</span>
<span class="c1"># Iterate across each month in a year (12 months in total).
</span><span class="k">for</span> <span class="n">each</span> <span class="n">month</span> <span class="ow">in</span> <span class="n">a</span> <span class="n">year</span> <span class="p">(</span><span class="n">Jan</span><span class="p">,</span> <span class="n">Feb</span> <span class="p">...</span> <span class="n">Nov</span><span class="p">,</span> <span class="n">Dec</span><span class="p">)</span>
<span class="c1"># Calculate one long term statistic across all 5 years for this one month.
</span> <span class="n">long_term_mean</span> <span class="o">=</span> <span class="n">historical_data</span><span class="p">[</span><span class="n">month</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># Iterate across our historical data, selecting this one month,
</span> <span class="c1"># 5 months across 5 years, all the same month.
</span> <span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="n">historical_data</span>
<span class="n">sample_mean</span> <span class="o">=</span> <span class="n">year</span><span class="p">[</span><span class="n">month</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># Calculate the error of this month versus the long term statistic.
</span> <span class="n">sample_error</span> <span class="o">=</span> <span class="n">absolute</span><span class="p">(</span><span class="n">sample_mean</span> <span class="o">-</span> <span class="n">long_term_mean</span><span class="p">)</span>
<span class="c1"># Select the sample with the lowest sample error,
</span> <span class="c1"># this is the historical month we will use in our typical year forecast.
</span> <span class="n">selected_sample</span> <span class="o">=</span> <span class="n">argmin</span><span class="p">(</span><span class="n">sample_errors</span><span class="p">)</span>
</code></pre></div></div>
<p>After following this procedure, we will select 12 monthly samples - one for each month in a year, creating our typical year forecast.</p>
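<p>A runnable sketch of the pseudocode above, using pandas - the price data here is synthetic and the variable names are illustrative, not taken from the original analysis:</p>

```python
import numpy as np
import pandas as pd

# Synthetic 5 years of daily prices - stands in for real historical data.
idx = pd.date_range("2017-01-01", "2021-12-31", freq="D")
prices = pd.Series(np.random.default_rng(42).uniform(0, 200, len(idx)), index=idx)

typical = {}
for month in range(1, 13):
    # All historical data for this calendar month, across every year.
    candidates = prices[prices.index.month == month]
    # One long term statistic across all years for this month.
    long_term_mean = candidates.mean()
    # Error of each year's month versus the long term statistic.
    errors = (candidates.groupby(candidates.index.year).mean() - long_term_mean).abs()
    # Select the year whose month is most similar to the long term average.
    typical[month] = errors.idxmin()

print(typical)  # maps month number -> selected historical year
```

<p>The 12 selected months can then be concatenated into the typical year forecast.</p>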
<h2 id="typical-year-forecast-for-south-australian-electricity-prices">Typical Year Forecast for South Australian Electricity Prices</h2>
<p>To further demonstrate the idea, we will first limit ourselves to <strong>forecasting a single month</strong> - January, for electricity prices in South Australia, using 10 years of historical data.</p>
<p>Let’s first start by <strong>calculating our long term statistic</strong> - the average price in January across the entire dataset, which is <code class="language-plaintext highlighter-rouge">85.449 $/MWh</code>.</p>
<p>We can then look at what the average price was in each January and calculate the <strong>error versus the long term statistic</strong>.</p>
<p>This leads us to selecting January 2017 as our typical month of electricity prices:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: left">month</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">error-mean</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">January</td>
<td style="text-align: right">25.6153</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">59.8337</td>
</tr>
<tr>
<td style="text-align: right">2013</td>
<td style="text-align: left">January</td>
<td style="text-align: right">59.1246</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">26.3244</td>
</tr>
<tr>
<td style="text-align: right">2014</td>
<td style="text-align: left">January</td>
<td style="text-align: right">88.8675</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">3.41845</td>
</tr>
<tr>
<td style="text-align: right">2015</td>
<td style="text-align: left">January</td>
<td style="text-align: right">34.68</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">50.769</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">January</td>
<td style="text-align: right">50.2573</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">35.1917</td>
</tr>
<tr>
<td style="text-align: right">2017</td>
<td style="text-align: left">January</td>
<td style="text-align: right">84.2589</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right"><strong>1.19009</strong></td>
</tr>
<tr>
<td style="text-align: right">2018</td>
<td style="text-align: left">January</td>
<td style="text-align: right">158.757</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">73.3081</td>
</tr>
<tr>
<td style="text-align: right">2019</td>
<td style="text-align: left">January</td>
<td style="text-align: right">241.025</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">155.576</td>
</tr>
<tr>
<td style="text-align: right">2020</td>
<td style="text-align: left">January</td>
<td style="text-align: right">83.2037</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">2.24526</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">January</td>
<td style="text-align: right">28.7008</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">56.7482</td>
</tr>
</tbody>
</table>
<p>We can then repeat the procedure above to forecast the remaining 11 months of the year, ending up with 12 months that make up our typical year forecast:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: left">month</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">error-mean</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2017</td>
<td style="text-align: left">January</td>
<td style="text-align: right">84.2589</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">1.19009</td>
</tr>
<tr>
<td style="text-align: right">2020</td>
<td style="text-align: left">February</td>
<td style="text-align: right">64.1771</td>
<td style="text-align: right">71.2239</td>
<td style="text-align: right">7.04685</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">March</td>
<td style="text-align: right">68.7727</td>
<td style="text-align: right">66.6858</td>
<td style="text-align: right">2.08692</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">April</td>
<td style="text-align: right">52.1361</td>
<td style="text-align: right">64.1214</td>
<td style="text-align: right">11.9854</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">May</td>
<td style="text-align: right">70.6976</td>
<td style="text-align: right">70.1316</td>
<td style="text-align: right">0.565976</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">June</td>
<td style="text-align: right">84.3886</td>
<td style="text-align: right">81.6753</td>
<td style="text-align: right">2.71335</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">July</td>
<td style="text-align: right">91.1873</td>
<td style="text-align: right">94.7737</td>
<td style="text-align: right">3.58638</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">August</td>
<td style="text-align: right">66.2397</td>
<td style="text-align: right">64.8625</td>
<td style="text-align: right">1.37717</td>
</tr>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">September</td>
<td style="text-align: right">53.7977</td>
<td style="text-align: right">54.7594</td>
<td style="text-align: right">0.961707</td>
</tr>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">October</td>
<td style="text-align: right">50.9616</td>
<td style="text-align: right">52.3186</td>
<td style="text-align: right">1.35705</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">November</td>
<td style="text-align: right">61.8883</td>
<td style="text-align: right">57.3279</td>
<td style="text-align: right">4.56045</td>
</tr>
<tr>
<td style="text-align: right">2015</td>
<td style="text-align: left">December</td>
<td style="text-align: right">66.8321</td>
<td style="text-align: right">67.2765</td>
<td style="text-align: right">0.444369</td>
</tr>
</tbody>
</table>
<p>Our typical year forecast, in all its light blue glory:</p>
<p><img src="/assets/typical-year/f2.png" alt="Typical year forecast using the mean as a statistic." /></p>
<p>We can compare this typical year forecast to actual historical prices - for the years where we have sampled our typical month from, our forecast directly overlaps the historical data:</p>
<p><img src="/assets/typical-year/f3.png" alt="Comparing our typical year forecast using the mean as a statistic to historical data." /></p>
<h2 id="extending-the-forecast-with-more-statistics">Extending the Forecast With More Statistics</h2>
<p>Above we only considered the mean when selecting a month. The mean is a measurement of the <em>central tendency</em> of a distribution - selecting months by the mean gives our forecast a similar central point to the long term average.</p>
<p>For some energy models, <strong>the variance is more important than the average</strong>.</p>
<p>The variance is how <em>spread out</em> prices are - it’s important for batteries operating in wholesale arbitrage, as this spread puts an upper limit on how profitable shifting electricity between intervals can be.</p>
<p>Our procedure for creating a typical year forecast based on <strong>both the mean and the variance</strong> is similar to only considering the mean.</p>
<p>We instead calculate two additional statistics (the long term standard deviation and the sample standard deviation), and include them in our sample error:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Creating a typical year forecast based on the mean & standard deviation
</span>
<span class="c1"># Iterate across each month in a year.
</span><span class="k">for</span> <span class="n">month</span> <span class="ow">in</span> <span class="p">(</span><span class="n">Jan</span><span class="p">,</span> <span class="n">Feb</span> <span class="p">...</span> <span class="n">Nov</span><span class="p">,</span> <span class="n">Dec</span><span class="p">):</span>
<span class="c1"># Calculate two statistics - long term mean & standard deviation.
</span> <span class="n">long_term_mean</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">long_term_std</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">std</span><span class="p">()</span>
<span class="c1"># Iterate across historical data & calculate sample errors,
</span> <span class="c1"># using both long term statistics
</span> <span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="p">(</span><span class="n">historical</span> <span class="n">data</span><span class="p">):</span>
<span class="n">sample_mean</span> <span class="o">=</span> <span class="n">month</span><span class="p">.</span><span class="n">year</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">sample_std</span> <span class="o">=</span> <span class="n">month</span><span class="p">.</span><span class="n">year</span><span class="p">.</span><span class="n">std</span><span class="p">()</span>
<span class="n">sample_error</span> <span class="o">=</span> <span class="n">absolute</span><span class="p">(</span><span class="n">long_term_mean</span> <span class="o">-</span> <span class="n">sample_mean</span><span class="p">)</span> <span class="o">+</span> <span class="n">absolute</span><span class="p">(</span><span class="n">long_term_std</span> <span class="o">-</span> <span class="n">sample_std</span><span class="p">)</span>
<span class="c1"># Select sample that minimizes error.
</span> <span class="n">selected_sample</span> <span class="o">=</span> <span class="n">argmin</span><span class="p">(</span><span class="n">sample_errors</span><span class="p">)</span>
</code></pre></div></div>
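<p>The two-statistic version can be sketched the same way - again with synthetic data and illustrative names; here the mean and standard deviation errors are simply summed, unweighted:</p>

```python
import numpy as np
import pandas as pd

# Synthetic 5 years of daily prices - stands in for real historical data.
idx = pd.date_range("2017-01-01", "2021-12-31", freq="D")
prices = pd.Series(np.random.default_rng(0).uniform(0, 200, len(idx)), index=idx)

typical = {}
for month in range(1, 13):
    candidates = prices[prices.index.month == month]
    # Two long term statistics for this month.
    long_term_mean, long_term_std = candidates.mean(), candidates.std()
    by_year = candidates.groupby(candidates.index.year)
    # Sample error combines the absolute error of both statistics.
    errors = (by_year.mean() - long_term_mean).abs() + (by_year.std() - long_term_std).abs()
    typical[month] = errors.idxmin()
```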
<p>Taking this approach again, we end up with our typical year forecast - different from our previous forecast where we only used the mean:</p>
<table>
<thead>
<tr>
<th style="text-align: left">month</th>
<th style="text-align: right">year</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">price-std</th>
<th style="text-align: right">long-term-std</th>
<th style="text-align: right">error</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">January</td>
<td style="text-align: right">2020</td>
<td style="text-align: right">83.2037</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">519.785</td>
<td style="text-align: right">504.705</td>
<td style="text-align: right">17.3251</td>
</tr>
<tr>
<td style="text-align: left">February</td>
<td style="text-align: right">2018</td>
<td style="text-align: right">109.17</td>
<td style="text-align: right">71.2239</td>
<td style="text-align: right">290.873</td>
<td style="text-align: right">300.955</td>
<td style="text-align: right">48.0282</td>
</tr>
<tr>
<td style="text-align: left">March</td>
<td style="text-align: right">2020</td>
<td style="text-align: right">46.9517</td>
<td style="text-align: right">66.6858</td>
<td style="text-align: right">225.829</td>
<td style="text-align: right">271.301</td>
<td style="text-align: right">65.2057</td>
</tr>
<tr>
<td style="text-align: left">April</td>
<td style="text-align: right">2015</td>
<td style="text-align: right">39.9493</td>
<td style="text-align: right">64.1214</td>
<td style="text-align: right">100.387</td>
<td style="text-align: right">99.2508</td>
<td style="text-align: right">25.3085</td>
</tr>
<tr>
<td style="text-align: left">May</td>
<td style="text-align: right">2016</td>
<td style="text-align: right">70.6976</td>
<td style="text-align: right">70.1316</td>
<td style="text-align: right">132.686</td>
<td style="text-align: right">133.63</td>
<td style="text-align: right">1.5091</td>
</tr>
<tr>
<td style="text-align: left">June</td>
<td style="text-align: right">2021</td>
<td style="text-align: right">84.3886</td>
<td style="text-align: right">81.6753</td>
<td style="text-align: right">96.1186</td>
<td style="text-align: right">130.305</td>
<td style="text-align: right">36.8999</td>
</tr>
<tr>
<td style="text-align: left">July</td>
<td style="text-align: right">2015</td>
<td style="text-align: right">73.5053</td>
<td style="text-align: right">94.7737</td>
<td style="text-align: right">226.191</td>
<td style="text-align: right">236.491</td>
<td style="text-align: right">31.5684</td>
</tr>
<tr>
<td style="text-align: left">August</td>
<td style="text-align: right">2013</td>
<td style="text-align: right">71.2364</td>
<td style="text-align: right">64.8625</td>
<td style="text-align: right">88.1036</td>
<td style="text-align: right">103.648</td>
<td style="text-align: right">21.9185</td>
</tr>
<tr>
<td style="text-align: left">September</td>
<td style="text-align: right">2012</td>
<td style="text-align: right">53.7977</td>
<td style="text-align: right">54.7594</td>
<td style="text-align: right">62.1015</td>
<td style="text-align: right">75.617</td>
<td style="text-align: right">14.4772</td>
</tr>
<tr>
<td style="text-align: left">October</td>
<td style="text-align: right">2019</td>
<td style="text-align: right">67.3398</td>
<td style="text-align: right">52.3186</td>
<td style="text-align: right">92.2279</td>
<td style="text-align: right">108.001</td>
<td style="text-align: right">30.7947</td>
</tr>
<tr>
<td style="text-align: left">November</td>
<td style="text-align: right">2019</td>
<td style="text-align: right">50.8623</td>
<td style="text-align: right">57.3279</td>
<td style="text-align: right">88.3317</td>
<td style="text-align: right">109.014</td>
<td style="text-align: right">27.1474</td>
</tr>
<tr>
<td style="text-align: left">December</td>
<td style="text-align: right">2013</td>
<td style="text-align: right">79.5734</td>
<td style="text-align: right">67.2765</td>
<td style="text-align: right">372.848</td>
<td style="text-align: right">318.756</td>
<td style="text-align: right">66.3892</td>
</tr>
</tbody>
</table>
<p>We can compare our two typical year forecasts directly:</p>
<p><img src="/assets/typical-year/f4.png" alt="Comparing typical year forecasts using the mean versus the mean and standard deviation." /></p>
<p>Typical year forecasting based on both the mean and the variance selects months with higher prices - including more of the tasty price spikes that make Australia’s National Electricity Market (NEM) so interesting for battery storage.</p>
<h1 id="evaluating-the-typical-year-forecast">Evaluating the Typical Year Forecast</h1>
<p>Let’s return to our original motivating example, with an additional estimate of our project’s cumulative savings using our typical year forecast based on the mean (shown as 2052 in green):</p>
<p><img src="/assets/typical-year/f5.png" alt="Typical year forecast using the mean as a statistic." /></p>
<p><strong>How great is that!</strong></p>
<p>Our typical year forecast does a <strong>fantastic job of cutting through the variance</strong> - modelling our project right in the middle of the high variance estimates we get when taking the traditional, industry standard approaches of using historical price data.</p>
<p>No longer are we slaves to the cruel master of time (well, perhaps we still are) - as the years go by, our estimation of project economics will stay stable and consistent, rather than varying wildly based on when we are doing our modelling.</p>
<p>As new price data becomes available, our typical year forecast will change (due to the long term statistics changing, or more recent data being more typical), but the variance from these changes will be minor compared to the massive year on year swings we get with the standard industry approaches.</p>
<h1 id="discussion">Discussion</h1>
<p>Above we have seen how great our typical year forecast is at reducing the variance of our estimates of project performance - let’s now discuss some challenges and potential extensions to this simple typical year forecasting method.</p>
<h2 id="challenges">Challenges</h2>
<h3 id="data-quantity">Data Quantity</h3>
<p>This methodology requires multiple years of data - if we only have access to a single year, this method is not appropriate.</p>
<h3 id="alignment">Alignment</h3>
<p>One problem that arises when concatenating interval data from different time periods is alignment at the intersection - the sample below from the typical year forecast produced above shows the issue: our forecast jumps from a Tuesday at the end of January 2017 to a Saturday at the start of February 2020:</p>
<table>
<thead>
<tr>
<th style="text-align: left">forecast</th>
<th style="text-align: left">original-timestamps</th>
<th style="text-align: right">price</th>
<th style="text-align: right">day-of-week-forecast</th>
<th style="text-align: right">day-of-week-original</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">2052-01-31 23:50:00</td>
<td style="text-align: left">2017-01-31 23:50:00</td>
<td style="text-align: right">39.52</td>
<td style="text-align: right">2</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: left">2052-01-31 23:55:00</td>
<td style="text-align: left">2017-01-31 23:55:00</td>
<td style="text-align: right">39.52</td>
<td style="text-align: right">2</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: left">2052-02-01 00:00:00</td>
<td style="text-align: left">2020-02-01 00:00:00</td>
<td style="text-align: right">299.2</td>
<td style="text-align: right">3</td>
<td style="text-align: right">5</td>
</tr>
<tr>
<td style="text-align: left">2052-02-01 00:05:00</td>
<td style="text-align: left">2020-02-01 00:05:00</td>
<td style="text-align: right">299.2</td>
<td style="text-align: right">3</td>
<td style="text-align: right">5</td>
</tr>
</tbody>
</table>
<p>This misalignment leads to an incorrect number of weekdays or weekend days in a year - important because energy demand and prices have strong weekly seasonality.</p>
<p>This alignment problem also occurs when you don’t use a typical year forecast - for example if you use price data from 2022 with technical data from 2010.</p>
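<p>The stitch-point misalignment in the table above can be checked directly - a small sketch using the two timestamps where the January and February samples meet:</p>

```python
import pandas as pd

# Last interval of the January sample & first interval of the February sample.
end_of_january = pd.Timestamp("2017-01-31 23:55:00")
start_of_february = pd.Timestamp("2020-02-01 00:00:00")

# pandas counts Monday=0 ... Sunday=6.
print(end_of_january.dayofweek)     # 1 - a Tuesday
print(start_of_february.dayofweek)  # 5 - a Saturday
```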
<h3 id="domain-expertise">Domain Expertise</h3>
<p>Domain expertise is required to set up a typical year forecast - primarily in defining the appropriate statistics.</p>
<p>Using multiple statistics can also require weighting - for example, if the standard deviation is orders of magnitude larger than the mean, we may want to weight the mean more heavily.</p>
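<p>One simple way to weight statistics that live on different scales is to normalize each error by its long term statistic before summing - a sketch of one possible choice, not the only one:</p>

```python
def sample_error(sample_mean, sample_std, long_term_mean, long_term_std):
    """Relative errors, so neither statistic dominates due to its scale."""
    mean_error = abs(sample_mean - long_term_mean) / long_term_mean
    std_error = abs(sample_std - long_term_std) / long_term_std
    return mean_error + std_error

# A sample whose mean & standard deviation both sit 10% away from the
# long term statistics scores 0.1 + 0.1 = 0.2:
print(sample_error(110, 550, 100, 500))
```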
<h2 id="extensions--improvements">Extensions & Improvements</h2>
<h3 id="higher-frequency-sampling">Higher Frequency Sampling</h3>
<p>In the examples above we have selected samples on a monthly basis - it is possible to instead select samples on a different frequency, such as week of the year (52 weeks) or day of the year (365 days).</p>
<h3 id="more-statistics">More Statistics</h3>
<p>One advantage of this methodology is the flexibility in the statistics we choose - unlike a loss function for a neural network, they do not need to be differentiable.</p>
<p>For example, we could use statistics like:</p>
<ul>
<li>mean, median, mode,</li>
<li>number of time periods above a threshold price,</li>
<li>number of negative prices.</li>
</ul>
<p>This is an exciting feature of typical year forecasting - the <strong>flexibility and simplicity of using any statistic</strong> that aligns your technical and financial models with your business goals.</p>
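<p>Because the statistics never need to be differentiable, any function that maps prices to a number will do - a sketch with some illustrative statistics (the 300 $/MWh spike threshold is an arbitrary choice):</p>

```python
import numpy as np

# Any callable mapping an array of prices to a number can act as a statistic.
statistics = {
    "mean": np.mean,
    "median": np.median,
    "n_spikes": lambda p: np.sum(np.asarray(p) > 300.0),
    "n_negative": lambda p: np.sum(np.asarray(p) < 0.0),
}

prices = [50.0, -10.0, 80.0, 450.0, 60.0]
print({name: float(stat(prices)) for name, stat in statistics.items()})
# {'mean': 126.0, 'median': 60.0, 'n_spikes': 1.0, 'n_negative': 1.0}
```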
<h1 id="summary">Summary</h1>
<p>In this post we introduced <em>typical year forecasting</em> - a flexible, powerful forecasting method suitable for use in energy project business case modelling.</p>
<p>Typical year forecasts address a <strong>hidden flaw in the price assumptions commonly used in industry</strong> - the large errors introduced by using recent price data.</p>
<p>A typical year forecast addresses these issues by <strong>selecting historical price data that is most similar to all the historical data</strong>.</p>
<p>Typical year forecasts have the following advantages:</p>
<ul>
<li><strong>simple to create</strong> - no machine learning, gradients or iterative calculations,</li>
<li><strong>interpretable</strong> - easy to understand why one sample is selected over others,</li>
<li><strong>realistic</strong> - the forecast is made from actual historical data,</li>
<li><strong>domain flexible</strong> - can be used with any time series (not just electricity prices),</li>
<li><strong>statistically flexible</strong> - can use a range of statistics to define what typical means.</li>
</ul>
<p>A typical year forecast has the following disadvantages:</p>
<ul>
<li><strong>data quantity</strong> - requires at least 2 years of historical data,</li>
<li><strong>domain knowledge</strong> - requires selecting & weighting statistics based on problem understanding.</li>
</ul>
<p>Further extensions on the methods shown above include:</p>
<ul>
<li><strong>higher frequency sampling</strong> on a weekly or daily basis,</li>
<li>using a <strong>variety of statistics</strong> to define similarity, such as the number of price spikes or the number of negative prices.</li>
</ul>
<hr />
<p>Thanks for reading!</p>
<p>If you enjoyed this post, make sure to check out <a href="https://adgefficiency.com/energy-py-linear-forecast-quality/">Measuring Forecast Quality using Linear Programming</a>.</p>
<p>You can find the materials to reproduce this analysis at <a href="https://github.com/ADGEfficiency/typical-year-forecasting-electricity-prices">adgefficiency/typical-year-electricity-price-forecasting</a>.</p>
<h1 id="jevons-paradox">Jevon’s Paradox</h1>
<p><em>2022-09-04 - <a href="https://adgefficiency.com/jevons-paradox">adgefficiency.com/jevons-paradox</a></em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>created: 2017-10-30, updated: 2022-09-04
</code></pre></div></div>
<p>It’s intuitive that improving energy efficiency will reduce energy use. <strong>Unfortunately, it’s not that simple</strong>.</p>
<h2 id="the-coal-question">The Coal Question</h2>
<p>In the 1865 book <em>The Coal Question</em> W. Stanley Jevons points out that efficiency improvements in the production of iron occurred at the same time as increases in the total amount of coal used to produce iron.</p>
<p><strong>The improved efficiency of coal use did not reduce coal consumption - instead coal consumption increased</strong>.</p>
<p>This is Jevons Paradox - improving the efficiency of resource use leads to increases in resource consumption. LED lighting is a modern example, with cheap, high efficiency LED lights now covering the planet. <strong>This is an inconvenient truth for energy efficiency</strong>.</p>
<h2 id="thinking-in-second--third-order-effects">Thinking In Second & Third Order Effects</h2>
<p>It’s not that efficiency doesn’t work - improving efficiency means there will be less primary energy per unit of utility.</p>
<p>It’s what happens afterwards that is the problem - the efficiency gains can be cancelled out by second and third order effects. Let’s look at some of the effects of improving the efficiency of gas-fired heating:</p>
<ul>
<li><strong>first order effect</strong> - less gas is required to supply the same amount of heat. This effect is positive - we don’t burn as much gas to provide the same amount of energy.</li>
<li><strong>second order effect</strong> - we now get more heat for the same amount of money. We spend the same amount, we get more heat - but no carbon saving. We can afford to heat bigger homes for the same amount of gas.</li>
<li><strong>third order effect</strong> - increased efficiency means consumers pay less for gas - meaning this money can be spent elsewhere. What does the economy do with this saved money?</li>
</ul>
<p>If the efficiency saving is spent on taking a long haul holiday, we could actually see an increase in global carbon emissions. <strong>We improve the efficiency of supplying heat but overall as a civilization we burn more carbon</strong>. Alternatively if the saving is spent on building cleaner energy generation then even increases in utility could lead to a carbon saving.</p>
<p><strong>It’s very difficult to understand what effect Jevons Paradox has across different consumers, economies and technologies</strong>. Measuring the first order effects of energy efficiency projects is notoriously difficult - let alone any second or third order effects.</p>
<h2 id="is-energy-efficiency-still-worthwhile">Is Energy Efficiency Still Worthwhile?</h2>
<p>Energy efficiency drives economic progress - this makes it worth doing. Yet for those concerned with decarbonization, energy efficiency may not be as effective as expected.</p>
<p>Jevons Paradox does not always apply - the negative second or third order effects of energy efficiency can be smaller or larger than the efficiency saving. There is also huge value in increasing adoption of advanced technology, such as the additional light we get from LED lights.</p>
<p>In some cases, however, focusing on making sure energy comes from clean primary sources is a safer bet than trying to use dirty energy more efficiently - you may just end up using more dirty energy as a result.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox - Wikipedia</a></li>
<li><a href="http://bigthink.com/politeia/the-energy-efficiency-paradox">The Energy Efficiency Paradox</a></li>
<li><a href="http://www.nakedcapitalism.com/2011/10/energy-efficiency-doesn%e2%80%99t-work.html">Energy Efficiency Doesn’t Work</a></li>
<li><a href="http://reason.com/archives/2012/10/31/the-paradox-of-energy-efficiency">The Paradox of Energy Efficiency</a></li>
</ul>
<hr />
<p>Thanks for reading!</p>