<h1>From Vim & QWERTY to Neovim & DVORAK</h1>
<p>Adam Green - 2023-11-21</p>
<blockquote>
<p>There are no veils, curtains, doors, walls or anything between what pours out of Bob’s hand onto the page and what is somehow available to the core of people who are believers in him.</p>
</blockquote>
<p><img src="/assets/dvorak/dylan.png" alt="" /></p>
<blockquote>
<p>There’s some people who’d say ‘You know, not interested’.</p>
<p>But if you’re interested, he goes way, way deep.</p>
<p>Joan Baez on Bob Dylan - No Direction Home</p>
</blockquote>
<p><strong>I’m a Neovim, DVORAK & split keyboard user</strong>.</p>
<p>This post details my transitions between these tools:</p>
<ul>
<li>Atom to Vim to Neovim,</li>
<li>QWERTY to DVORAK keyboard layout,</li>
<li>a traditional to split keyboard.</li>
</ul>
<p>At the end of this post is a table summarizing each transition:</p>
<ul>
<li>how long it took,</li>
<li>the productivity increase,</li>
<li>health improvements,</li>
<li>whether I would recommend it.</li>
</ul>
<h1 id="the-journey">The Journey</h1>
<p>I started my programming journey in early 2017 - on a Windows laptop using the now deceased Atom editor. I didn’t know any better!</p>
<p>Programming is both my profession and a hobby. <strong>I enjoy working on improving my tools and workflows</strong>, which have additional benefits of making me a more effective programmer and improving the health of my aging, tired body.</p>
<p><strong>Not all developers are like this</strong> - some of the best programmers I’ve worked with have no interest in changing keyboard shortcuts, let alone learning Lua to configure their text editor.</p>
<p>To each their own - <strong>but if you are interested, this goes way, way deep</strong>.</p>
<h1 id="vim">Vim</h1>
<p>I started to learn Vim in the Christmas holidays of 2017 - I cannot remember exactly why.</p>
<p><img src="/assets/dvorak/vim.png" alt="" /></p>
<p>The first few days were tough - it took around a week to feel comfortable with the basics of Vim such as <code class="language-plaintext highlighter-rouge">hjkl</code>, the different modes (Normal, Insert etc), moving between splits and moving to different places in a file.</p>
<p><strong>After two weeks I felt as productive as I was in Atom</strong> - beyond that my productivity has become more powerful than you could possibly imagine.</p>
<p>Over the years I added colorschemes, plugins, keybinds, macros & abbreviations - <a href="https://github.com/ADGEfficiency/dotfiles/blob/master/dotfiles/.vimrc">you can find my final <code class="language-plaintext highlighter-rouge">.vimrc</code> here</a>.</p>
<p>I do still use Vim when I’m working on remote servers - sometimes I’ll clone my <a href="https://github.com/ADGEfficiency/dotfiles">dotfiles</a> if I’ll be working there for a while and don’t want to install Neovim.</p>
<p>Alongside Vim I use Tmux and fzf. <strong>Tmux and fzf are as crucial for making Vim your main text editor as Vim itself</strong>. Without any one of the three, my terminal-based development style would not work. This is one of the places where people get stuck with Vim - you need more than Vim to make a productive Vim setup.</p>
<p>Tmux is a terminal multiplexer - it lets you open terminal panes alongside each other in the same window, or switch between separate windows.</p>
<p>I use fzf for finding and opening files, both from the terminal with <code class="language-plaintext highlighter-rouge">**&lt;TAB&gt;</code> and within Vim using <code class="language-plaintext highlighter-rouge">&lt;Space&gt;</code> to run fzf in the current directory via a keybinding.</p>
<p>A healthy set of shell aliases and custom functions is also important when working with a terminal editor like Vim.</p>
<p>I use a script <code class="language-plaintext highlighter-rouge">s</code> to quickly use fzf to search for files to open in my <code class="language-plaintext highlighter-rouge">$EDITOR</code> from the current directory (<a href="https://github.com/ADGEfficiency/dotfiles/blob/master/scripts/s">script is here</a>):</p>
<p><img src="%7B%7B%22/assets/dvorak/s.png%22%7D%7D" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env zsh
TERM_HEIGHT=$(tput lines)
MIN_HEIGHT=20

# prompt for files using fzf, with a preview pane if the terminal is tall enough
files=$(if [ "$TERM_HEIGHT" -ge "$MIN_HEIGHT" ]; then
  fzf --preview 'bat -p --color=always {}' --height 60% -m
else
  fzf --no-preview --height 40% -m
fi)

# only open $EDITOR if fzf exited cleanly (not interrupted by Ctrl-C)
if [ $? -eq 0 ]; then
  $EDITOR ${(f)files}
fi
</code></pre></div></div>
<h2 id="do-i-recommend-vim">Do I Recommend Vim?</h2>
<p>Vim is amazing, but outdated - use Neovim instead.</p>
<p>Vim itself still has a lot going for it - the ecosystem of plugins and customization remains fantastic.</p>
<p>The initial configuration for Vim can be challenging - I would budget 1-2 days to get a basic setup working.</p>
<p>Even if you love VS Code, learning to use Vim is useful. It’s almost always available on remote servers, and it’s a better editor than other commonly available editors like nano.</p>
<p>Vim keybindings are also everywhere - you can enable them in the shell with <code class="language-plaintext highlighter-rouge">$ set -o vi</code> (instead of the default Emacs bindings), and many programs (IDEs or browsers) have Vim plugins.</p>
<h1 id="neovim">Neovim</h1>
<p>I started my transition to Neovim in July 2022 - motivated by the Vim9 script schism that divided the Vim community.</p>
<p><img src="/assets/dvorak/nvim.png" alt="" /></p>
<p>Transitioning to Neovim after 3 years of Vim was quick - the in-editor experience is very similar.</p>
<p>It took around half a day to convert my <code class="language-plaintext highlighter-rouge">.vimrc</code> to a functional Lua based setup, followed by a week or two of tweaking my config and adding plugins.</p>
<p>I was able to bring along all of my Vimscript plugins, which is a huge selling point of Neovim. I do prefer Lua written plugins where possible, but still use many of the same plugins as with Vim - <a href="https://github.com/ADGEfficiency/dotfiles/blob/master/nvim/lua/adam/plugins.lua">you can find all my Neovim plugins here</a>.</p>
<h2 id="do-i-recommend-neovim">Do I Recommend Neovim?</h2>
<p><strong>I would strongly recommend Neovim to anyone who is starting out with Vim or to experienced Vim users - it’s great</strong>.</p>
<p>Neovim is an improvement over Vim, and has a bubbling, exciting ecosystem of plugins and users. I have found the language servers, linting, formatting and completion experience an improvement over Vim.</p>
<p>If you are a Vim user it’s not a big transition - all of your Vimscript plugins will work as expected.</p>
<p>It’s nice to use Lua for configuration - it’s more flexible and a more useful, transferable skill than Vimscript.</p>
<p>If you want to get started with Neovim, look at <a href="https://github.com/nvim-lua/kickstart.nvim">kickstart.nvim</a>.</p>
<h1 id="dvorak">DVORAK</h1>
<p>I started my transition to DVORAK in May 2019 - motivated by a desire to improve the health of my hands.</p>
<p><img src="/assets/dvorak/layout.png" alt="" /></p>
<p>After two and a half years of programming, I was suffering from muscular soreness & tiredness in my hands. My hands felt fatigued - like they were doing too much work.</p>
<p>I was aware that there were alternative keyboard layouts designed to be kinder to our hands - after a day or two of sporadic, repetitive searching on Google about the different options, I decided to give DVORAK a shot.</p>
<p>It took me around 2 weeks to get back to a somewhat reasonable level of productivity, but I was not back to my previous QWERTY level.</p>
<p><strong>My typing remained inaccurate for a long time</strong> - I only felt like I was back to where I was in August 2021 - making the transition over a year long.</p>
<p>Sometimes my typing is still less accurate than it was (particularly as I use a keyboard with blank keycaps) - but it’s manageable. I don’t feel like DVORAK led to a significant productivity improvement.</p>
<h2 id="do-i-recommend-dvorak">Do I Recommend DVORAK?</h2>
<p>I would not recommend the DVORAK layout - while I’m glad I have done it and wouldn’t switch back, it takes a long, long time to get used to.</p>
<p>I can still type QWERTY if needed - it’s keyboard & context dependent. I can still type QWERTY on my phone without even realizing it’s a different layout.</p>
<h1 id="split-keyboard">Split Keyboard</h1>
<p>I started using a split keyboard in July 2021 - motivated by a desire to improve the health of my back.</p>
<p><img src="/assets/dvorak/ergo.png" alt="" /></p>
<p>Previously I had used the Apple Keyboard, then moved to a <a href="https://vortexgear.store/products/race-3-micro-usb">Vortex Race 3</a>, which I still use today. The split keyboard I use today is the <a href="https://ergodox-ez.com">Ergodox EZ</a>.</p>
<p><strong>The main benefit of a split keyboard is that your hands rest further apart</strong>. This allows your chest to expand, and reduces the strain on your upper and middle back - in particular reducing pain between the shoulder blades.</p>
<p>It took around 1 week to get back to the same level of productivity as on a traditional keyboard. I have found a moderate productivity increase using a split keyboard.</p>
<p>The Ergodox EZ allows customization of the keyboard layout using ORYX - <a href="https://configure.zsa.io/ergodox-ez/layouts/vJLGQ/latest/0">you can find my layout here</a>.</p>
<h2 id="do-i-recommend-a-split-keyboard">Do I Recommend a Split Keyboard?</h2>
<p>I would recommend a split keyboard, especially if you have back pain between your shoulder blades - it’s a small time investment for a real health benefit.</p>
<p>I have no problem going back to a normal keyboard - unlike DVORAK, using a split keyboard will not impact your ability to use a normal keyboard.</p>
<h1 id="thoughts-on-dvorak-and-vim">Thoughts on DVORAK and Vim</h1>
<p>The combination of DVORAK and Vim is an interesting one - both are very opinionated about how you should use your keyboard.</p>
<p>I was already a proficient Vim user when I decided to switch to DVORAK.</p>
<p><strong>Foundational to any keyboard layout and Vim is remapping <code class="language-plaintext highlighter-rouge">&lt;CAPSLOCK&gt;</code> to <code class="language-plaintext highlighter-rouge">&lt;ESCAPE&gt;</code></strong>. In Vim you use the <code class="language-plaintext highlighter-rouge">&lt;ESCAPE&gt;</code> key to move from insert to normal mode - easy access to the escape key is essential.</p>
<h2 id="why-dvorak">Why DVORAK?</h2>
<p>Most computer keyboards are laid out in QWERTY - named after the first six keys of the top letter row.</p>
<p>The big idea in Dvorak is the importance of the middle row (also known as the home row).</p>
<p>Vim users know the importance of the home row from <code class="language-plaintext highlighter-rouge">hjkl</code> - the keys used for cursor movement in Vim. Dvorak puts all the vowels on the home row - the keys you access the most are closest to your fingers.</p>
<p>The other notable feature of Dvorak is the location of the punctuation characters <code class="language-plaintext highlighter-rouge">' , .</code> - these are located in a prime position.</p>
<p>An interesting thing about learning DVORAK was that the time to learn keys is long tailed. Some (such as <code class="language-plaintext highlighter-rouge">, .</code> and <code class="language-plaintext highlighter-rouge">aoeu</code>) come very easily, while others like <code class="language-plaintext highlighter-rouge">r y f g</code> took a while.</p>
<h2 id="losing-hjkl">Losing hjkl</h2>
<p>In Vim <code class="language-plaintext highlighter-rouge">hjkl</code> are used for cursor movement - they are the keys you use to move your cursor around a file in normal mode.</p>
<p>In DVORAK you lose the position and order of <code class="language-plaintext highlighter-rouge">hjkl</code>. Initially I considered remapping <code class="language-plaintext highlighter-rouge">hjkl</code> to the same position as QWERTY, but decided against it. It’s been fine.</p>
<h2 id="combinations-that-work-great">Combinations That Work Great</h2>
<p>There are some common Vim key combinations that feel great in Dvorak.</p>
<p><code class="language-plaintext highlighter-rouge">:</code> (Vim command mode) is easy to reach. <code class="language-plaintext highlighter-rouge">:w</code> and <code class="language-plaintext highlighter-rouge">:wq</code> feel great - you don’t need to move either hand.</p>
<p><code class="language-plaintext highlighter-rouge">"</code>, <code class="language-plaintext highlighter-rouge">,</code> and <code class="language-plaintext highlighter-rouge">.</code> are easy to reach. The keys of <code class="language-plaintext highlighter-rouge">.py</code> are all next to each other, and the two keys of <code class="language-plaintext highlighter-rouge">ls</code> sit side by side.</p>
<p><code class="language-plaintext highlighter-rouge">gcc</code> is easy to reach and <code class="language-plaintext highlighter-rouge">&lt;C-r&gt;</code> requires no hand movement.</p>
<h2 id="challenges">Challenges</h2>
<p>One challenge is anything <code class="language-plaintext highlighter-rouge">g</code> or <code class="language-plaintext highlighter-rouge">f</code> related. In Vim <code class="language-plaintext highlighter-rouge">gf</code> opens a file under the cursor - as these two keys are next to each other, it requires moving both hands from their natural position.</p>
<p>Another challenge are the <code class="language-plaintext highlighter-rouge">{}</code> and <code class="language-plaintext highlighter-rouge">[]</code> keys - on a DVORAK layout, these are hard to get at. A split keyboard helps this a lot, as you can put these on the thumb keys.</p>
<h1 id="summary">Summary</h1>
<p>Here is a summary of each of the tool and workflow transitions - years of human experience reduced to a Markdown table:</p>
<table>
<thead>
<tr>
<th>Transition</th>
<th>Time to Transition</th>
<th>Initial Setup Required</th>
<th>Productivity Increase</th>
<th>Health Improvement</th>
<th>Recommended</th>
</tr>
</thead>
<tbody>
<tr>
<td>Atom to Vim</td>
<td>2 weeks</td>
<td>1 day</td>
<td>High</td>
<td>None</td>
<td>✅</td>
</tr>
<tr>
<td>Vim to Neovim</td>
<td>1/2 day</td>
<td>1/2 day</td>
<td>Moderate</td>
<td>None</td>
<td>✅</td>
</tr>
<tr>
<td>QWERTY to DVORAK</td>
<td>>1 year</td>
<td>None</td>
<td>None</td>
<td>Less hand fatigue</td>
<td>❌</td>
</tr>
<tr>
<td>Split Keyboard</td>
<td>3 weeks</td>
<td>1/2 day</td>
<td>Moderate</td>
<td>Back pain relief</td>
<td>✅</td>
</tr>
</tbody>
</table>
<hr />
<p>Thanks for reading!</p>
<p>Take a look at my <a href="https://github.com/ADGEfficiency/dotfiles">dotfiles</a> if you’re interested in my setup - <a href="https://github.com/ADGEfficiency/dotfiles/tree/master/nvim">my Lua Neovim config is here</a>.</p>
<h1>Measuring Forecast Accuracy with Linear Programming</h1>
<p>Adam Green - 2023-02-23</p>
<p>This post introduces a methodology to measure the accuracy of an electricity price forecast using linear programming.</p>
<h2 id="predictive-accuracy-vs-business-value">Predictive Accuracy vs. Business Value</h2>
<p>The ideal forecast quality measurement directly aligns with a key business metric. Models can rarely be trained this way - instead they are trained with error measures that will look familiar to anyone who does gradient-based optimization, such as mean squared error.</p>
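<p>For contrast, a conventional statistical measure like mean squared error can be computed directly from prices, with no notion of cost. A quick sketch, using the first three intervals of the dataset that appears later in this post:</p>

```python
# mean squared error between the actual trading price and the AEMO
# predispatch forecast - first three intervals of the dataset below
actual = [177.11, 135.31, 143.21]
forecast = [97.58039, 133.10307, 138.59979]

mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
print(f"mse: {mse:.1f}")  # prints: mse: 2117.0
```

A number like this says nothing about what the forecast error costs - which is the gap the rest of this post fills.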
<p>This post uses linear programming to measure forecast quality in terms of a key business metric - cost.</p>
<p>A battery operating in price arbitrage is optimized using actual prices and forecast prices.</p>
<p>The forecast error can then be quantified by how much money dispatching the battery using the forecast leaves on the table versus dispatching with perfect foresight of prices.</p>
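<p>To make the methodology concrete, here is a self-contained toy version - not the energy-py-linear implementation, just a lossless 1 MW, 1 MWh battery dispatched optimally by dynamic programming, with the forecast-driven plan settled at actual prices. The price series are made up for illustration:</p>

```python
def plan(prices):
    """Optimal dispatch for a lossless 1 MW, 1 MWh battery via dynamic
    programming over state of charge - returns one action per interval
    (-1 charge, +1 discharge, 0 hold)."""
    n = len(prices)
    # value[t][s] = best profit from interval t onward in state s (0 empty, 1 full)
    value = [[0.0, 0.0] for _ in range(n + 1)]
    for t in range(n - 1, -1, -1):
        p = prices[t]
        value[t][0] = max(value[t + 1][0], -p + value[t + 1][1])  # hold or charge
        value[t][1] = max(value[t + 1][1], p + value[t + 1][0])   # hold or discharge
    actions, state = [], 0
    for t in range(n):
        p = prices[t]
        if state == 0 and -p + value[t + 1][1] > value[t + 1][0]:
            actions.append(-1)  # charge
            state = 1
        elif state == 1 and p + value[t + 1][0] > value[t + 1][1]:
            actions.append(1)  # discharge
            state = 0
        else:
            actions.append(0)  # hold
    return actions


def profit(actions, prices):
    """Revenue of a dispatch plan settled at a given set of prices."""
    return sum(a * p for a, p in zip(actions, prices))


actual = [50, 40, 90, 30, 80]
forecast = [45, 55, 85, 35, 60]

perfect = profit(plan(actual), actual)     # dispatch with perfect foresight
realised = profit(plan(forecast), actual)  # forecast plan, actual prices
print(perfect - realised)  # money left on the table by the forecast
```

With these made-up prices the perfect-foresight plan earns $100 while the forecast-driven plan earns $90 - a forecast error of $10.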
<h1 id="data">Data</h1>
<p>This work uses <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> for the battery linear program - you can find the code & data in <a href="https://github.com/ADGEfficiency/energy-py-linear/blob/main/examples/forecast-accuracy.py">examples/forecast-accuracy.py</a> - the full source code is also available at the bottom of this post.</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip install energypylinear
</code></pre></div></div>
<p>The dataset used is a single sample of the South Australian trading price and the AEMO predispatch price forecast.</p>
<p>Both the price and forecast are supplied by AEMO for the National Electricity Market (NEM) in Australia.</p>
<p>A simple plot of the price and forecast is shown below in Figure 1:</p>
<p><img src="/assets/linear-forecast/forecast.png" alt="" /></p>
<center>
<em>Figure 1 - South Australian trading price and predispatch forecast from July 2018.</em>
</center>
<h1 id="method">Method</h1>
<p>First we create an instance of the <code class="language-plaintext highlighter-rouge">Battery</code> class - a 2 MW, 4 MWh battery with a 90% round-trip efficiency, sized so it can chase the arbitrage opportunities in this dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
</code></pre></div></div>
<p>We then dispatch the battery using perfect foresight of prices:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">actuals</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'Trading Price [$/MWh]'</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Next we dispatch using the forecast:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forecasts</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">'Predispatch Forecast [$/MWh]'</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
<p>We can then create <code class="language-plaintext highlighter-rouge">epl.Account</code> objects to represent the financials for these two simulations.</p>
<p>The trick is using the actuals interval data with the forecast simulation in <code class="language-plaintext highlighter-rouge">forecast_account</code> - this evaluates the economics with actual prices but dispatch optimized for forecasts:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate the variance between accounts
</span><span class="n">actual_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">actuals</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">forecast_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">forecasts</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">=</span> <span class="n">actual_account</span> <span class="o">-</span> <span class="n">forecast_account</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">forecast error: $ </span><span class="si">{</span><span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.2</span><span class="n">f</span><span class="si">}</span><span class="s"> pct: </span><span class="si">{</span><span class="mi">100</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span> <span class="o">/</span> <span class="n">actual_account</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.1</span><span class="n">f</span><span class="si">}</span><span class="s"> %"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">forecast error: $</span><span class="w"> </span>92.97 pct: 28.5 %
</code></pre></div></div>
<h1 id="discussion">Discussion</h1>
<h2 id="extend-to-different-domains">Extend to Different Domains</h2>
<p>The method above is specific to using batteries for wholesale price arbitrage.</p>
<p>The idea of using variance between two optimization runs with different inputs can be extended to many business problems.</p>
<p>If there is any error in the optimization (say convergence to a local minimum), then the final quality measurement combines error from both the forecast and from the optimization that used it.</p>
<p>A large capacity battery operating in price arbitrage does somewhat resemble arbitrage of stocks, so the error measurement might be useful for comparing forecasts. It’s less clear how useful this model would be for a temperature prediction.</p>
<h2 id="negative-value">Negative Value</h2>
<p>A challenge with using this measurement of forecast error is what happens when the net benefit of dispatching the battery to a forecast is negative - i.e. when the forecast quality is so bad that using it ends up losing money. Unlike other error measures such as mean squared error, it’s not appropriate to simply take the absolute value.</p>
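<p>A toy illustration of this failure mode, with made-up numbers and a lossless battery that fully charges or discharges in a single interval - a forecast that inverts the real price spread instructs the battery to buy high and sell low, pushing the percentage error above 100%:</p>

```python
actual = [100.0, 10.0, 80.0]    # prices fall, then recover
forecast = [10.0, 100.0, 20.0]  # forecast inverts the real spread

# plan optimized on the forecast: charge in interval 0, discharge in interval 1
plan = [-1, +1, 0]
realised = sum(a * p for a, p in zip(plan, actual))  # -100 + 10 = -90

perfect = 70.0  # with perfect foresight: charge at 10, discharge at 80
error = perfect - realised
print(error, 100 * error / perfect)  # more than 100% of the achievable value
```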
<h1 id="full-example">Full Example</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="c1"># price and forecast csv data
</span> <span class="n">data</span> <span class="o">=</span> <span class="s">"""
Timestamp,Trading Price [$/MWh],Predispatch Forecast [$/MWh]
2018-07-01 17:00:00,177.11,97.58039000000001
2018-07-01 17:30:00,135.31,133.10307
2018-07-01 18:00:00,143.21,138.59978999999998
2018-07-01 18:30:00,116.25,128.09559
2018-07-01 19:00:00,99.97,113.29413000000001
2018-07-01 19:30:00,99.71,113.95063
2018-07-01 20:00:00,97.81,105.5491
2018-07-01 20:30:00,96.1,102.99768
2018-07-01 21:00:00,98.55,106.34366000000001
2018-07-01 21:30:00,95.78,91.82700000000001
2018-07-01 22:00:00,98.46,87.45
2018-07-01 22:30:00,91.88,85.65775
2018-07-01 23:00:00,91.69,85.0
2018-07-01 23:30:00,101.2,85.0
2018-07-02 00:00:00,139.55,80.99999
2018-07-02 00:30:00,102.9,75.85762
2018-07-02 01:00:00,83.86,67.86758
2018-07-02 01:30:00,71.1,70.21946
2018-07-02 02:00:00,60.35,62.151
2018-07-02 02:30:00,56.01,62.271919999999994
2018-07-02 03:00:00,51.22,56.79063000000001
2018-07-02 03:30:00,48.55,53.8532
2018-07-02 04:00:00,55.17,53.52591999999999
2018-07-02 04:30:00,56.21,49.57504
2018-07-02 05:00:00,56.32,48.42244
2018-07-02 05:30:00,58.79,54.15495
2018-07-02 06:00:00,73.32,58.01054
2018-07-02 06:30:00,80.89,68.31508000000001
2018-07-02 07:00:00,88.43,85.0
2018-07-02 07:30:00,201.43,119.73926999999999
2018-07-02 08:00:00,120.33,308.88984
2018-07-02 08:30:00,113.26,162.32117
"""</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">StringIO</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="c1"># battery model
</span> <span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="c1"># optimize for actuals
</span> <span class="n">actuals</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"Trading Price [$/MWh]"</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># optimize for forecasts
</span> <span class="n">forecasts</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"Predispatch Forecast [$/MWh]"</span><span class="p">],</span>
<span class="n">freq_mins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># calculate the variance between accounts
</span> <span class="n">actual_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">actuals</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">forecast_account</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">get_accounts</span><span class="p">(</span><span class="n">actuals</span><span class="p">.</span><span class="n">interval_data</span><span class="p">,</span> <span class="n">forecasts</span><span class="p">.</span><span class="n">simulation</span><span class="p">)</span>
<span class="n">variance</span> <span class="o">=</span> <span class="n">actual_account</span> <span class="o">-</span> <span class="n">forecast_account</span>
<span class="k">print</span><span class="p">(</span>
<span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">forecast error: $ </span><span class="si">{</span><span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.2</span><span class="n">f</span><span class="si">}</span><span class="s"> pct: </span><span class="si">{</span><span class="mi">100</span> <span class="o">*</span> <span class="n">variance</span><span class="p">.</span><span class="n">cost</span> <span class="o">/</span> <span class="n">actual_account</span><span class="p">.</span><span class="n">cost</span><span class="p">:</span><span class="mf">2.1</span><span class="n">f</span><span class="si">}</span><span class="s"> %"</span>
<span class="p">)</span>
<span class="s">"""
forecast error: $ 92.97 pct: 28.5 %
"""</span>
</code></pre></div></div>
<h1 id="summary">Summary</h1>
<p>This post introduces a method for measuring forecast accuracy using linear optimization of electric battery storage, by looking at the difference between two optimization runs given actual and forecast prices as input.</p>
<hr />
<p>Thanks for reading!</p>
<h1>Mistakes Data Scientists Make</h1>
<p>Adam Green - 2023-02-22</p>
<h1 id="introduction">Introduction</h1>
<p>Patterns exist in the mistakes data scientists make - this article lists some of the most common, made while learning the craft.</p>
<blockquote>
<p>An expert is a person who has made all the mistakes that can be made in a very narrow field.</p>
<p>Niels Bohr</p>
</blockquote>
<p>I’ve learnt from all these mistakes - I hope you can learn from them too.</p>
<h1 id="plot-the-target">Plot the Target</h1>
<p>Prediction separates the data scientist from the data analyst. The data analyst analyzes the past - the data scientist predicts the future.</p>
<p>Using features to predict a target is supervised learning. The target can be either a number (regression) or a category (classification).</p>
<p><strong>Understanding the distribution of the target is a must-do for any supervised learning project</strong>.</p>
<p>The distribution of the target will inform many decisions a data scientist makes, including:</p>
<ul>
<li>what models to consider using</li>
<li>whether scaling is required</li>
<li>if the target has outliers that should be removed</li>
<li>if the target is imbalanced</li>
</ul>
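<p>A quick way to check several of these at once is to summarize the target before any modelling. A minimal sketch - the function name is mine, not from any library:</p>

```python
import pandas as pd


def summarize_target(target: pd.Series) -> None:
    """Print quick diagnostics of a supervised learning target."""
    if pd.api.types.is_numeric_dtype(target):
        # regression - range, spread and shape of the distribution
        print(target.describe())
        print("skew:", target.skew())
    else:
        # classification - check for class imbalance
        print(target.value_counts(normalize=True))
```

Run it on the target column before choosing a model - the output flags outliers, scaling needs and imbalance in one place.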
<h2 id="regression">Regression</h2>
<p>In a regression problem, a data scientist wants to know the following about the target:</p>
<ul>
<li>the minimum & maximum</li>
<li>how normally distributed the target is</li>
<li>if the distribution is multi-modal</li>
<li>if there are outliers</li>
</ul>
<p><strong>A histogram will answer all of these - making it an excellent choice for visualizing the target in regression problems</strong>.</p>
<p>The code below generates a toy dataset - two normal distributions plus repeated outlier values - and plots a histogram:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10000</span><span class="p">),</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10000</span><span class="p">),</span>
<span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">).</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'hist'</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div></div>
<center><img src="/assets/mistakes-data-sci/reg.png" width="50%" /></center>
<p>The histogram shows the two normal distributions and the outliers at -20 and 20 that generated this dataset.</p>
<h2 id="classification">Classification</h2>
<p>In a classification problem, a data scientist wants to know the following about the target:</p>
<ul>
<li>how many classes there are</li>
<li>how balanced are the classes</li>
</ul>
<p><strong>We can answer these questions using a single bar chart</strong>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="s">'awake'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">+</span> <span class="p">[</span><span class="s">'asleep'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">500</span> <span class="o">+</span> <span class="p">[</span><span class="s">'dreaming'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">50</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">data</span><span class="p">).</span><span class="n">value_counts</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">)</span>
</code></pre></div></div>
<center><img src="/assets/mistakes-data-sci/class.png" width="50%" /></center>
<p>The bar chart shows we have three classes, and that the <code class="language-plaintext highlighter-rouge">dreaming</code> class is under-represented.</p>
<h1 id="dimensionality">Dimensionality</h1>
<p><strong>Dimensionality provides structure for understanding the world</strong>. An experienced data scientist learns to see the dimensions of data.</p>
<h2 id="the-value-of-low-dimensional-data">The Value of Low Dimensional Data</h2>
<p>In business, lower dimensional representations are more valuable than high dimensional representations. <strong>Business decisions are made in low dimensional spaces</strong>.</p>
<p>Notice that much of the work of a data scientist is using machine learning to reduce dimensionality:</p>
<ul>
<li>using pixels in an satellite image to predict solar power output,</li>
<li>using wind turbine performance data to estimate the probability of future breakdown,</li>
<li>using customer data to predict customer lifetime value.</li>
</ul>
<p><strong>Each of these outputs can be used by a business in ways the raw data can’t</strong>. Unlike their high dimensional raw data inputs, the lower dimensional outputs can be used to make decisions:</p>
<ul>
<li>solar power output can be used to guide energy trader actions,</li>
<li>a high wind turbine breakdown probability can lead to a maintenance team being sent out,</li>
<li>a low customer lifetime value estimate can lead to less money budgeted for marketing.</li>
</ul>
<p>The above are examples of the interaction between prediction and control. The better you are able to predict the world, the better you can control it.</p>
<p>This is also a working definition of a data scientist - <strong>making predictions that lead to action - actions that change how a business is run</strong>.</p>
<h2 id="the-challenges-of-high-dimensional-data">The Challenges of High Dimensional Data</h2>
<p>The difficulty of working in high dimensional spaces is known as the <strong>curse of dimensionality</strong>.</p>
<p>To understand the curse of dimensionality we need to reason about the <em>space</em> and <em>density</em> of data. We can imagine a dense dataset - a large number of diverse samples within a small space. We can also imagine a sparse dataset - a small number of samples in a large space.</p>
<p>What happens to the density of a dataset as we add dimensions? It becomes less dense, because the data is now more spread out.</p>
<p>However, the decrease of data density with increasing dimensionality is not linear - it’s exponential. <strong>The space becomes exponentially harder to understand as we increase dimensions</strong>.</p>
<p>Why is the increase exponential? Because the new dimension needs to be understood not only in terms of each other dimension (which would be linear) but in terms of the <strong>combination of every other dimension with every other dimension</strong> (which is exponential).</p>
<p>This is the curse of dimensionality - the exponential increase of space as we add dimensions. The code below shows this effect:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">itertools</span>
<span class="k">def</span> <span class="nf">calc_num_combinations</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="n">permutations</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))))</span>
<span class="k">def</span> <span class="nf">test_calc_num_combinations</span><span class="p">():</span>
<span class="s">"""To test it works :)"""</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="p">(((</span><span class="mi">0</span><span class="p">,</span> <span class="p">),</span> <span class="mi">1</span><span class="p">),</span> <span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="mi">2</span><span class="p">),</span> <span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">for</span> <span class="n">data</span><span class="p">,</span> <span class="n">length</span> <span class="ow">in</span> <span class="n">test_data</span><span class="p">:</span>
<span class="k">assert</span> <span class="n">length</span> <span class="o">==</span> <span class="n">calc_num_combinations</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">test_calc_num_combinations</span><span class="p">()</span>
<span class="k">print</span><span class="p">([(</span><span class="n">length</span><span class="p">,</span> <span class="n">calc_num_combinations</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">length</span><span class="p">)))</span> <span class="k">for</span> <span class="n">length</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">11</span><span class="p">)])</span>
<span class="s">"""
[(0, 1),
(1, 1),
(2, 2),
(3, 6),
(4, 24),
(5, 120),
(6, 720),
(7, 5040),
(8, 40320),
(9, 362880),
(10, 3628800)]
"""</span>
</code></pre></div></div>
<p>The larger the size of the space, the more work a machine learning model needs to do to understand it.</p>
<p><strong>This is why adding features with no signal is painful</strong>. Not only does the model need to learn that the feature is noise - it must do so while considering how that noise interacts with every combination of the other columns.</p>
<h2 id="applying-the-curse-of-dimensionality">Applying the Curse of Dimensionality</h2>
<p>Getting a theoretical understanding of dimensionality is step one. <strong>Next is applying it in the daily practice of data science</strong>. Below are a few practical cases where data scientists fail to apply the curse of dimensionality to their own workflow.</p>
<h3 id="too-many-hyperparameters">Too Many Hyperparameters</h3>
<p><strong>Data scientists can waste time doing excessive grid searching</strong> - expensive in both time and compute. The motivation for complex grid searches comes from a good place - the desire for good (or even <em>perfect</em>) hyperparameters.</p>
<p>Yet we now know that adding just one additional search means an exponential increase in models trained - because this new search parameter needs to be tested in combination with every other search parameter.</p>
<p><strong>Another mistake is narrow grid searches</strong> - searching over small ranges of hyperparameters. A logarithmic scale will be more informative than a small linear range:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sklearn.ensemble</span>
<span class="kn">import</span> <span class="nn">sklearn.model_selection</span>

<span class="c1"># this search isn't wide enough
</span><span class="n">useless_search</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">model_selection</span><span class="p">.</span><span class="n">GridSearchCV</span><span class="p">(</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">ensemble</span><span class="p">.</span><span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s">'n_estimators'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">20</span><span class="p">]</span><span class="p">}</span>
<span class="p">)</span>
<span class="c1"># this search is more informative
</span><span class="n">useful_search</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">model_selection</span><span class="p">.</span><span class="n">GridSearchCV</span><span class="p">(</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">ensemble</span><span class="p">.</span><span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s">'n_estimators'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">]</span><span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Different projects require different amounts of grid searching, over both models and their hyperparameters. I find that I often build two grid searching pipelines:</p>
<ul>
<li>one to compare different models (using the best hyperparameters found so far for each)</li>
<li>one to compare different hyperparameters for a single model</li>
</ul>
<p>I’ll start by comparing models in the first pipeline, then do further tuning on a single model in the second. Once a model is reasonably tuned, its best hyperparameters can be put back into the first pipeline.</p>
<p>The fine tuning of a single model often searches over one parameter at a time (two at most). This keeps the runtime short, and helps develop intuition about the effect each hyperparameter has on model performance.</p>
<h3 id="too-many-features">Too Many Features</h3>
<p>A misconception I had as a junior data scientist was that adding features had no cost. Put them all in and let the model figure it out! We can now easily see the naivety of this - more features have an exponential cost.</p>
<p><strong>This misconception came from a fundamental misunderstanding of deep learning</strong>.</p>
<p>Seeing the results in computer vision, where deep neural networks do all the work of feature engineering from raw pixels, I thought that the same would be true of using neural networks on other data. I was making two mistakes here:</p>
<ul>
<li>not appreciating the useful inductive bias of convolutional neural networks</li>
<li>not appreciating the curse of dimensionality</li>
</ul>
<p>We now know there is an exponential cost to adding more features. This should also change how you look at one-hot encoding, which dramatically increases the size of the space a model needs to understand while making the data less dense.</p>
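<p>A minimal sketch of this cost, using a hypothetical high-cardinality <code class="language-plaintext highlighter-rouge">city</code> column (the column name and sizes are made up):</p>

```python
import pandas as pd

# hypothetical data: one categorical column with 1,000 unique values
df = pd.DataFrame({"city": [f"city-{i}" for i in range(1000)]})

# one-hot encoding replaces the single column with 1,000 binary columns
encoded = pd.get_dummies(df, columns=["city"])
print(df.shape, encoded.shape)  # (1000, 1) (1000, 1000)
```

<p>One column becomes a thousand mostly-zero columns - a much larger, much less dense space.</p>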
<h3 id="too-many-metrics">Too Many Metrics</h3>
<p>In data science projects, performance is judged using metrics such as training or test performance.</p>
<p>In industry, a data scientist will choose metrics that align with the goals of the business. Different metrics have different trade-offs - part of a data scientist’s job is to select the metrics that correlate best with the objectives of the business.</p>
<p>However, it’s common for junior data scientists to report a range of different metrics. For example, on a regression problem they might report three metrics:</p>
<ul>
<li>mean absolute error</li>
<li>mean absolute percentage error</li>
<li>root mean squared error</li>
</ul>
<p>Combine this with reporting both test & train error (or test & train error per cross-validation fold), and the number of metrics becomes too many to glance at and make decisions with.</p>
<p><strong>Pick one metric that best aligns with your business goal and stick with it</strong>. Reduce the dimensionality of your metrics so you can take actions with them.</p>
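<p>As a sketch with made-up numbers - if mean absolute error best matches the business goal, report just that one number:</p>

```python
import numpy as np

# hypothetical actuals & predictions - in practice these come from your model
y_true = np.array([10.0, 12.0, 8.0])
y_pred = np.array([11.0, 10.0, 9.0])

# one metric, computed the same way on train & test
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```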
<h2 id="too-many-models">Too Many Models</h2>
<p>Data scientists are lucky to have access to many high quality implementations of models in open source packages such as <code class="language-plaintext highlighter-rouge">scikit-learn</code>.</p>
<p>This can become a problem when data scientists repeatedly train a suite of models without a deliberate reason why these models should be looked at in parallel. Linear models are trained over and over, without ever seeing the light outside a notebook.</p>
<p>Quite often I see a new data scientist train a linear model, an SVM and a random forest. An experienced data scientist will just train a tree based ensemble (a random forest or XGBoost), and focus on using the feature importances to either engineer or drop features.</p>
<p><strong>Why are tree based ensembles a good first model?</strong> A few reasons:</p>
<ul>
<li>they can be used for either regression or classification,</li>
<li>no scaling of target or features required,</li>
<li>training can be parallelized across CPU cores,</li>
<li>they perform well on tabular data,</li>
<li>feature importances are interpretable.</li>
</ul>
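<p>A minimal sketch of this workflow on synthetic data, where only the first of three features carries signal (the data and sizes are made up):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data - only the first feature predicts the target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# importances sum to 1.0 - the two noise features can be dropped
print(model.feature_importances_)
```

<p>The same model family handles classification via <code class="language-plaintext highlighter-rouge">RandomForestClassifier</code>, with no scaling of the features required.</p>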
<h1 id="learning-rate">Learning Rate</h1>
<p>If there is one hyperparameter worth searching over when training neural networks, it is the learning rate (the second is batch size). <strong>Setting the learning rate too high will make training of neural networks unstable</strong> - LSTMs especially. What the learning rate does is quite intuitive - a higher learning rate means faster training.</p>
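<p>The instability is easy to see on a toy problem. A sketch using plain gradient descent on f(w) = w**2, where any learning rate above 1.0 makes each step overshoot the minimum by more than the step before:</p>

```python
def fit(lr, steps=50):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# a moderate learning rate converges towards the minimum at w=0
stable = fit(lr=0.1)
# too high a learning rate overshoots more each step - training diverges
unstable = fit(lr=1.1)
print(stable, unstable)
```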
<p><strong>Batch size is less intuitive</strong> - a smaller batch size will mean high variance gradients, but some of the value of batches is using that variance to break out of local minima. In general, batch size should be as large as possible to improve gradient quality - often it is limited by GPU memory.</p>
<h1 id="where-error-comes-from">Where Error Comes From</h1>
<p>Three sources of error are:</p>
<ul>
<li>sampling error - using statistics estimated on a subset of a larger population,</li>
<li>sampling bias - samples having different probabilities than others,</li>
<li>measurement error - difference between measurement & true value.</li>
</ul>
<p>Actually quantifying these is challenging, often impossible. <strong>However there is still value in thinking qualitatively about the sampling error, sampling bias or measurement error in your data</strong>.</p>
<p>Another useful concept is independent & identically distributed (IID). IID is the assumption that data is:</p>
<ul>
<li>independently sampled (no sampling bias),</li>
<li>identically distributed (no sampling or measurement error).</li>
</ul>
<p>It’s an assumption made in statistical learning about the quality of the distribution and sampling of data - and it’s almost always broken.</p>
<p>Thinking about the differences in sampling & distribution between your training and test data can help improve the generalization of a machine learning model, before it fails to generalize in production.</p>
<h1 id="bias--variance">Bias & Variance</h1>
<p>Prediction error of a supervised learning model has three components - bias, variance and noise.</p>
<p><strong>Bias is a lack of signal</strong> - the model misses relationships that can be used to predict the target. This is underfitting. Bias can be reduced by increasing model capacity (either through more layers / trees, a different architecture or more features).</p>
<p><strong>Variance is confusing noise for signal</strong> - patterns in the training data that will not appear in the data at test time. This is overfitting. Variance can be reduced by adding training data.</p>
<p><strong>Noise is unmanageable</strong> - the best a model can do is avoid it.</p>
<p>The error of a machine learning model is usually due to a combination of all three. Often data scientists will be able to make changes that lead to a trade off between bias & variance. Three common levers a data scientist can pull are:</p>
<ul>
<li>adding model capacity,</li>
<li>reducing model capacity,</li>
<li>adding training data.</li>
</ul>
<h2 id="adding-model-capacity">Adding Model Capacity</h2>
<p>Increasing model capacity will reduce bias, but can increase variance (that additional capacity can be used to fit to noise).</p>
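<p>A sketch of this trade-off using polynomial fits on synthetic data - training error always falls as capacity (the polynomial degree) grows, but the high-degree model is spending that capacity fitting the noise:</p>

```python
import numpy as np

# noisy samples of a sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

# training error shrinks as polynomial degree (model capacity) grows
train_errors = {}
for degree in (1, 3, 15):
    coefs = np.polyfit(x, y, degree)
    train_errors[degree] = np.mean((np.polyval(coefs, x) - y) ** 2)
print(train_errors)
```

<p>On held-out data the degree 15 fit would typically do worse than the degree 3 fit - the extra capacity has been used to memorize noise.</p>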
<h2 id="reducing-model-capacity">Reducing Model Capacity</h2>
<p>Decreasing model capacity (through regularization, dropout or a smaller model) will reduce variance but can increase bias.</p>
<h2 id="adding-data">Adding Data</h2>
<p>More data will reduce variance, because the model has more examples to learn how to separate noise from signal.</p>
<p>More data will have no effect on bias. <strong>More data can even make bias worse</strong>, if the sampling of the additional data is biased (sampling bias).</p>
<p>Additional data sampled with bias will only give your model the chance to be more precise about being wrong - see Chris Fonnesbeck’s talk on <a href="https://www.youtube.com/watch?v=TGGGDpb04Yc">Statistical Thinking for Data Science</a> for more on the relationship between bias, sampling bias and data quantity.</p>
<h1 id="width--depth-of-neural-nets">Width & Depth of Neural Nets</h1>
<p>The reason why junior data scientists obsess over the architecture of fully connected neural networks comes from the process of building them. Constructing a neural network requires defining the architecture - surely it’s important?</p>
<p><strong>Yet when it comes to fully connected neural nets, the architecture isn’t really important</strong>.</p>
<p>As long as you give the model enough capacity and sensible hyperparameters, a fully connected neural network will be able to learn the same function with a variety of architectures. Let your gradients work with the capacity you give them.</p>
<p>Case in point is <em>Trust Region Policy Optimization</em>, which learns locomotion tasks from a flat input vector using a simple, fully connected feedforward network as the policy.</p>
<center><img width="80%" src="/assets/mistakes-data-sci/trpo.png" /></center>
<center><a href="https://arxiv.org/abs/1502.05477">Schulman et al. (2015) Trust Region Policy Optimization</a></center>
<p>A sensible default for a fully connected neural network is a depth of two or three layers, with the width set between 50 and 100 (or 64 to 128, if you want to fit in with the cool computer science folk). If your model has high bias, consider adding capacity through another layer or additional width.</p>
<p>One interesting improvement on the simple fully connected architecture is the wide & deep architecture, <strong>which mixes wide memorization feature interactions with deep unseen, learned feature combinations</strong>.</p>
<center><img src="/assets/mistakes-data-sci/wide-deep.png" /></center>
<center><a href="https://arxiv.org/abs/1606.07792">Cheng et al. (2016) Wide & Deep Learning for Recommender Systems</a></center>
<h1 id="pep-8">PEP 8</h1>
<blockquote>
<p>Programs must be written for people to read, and only incidentally for machines to execute.</p>
<p>Abelson & Sussman - Structure and Interpretation of Computer Programs</p>
</blockquote>
<p>Code style is important. I remember being confused at why more experienced programmers were so particular about code style.</p>
<p><strong>After programming for five years, I now know where they were coming from</strong>.</p>
<p>Code that is laid out in the expected way requires less effort to read & understand. Poor code style places an additional burden on the reader - they must first decode your unique style before they can think about the code itself.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># bad
</span><span class="n">Var</span><span class="o">=</span><span class="mi">1</span>
<span class="k">def</span> <span class="nf">adder</span> <span class="p">(</span> <span class="n">x</span> <span class="o">=</span><span class="mi">10</span> <span class="p">,</span><span class="n">y</span><span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span>
<span class="c1"># good
</span><span class="n">var</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">def</span> <span class="nf">adder</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
</code></pre></div></div>
<p>All good text editors will have a way to integrate in-line linting - highlighting mistakes as you write them. <strong>Automatic, in-line linting is the best way to learn code style</strong> - take advantage of it.</p>
<h1 id="drop-the-target">Drop the Target</h1>
<p>If you ever get a model with an impossibly perfect performance, it is likely that your target is a feature.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># bad
</span><span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'target'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># good
</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'target'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>We all do it once.</p>
<h1 id="scale-the-target-or-features">Scale the Target or Features</h1>
<p>This is the advice I’ve given most when debugging machine learning projects. Whenever I see a high loss (higher than, say, 2 or 3), it’s a clear sign that the target has not been scaled to a reasonable range.</p>
<p>Scale matters because <strong>unscaled targets lead to large prediction errors</strong>, which mean large gradients and unstable learning.</p>
<p>By scaling, I mean either standardization:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">standardized</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>Or normalization:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">normalized</span> <span class="o">=</span> <span class="p">(</span><span class="n">data</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="o">/</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
</code></pre></div></div>
<p>Note that there is a lack of consistency between what these things are called - normalization is also often called min-max scaling, or even standardization!</p>
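<p>The same inconsistency shows up in scikit-learn, where standardization is <code class="language-plaintext highlighter-rouge">StandardScaler</code> and normalization (min-max scaling) is <code class="language-plaintext highlighter-rouge">MinMaxScaler</code>:</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[10.0], [20.0], [30.0]])

# standardization: zero mean, unit standard deviation
standardized = StandardScaler().fit_transform(data).ravel()
# normalization / min-max scaling: mapped onto [0, 1]
normalized = MinMaxScaler().fit_transform(data).ravel()
print(standardized, normalized)
```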
<p>Take the example below, where we are trying to predict how many people attend a talk, from the number of speakers and the start time. Our first pipeline doesn’t scale the features or targets, leading to a large error signal and large gradients:</p>
<center><img src="/assets/mistakes-data-sci/scale1.png" width="900" /></center>
<p>Our second pipeline takes the time to properly scale features & target, leading to an error signal with appropriately sized gradients:</p>
<center><img src="/assets/mistakes-data-sci/scale2.png" width="900" /></center>
<p>A similar logic holds for features - unscaled features can dominate and distort how information flows through a neural network.</p>
<h1 id="work-with-a-sample">Work with a Sample</h1>
<p>This is a small workflow improvement that leads to massive productivity gains.</p>
<p>Development is a continual cycle of running code, hitting errors and fixing them. Developing against a large dataset can cost you time - especially if you’re debugging something that happens at the end of the pipeline.</p>
<p><strong>During development, work on a small subset of the data</strong>. There are a few ways to handle this.</p>
<h2 id="creating-a-subset-of-the-data">Creating a Subset of the Data</h2>
<p>You can work on a sample of your data already in memory, using an integer index:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">]</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">pandas</code> allows you to load only a subset of the data at a time (avoiding pulling the entire dataset into memory):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="controlling-the-debugging">Controlling the Debugging</h2>
<p>A simple way to control this is a variable - this is what you would do in a Jupyter Notebook:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nrows</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="n">nrows</span><span class="p">)</span>
</code></pre></div></div>
<p>Or more cleanly with a command line argument:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># data.py
</span><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--nrows'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">nargs</span><span class="o">=</span><span class="s">'?'</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data.csv'</span><span class="p">,</span> <span class="n">nrows</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">nrows</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'loaded </span><span class="si">{</span><span class="n">data</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s"> rows'</span><span class="p">)</span>
</code></pre></div></div>
<p>Which can be controlled when running the script <code class="language-plaintext highlighter-rouge">data.py</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python data.py <span class="nt">--nrows</span> 1000
</code></pre></div></div>
<h1 id="dont-write-over-raw-data">Don’t Write over Raw Data</h1>
<p>Raw data is holy - it should never be overwritten. The results of any data cleaning should be saved separately from the raw data.</p>
<h1 id="use-home">Use $HOME</h1>
<p>This pattern has dramatically simplified my life.</p>
<p><strong>Managing paths in Python can be tricky</strong>. A few things can change how paths resolve in Python:</p>
<ul>
<li>where the user clones source code,</li>
<li>where a virtual environment installs that source code,</li>
<li>which directory a user runs a script from.</li>
</ul>
<p>Some of the problems that occur are from these changes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">os.path.realpath</code> will change based on where the virtual environment installs your package,</li>
<li><code class="language-plaintext highlighter-rouge">os.getcwd</code> will change based on where the user runs the Python interpreter.</li>
</ul>
<p><strong>Putting data in a fixed, consistent place can avoid these issues</strong> - you don’t ever need to get the directory relative to anything except the user’s <code class="language-plaintext highlighter-rouge">$HOME</code> directory.</p>
<p>The solution is to create a folder in the user’s <code class="language-plaintext highlighter-rouge">$HOME</code> directory, and use it to store data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">home</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'HOME'</span><span class="p">]</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">home</span><span class="p">,</span> <span class="s">'adg'</span><span class="p">)</span>
<span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">'data.npy'</span><span class="p">),</span> <span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<p>This means your work is portable - both on your colleagues’ laptops and on remote machines in the cloud.</p>
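<p>A minimal sketch of the same pattern using <code class="language-plaintext highlighter-rouge">pathlib</code> from the standard library (the folder name <code class="language-plaintext highlighter-rouge">adg</code> follows the example above):</p>

```python
from pathlib import Path

# build a path anchored at the user's home directory
path = Path.home() / "adg"
# create the folder if it doesn't already exist
path.mkdir(parents=True, exist_ok=True)
```

<p><code class="language-plaintext highlighter-rouge">Path.home()</code> resolves the home directory on any operating system, which avoids reading the <code class="language-plaintext highlighter-rouge">$HOME</code> environment variable directly.</p>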
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comBadges of honour for the accomplished data scientist.Daniel C. Dennett’s Four Competences2023-02-21T00:00:00+00:002023-02-21T00:00:00+00:00https://adgefficiency.com/four-competences<p>In <a href="https://en.wikipedia.org/wiki/From_Bacteria_to_Bach_and_Back">From Bacteria to Bach and Back</a> Daniel C. Dennett introduces <strong>four grades of competence</strong>.</p>
<p>They describe four <strong>progressively competent intelligences</strong>. Each competence learns through iterative application of trial and error learning.</p>
<center>
<img src="/assets/world-models/bach-bacteria.jpg" width="50%" />
</center>
<p></p>
<p>The four competences are an invaluable idea for understanding computational control algorithms.</p>
<p>They organize computational control algorithms by asymptotic performance and sample efficiency - the least efficient algorithms have lower limits on performance.</p>
<h3 id="what-is-competence">What is Competence?</h3>
<p><strong>Competence is the ability to act well</strong>. It is the ability of an agent to interact with its environment to achieve goals.</p>
<p><strong>Competence can be contrasted with comprehension, which is the ability to understand</strong>. Together both form a useful decomposition of intelligence.</p>
<p>Competence allows an agent to do control - to interact with a system and produce a desired outcome.</p>
<h3 id="evolutionary-learning">Evolutionary Learning</h3>
<blockquote>
<p>Maybe it would be good for hackers to act more like painters, and regularly start over from scratch</p>
<p>Paul Graham</p>
</blockquote>
<p>Evolutionary learning is trial and error learning.</p>
<p><strong>It is iterative improvement using a generate, test, select loop</strong>:</p>
<ul>
<li><strong>generate</strong> a population, using information from previous steps</li>
<li><strong>test</strong> the population through interaction with the environment</li>
<li><strong>select</strong> population members of the current generation for the next step</li>
</ul>
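<p>The generate, test, select loop can be sketched in a few lines of Python - this is an illustration of the idea only, with a made-up fitness function and hyperparameters:</p>

```python
import random


def evolve(fitness, generations=100, population_size=20, sigma=0.5):
    # generate an initial population
    population = [random.uniform(-10, 10) for _ in range(population_size)]
    for _ in range(generations):
        # test the population through the fitness function
        ranked = sorted(population, key=fitness, reverse=True)
        # select the fittest quarter as parents of the next generation
        parents = ranked[: population_size // 4]
        population = [p + random.gauss(0, sigma) for p in parents for _ in range(4)]
    return max(population, key=fitness)


random.seed(42)
# the optimum of this fitness function is at x = 3
best = evolve(lambda x: -(x - 3) ** 2)
```

<p>The only information carried between generations is which population members survived selection - a weak but substrate independent learning signal.</p>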
<p>It is the driving force in our universe and is substrate independent. It occurs in biological evolution, business, training neural networks, and personal development.</p>
<p>There is much to learn from evolutionary learning:</p>
<ul>
<li>failure at a low level driving improvement at a higher level</li>
<li>the effectiveness of iterative improvement</li>
<li>the need for a dualistic (agent and environment) view for it to work, at odds with the truth of non-duality</li>
</ul>
<p>These are lessons to explore another time - for now, we are focused on the four grades of competence.</p>
<h3 id="comparing-competence">Comparing Competence</h3>
<p>There are several metrics we can use to compare our intelligent agents.</p>
<p><strong>Asymptotic performance measures how an agent performs given unlimited opportunity to sample experience</strong>. It is how good an agent can be in the limit and improves as our agent gains more complex competences.</p>
<p><strong>Sample efficiency measures how much experience an agent needs to achieve a level of performance</strong>. This also improves as our agents get more complex. The importance of sample efficiency depends on compute cost. If compute is cheap, you care less about sample efficiency.</p>
<p>Each of the four agents interacts with the same environment. Interacting with the environment allows an agent to generate data through experience. What the agent does with this data determines how much data it needs. The more an agent squeezes out of each interaction, the less data is required.</p>
<h2 id="the-four-competences">The Four Competences</h2>
<p>The four competences are successive applications of evolutionary learning - this means that each agent has all the abilities of the less competent agents before it.</p>
<center>
<img src="/assets/four-competences/compt.png" width="80%" />
</center>
<p></p>
<h3 id="1-darwinian-competence">1. Darwinian Competence</h3>
<p>The Darwinian agent has pre-designed and fixed competences - it doesn’t improve within its lifetime.</p>
<p>Improvement happens globally via selection that aggregates across the agent’s entire lifetime.</p>
<p>Biological examples include bacteria and viruses. Computational examples include <a href="https://en.wikipedia.org/wiki/Cross-entropy_method">CEM</a>, evolutionary algorithms such as <a href="https://en.wikipedia.org/wiki/CMA-ES">CMA-ES</a> or genetic algorithms.</p>
<h3 id="2-skinnerian-competence">2. Skinnerian Competence</h3>
<p>The Skinnerian agent improves its behaviour by learning to respond to reinforcement. It can improve within its lifetime by learning how to map states and actions to reward signals, such as food or dopamine.</p>
<p>Biological examples include neurons and dogs. Computational examples include model-free reinforcement learning, such as <a href="https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning">DQN</a> or <a href="https://arxiv.org/abs/1710.02298">Rainbow</a>. The GPT series of language models has Skinnerian competence.</p>
<h3 id="3-popperian-competence">3. Popperian Competence</h3>
<p>The Popperian agent learns models of its environment - improvement occurs by offline testing of plans with its environment model.</p>
<p>Biological examples include crows and primates. Computational examples include model-based reinforcement learning such as <a href="https://en.wikipedia.org/wiki/AlphaZero">AlphaZero</a> or <a href="https://worldmodels.github.io/">World Models</a>, and classical optimal control.</p>
<h3 id="4-gregorian-competence">4. Gregorian Competence</h3>
<p>The Gregorian agent builds thinking tools, such as arithmetic, constrained optimization, democracy, and computers. Improvement occurs via systematic exploration and higher-order control of mental searches.</p>
<p>The only biological example we have of a Gregorian intelligence is humans. I do not know of a computational method that builds its own thinking tools. Now that we have introduced our four agents, we can compare them.</p>
<h2 id="comparing-the-four-competences">Comparing the Four Competences</h2>
<p>Darwinian agents improve through selection determined by a single number. For biological evolution, this is how many times an animal has mated.</p>
<p>For computational evolution, this is a fitness, such as average reward per episode. These are both weak learning signals. This accounts for the poor sample efficiency of agents with Darwinian competences.</p>
<p>Compare this with the Skinnerian agent, which can improve both through selection and reinforcement. Being able to respond to reinforcement allows within lifetime learning. It has the ability to learn from the temporal structure of the environment. The Skinnerian agent uses this data to learn functions that predict future rewards.</p>
<p>The Popperian agent can further improve within its lifetime by learning models of its world. Generating data from these models can be used for planning, or to produce low dimensional representations of the environment.</p>
<h2 id="summary">Summary</h2>
<p>Daniel C. Dennett’s four grades of competence describe four progressively competent intelligences, that each learn through successive applications of trial and error learning.</p>
<p>It allows understanding of the asymptotic performance and sample efficiency of learning algorithms and highlights two useful dimensions of intelligent agents - what data they use and what they learn from this data.</p>
<p>Of the most competent of our agents, humans are the only biological examples. We have no computational examples.</p>
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comA useful idea to understand computational control algorithms.Space Between Money and the Planet2023-02-08T00:00:00+00:002023-02-08T00:00:00+00:00https://adgefficiency.com/space-between-money-and-the-planet<p>This study proposes the existence of a <strong>tradeoff between monetary gain and carbon emissions reduction</strong> in the dispatch of electric batteries for arbitrage.</p>
<p>Supporting materials for this work are in <a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">adgefficiency/space-between-money-and-the-planet</a>.</p>
<center>
<img src="/assets/space-between/hero.png" />
</center>
<p><br /></p>
<p>A focus on economic profit is demonstrated to not result in maximum carbon savings. <strong>A focus only on wholesale prices often removes the entire carbon benefit</strong> and leads to a carbon emissions increase.</p>
<p>A calculation of the breakeven carbon price necessary to remove the tradeoff between prices and carbon is performed. <strong>This carbon price represents the price needed to align the world where we optimize for monetary gain with the world where we prioritize carbon reduction</strong>.</p>
<p>The calculation of the breakeven carbon price provides an estimate of the market correction required to reconcile the conflicting objectives of financial and environmental performance in the dispatch of electric batteries for arbitrage.</p>
<h1 id="motivation">Motivation</h1>
<h2 id="the-importance-of-battery-storage">The importance of battery storage</h2>
<p>Battery storage is a key technology of the clean energy transition. Batteries enable low carbon, intermittent renewable generation to replace dirty electricity.</p>
<p><strong>Batteries pose a different set of control problems</strong> than other key energy transition technologies like solar or wind.</p>
<p>A battery makes decisions to charge or discharge based on an imperfect view of the world, with competing objectives and value streams.</p>
<p>Once a wind turbine or solar panel is built, operating that asset is straightforward - you generate as much as you can based on the amount of wind or sun available at that moment. There is no decision to make or opportunity cost to trade off - when the resource is available, you use as much as possible.</p>
<h3 id="arbitrage-of-money-and-carbon">Arbitrage of money and carbon</h3>
<p>A common battery operation strategy is arbitrage - the movement of electricity between periods of high and low value.</p>
<p>In the price arbitrage scenario, a battery wants to purchase cheap electricity and sell it at a higher price. A battery that does the opposite, that charges when electricity prices are high and discharges when they are low, will lose money.</p>
<p>A battery that charges with dirty electricity and discharges when electricity is clean increases carbon emissions. Charging increases the load on a dirtier generator, while discharging decreases the load on a cleaner generator.</p>
<h2 id="tradeoff-between-profit-maximization-and-emissions-minimization">Tradeoff between profit maximization and emissions minimization</h2>
<p>Operating a battery requires making decisions to achieve a goal. Two natural goals for a battery are to maximize profit or save carbon.</p>
<p>A central point of this work is that we cannot rely on optimization driven only by price signals to maximize carbon savings.</p>
<p>This view was shared in 2022 by <a href="https://www.economist.com/leaders/2022/02/12/the-truth-about-dirty-assets">The Economist</a>:</p>
<blockquote>
<p>Many funds claim that there is no trade-off between maximising profits and green investing, which seems unlikely for as long as the externalities created by polluting firms are legal and untaxed.</p>
</blockquote>
<h2 id="the-just-make-money-fallacy">The ‘just make money’ fallacy</h2>
<p>In my career I’ve personally held and often encountered the following perspective:</p>
<blockquote>
<p>Environmentally effective climate action must be economically effective - we need to make money in order to save the planet.</p>
</blockquote>
<p>It’s often backed up with the view that renewables are low variable cost generators, able to bid into electricity markets at lower prices than high variable cost generators (like gas and coal).</p>
<p>This viewpoint (and viewpoints similar to it) are convenient - just make money, ignore the carbon side and you are also saving the planet.</p>
<h1 id="methods">Methods</h1>
<p><a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">Experiment source code is here</a>.</p>
<h2 id="experiment-design">Experiment design</h2>
<ol>
<li>Join raw price and carbon intensity data.</li>
<li>Simulate battery with objectives of:
a. profit maximization,
b. carbon emissions minimization,</li>
<li>Compare the economic and carbon benefits of the two objectives.</li>
</ol>
<h3 id="re-run-the-experiment">Re-run the experiment</h3>
<p>Requires Python 3.10+ - the command <code class="language-plaintext highlighter-rouge">make results</code> will re-run the entire experiment including downloading & joining the raw data and running the simulations for price and carbon objectives:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/ADGEfficiency/space-between
<span class="nv">$ </span><span class="nb">cd </span>space-between
<span class="nv">$ </span>make results
</code></pre></div></div>
<h2 id="signals-and-worlds">Signals and worlds</h2>
<p>The key idea in the methodology is to take the difference between two worlds - a world where we optimize for money, and a world where we optimize for carbon.</p>
<p><strong>In an ideal world, we would be able to operate a battery to both make money and save carbon at the same time.</strong> If clean electricity is cheap and dirty electricity is expensive, we can operate our battery to make money, and know that we will also be saving carbon.</p>
<p><strong>In the opposite world, where dirty electricity is cheap and clean electricity is expensive, there is an opportunity cost to saving carbon</strong>. There would be situations where you would need to reduce the environmental benefit of operating your battery in order to make more money.</p>
<p>Below is a scenario where there is an opportunity cost to saving carbon. We can measure the delta between these two worlds in terms of the two things we care about - money and carbon.</p>
<p>Choosing to prioritize money over carbon means we make <code class="language-plaintext highlighter-rouge">$150</code> more than if we optimized for carbon, but we generate <code class="language-plaintext highlighter-rouge">10 tC</code> more than if we optimized for carbon:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Optimize for Money</th>
<th>Optimize for Carbon</th>
<th>Delta</th>
</tr>
</thead>
<tbody>
<tr>
<td>Money saved $</td>
<td>200</td>
<td>50</td>
<td>150</td>
</tr>
<tr>
<td>Carbon saved tC</td>
<td>10</td>
<td>20</td>
<td>10</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td><strong>Carbon Price $/tC</strong></td>
<td>15</td>
</tr>
</tbody>
</table>
<p>Looking at the delta between our two worlds allows us to calculate a carbon price of <code class="language-plaintext highlighter-rouge">15 $/tC</code>. This carbon price is the ratio of money gained by optimizing for money to the carbon saving gained by optimizing for carbon.</p>
<p><strong>We would be giving the market <code class="language-plaintext highlighter-rouge">$150</code> to balance out what we lose when optimizing for carbon, and receive <code class="language-plaintext highlighter-rouge">10 tC</code> of carbon savings in return for our lost money.</strong></p>
<p>This carbon price would be applied in proportion to the carbon intensity of the electricity produced by each market participant.</p>
<p>This price estimates the level of support (via a revenue neutral carbon tax on electricity market participants - of course!) required to counteract the misalignment between the price and carbon signals and worlds.</p>
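<p>The arithmetic above can be sketched as a short function (the helper name is mine, not from the study’s codebase):</p>

```python
def breakeven_carbon_price(money_world, carbon_world):
    """Each world is a tuple of (money saved $, carbon saved tC)."""
    # money gained by optimizing for money instead of carbon
    money_delta = money_world[0] - carbon_world[0]
    # carbon saved by optimizing for carbon instead of money
    carbon_delta = carbon_world[1] - money_world[1]
    return money_delta / carbon_delta


# numbers from the table above
price = breakeven_carbon_price(money_world=(200, 10), carbon_world=(50, 20))
# -> 15.0 $/tC
```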
<h2 id="data">Data</h2>
<p>This study uses data from the Australian National Electricity Market (NEM) from 2014 to end of 2022.</p>
<p>This experiment uses two signals as input interval data - a price signal and a carbon signal.</p>
<p>The price signal is the 5 minute dispatch prices in South Australia. This is a slightly different dataset than the trading price. Dispatch prices were chosen so that the prices (before and after the transition from 30 to 5 minute trading price settlement) are on the same frequency (5 minutes per interval) as the carbon intensity data.</p>
<p>The carbon signal is the 5 minute NEMDE data and NEM generator carbon intensity in South Australia. The NEMDE dataset has data on the marginal carbon generators, which allows calculation of a marginal carbon intensity.</p>
<h2 id="dependencies">Dependencies</h2>
<p>The main third-party Python dependencies of this work are <code class="language-plaintext highlighter-rouge">pandas</code> for data processing, <code class="language-plaintext highlighter-rouge">matplotlib</code> for plotting and <code class="language-plaintext highlighter-rouge">pulp</code> for linear program solving.</p>
<p>This work depends on <a href="">nem-data</a> - a Python CLI for downloading Australian electricity market data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nemdata</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">nemdata</span><span class="p">.</span><span class="n">download</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">"2020-01"</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">"2020-02"</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="s">"trading-price"</span><span class="p">)</span>
</code></pre></div></div>
<p>This work depends on <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> - a Python library for optimizing the dispatch of energy assets for profit maximization and carbon emissions reduction:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">energypylinear</span> <span class="k">as</span> <span class="n">epl</span>
<span class="c1"># 2.0 MW, 4.0 MWh battery
</span><span class="n">asset</span> <span class="o">=</span> <span class="n">epl</span><span class="p">.</span><span class="n">battery</span><span class="p">.</span><span class="n">Battery</span><span class="p">(</span><span class="n">power_mw</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">capacity_mwh</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">efficiency</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">asset</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
<span class="n">electricity_prices</span><span class="o">=</span><span class="p">[</span><span class="mf">100.0</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">],</span> <span class="n">freq_mins</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div></div>
<h2 id="battery-model">Battery model</h2>
<p>The battery model is a mixed-integer linear program built in PuLP. It optimizes the charge and discharge of a battery with perfect foresight of future prices and marginal carbon intensities. The roundtrip efficiency of the battery is set at 100%.</p>
<p>The only value stream available to the battery is the arbitrage of electricity or carbon from one interval to another. The battery is optimized in monthly blocks with interval data on a 5 minute frequency.</p>
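<p>The study’s model is a mixed-integer linear program in PuLP. As an illustration only, a small dynamic program over a discretized state of charge captures the same perfect-foresight arbitrage idea - the prices here are made up, and this is not the code used in the experiment:</p>

```python
def optimal_arbitrage(prices, power_mw=2.0, capacity_mwh=4.0):
    """Maximum arbitrage profit with perfect foresight and 100% efficiency."""
    # map state of charge (MWh) to the best profit achievable so far
    best = {0.0: 0.0}
    for price in prices:
        nxt = {}
        for soc, profit in best.items():
            # discharge, hold or charge at full power for one hour
            for action in (-power_mw, 0.0, power_mw):
                new_soc = soc + action
                if 0.0 <= new_soc <= capacity_mwh:
                    # charging costs money, discharging earns it
                    candidate = profit - action * price
                    nxt[new_soc] = max(nxt.get(new_soc, float("-inf")), candidate)
        best = nxt
    return max(best.values())


profit = optimal_arbitrage([50.0, 10.0, 100.0, 20.0, 120.0])
# -> 380.0 (charge at $10 and $20, discharge at $100 and $120)
```

<p>The same idea scales to the carbon objective by swapping prices for marginal carbon intensities.</p>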
<h1 id="results">Results</h1>
<p>Download previously generated results with Python 3 using <code class="language-plaintext highlighter-rouge">make pulls3</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/ADGEfficiency/space-between
<span class="nv">$ </span><span class="nb">cd </span>space-between
<span class="nv">$ </span>make pulls3
</code></pre></div></div>
<p>This pulls previously generated results from S3 using the AWS CLI into <code class="language-plaintext highlighter-rouge">./data</code>:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree <span class="nt">-L</span> 3 ./data
├── database.sqlite
├── dataset.parquet
└── results
├── 08cdcee2-a315-49d8-9207-820a5ad4a0de
│ ├── input-interval-data.parquet
│ ├── interval-data.parquet
│ ├── meta.json
│ └── simulation.parquet
├── 0bd87681-d422-491c-9ad2-3afc1503ab6f
│ ├── input-interval-data.parquet
│ ├── interval-data.parquet
│ ├── meta.json
│ └── simulation.parquet
...
└── fab33244-fc60-456a-8377-f5f73c2700d7
├── input-interval-data.parquet
├── interval-data.parquet
├── meta.json
└── simulation.parquet
</code></pre></div></div>
<h2 id="optimize-for-price-or-carbon">Optimize for price or carbon</h2>
<p>The battery model was optimized on one of two objectives - either price or carbon.</p>
<p>Optimizing for price means the battery will import electricity from the grid at low prices and export it during high prices, leading to an economic saving.</p>
<p>Optimizing for carbon means the battery will import electricity from the grid at low marginal carbon intensity and export it during high marginal carbon intensity, leading to a carbon saving.</p>
<p>Below we compare the optimization of battery for these two objectives - the left optimizes a battery for money, on the right optimizing a battery for carbon:</p>
<p><img src="/assets/space-between-2023/panel.png" alt="" /></p>
<center><figcaption>Comparing the optimization for price (left) and carbon (right).</figcaption></center>
<p><br /></p>
<p>We can observe the full use of the battery charge in both the price and carbon arbitrage simulations.</p>
<h2 id="monthly-profit-and-emissions-benefits">Monthly profit and emissions benefits</h2>
<p>We can look at how our simulations are performing across the entire experiment by grouping our simulations by month.</p>
<p>A negative benefit is a loss. Negative profit means losing money, negative carbon benefit means increasing carbon emissions.</p>
<p>The chart below shows the price & carbon benefit from optimizing our battery for price and carbon for each month:</p>
<p><img src="/assets/space-between-2023/monthly-benefit.png" alt="" /></p>
<center><figcaption>Monthly price & carbon benefits when optimizing for price (left) and carbon (right) from 2014 to end of 2022.</figcaption></center>
<p>The table below summarizes the data across the entire experiment:</p>
<table>
<thead>
<tr>
<th style="text-align: left">objective</th>
<th style="text-align: right">negative_profit</th>
<th style="text-align: right">negative_emissions_benefit</th>
<th style="text-align: right">months</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">carbon</td>
<td style="text-align: right">73.1481</td>
<td style="text-align: right">0</td>
<td style="text-align: right">108</td>
</tr>
<tr>
<td style="text-align: left">price</td>
<td style="text-align: right">0</td>
<td style="text-align: right">87.037</td>
<td style="text-align: right">108</td>
</tr>
</tbody>
</table>
<p>When we optimize for money, we have a negative effect on the environment <code class="language-plaintext highlighter-rouge">87%</code> of the time. When we optimize for carbon, we will lose money <code class="language-plaintext highlighter-rouge">73.1%</code> of the time.</p>
<p>These results are dramatic - changing our objective can often completely remove the benefit we see for the alternate objective.</p>
<h2 id="monthly-carbon-price">Monthly carbon price</h2>
<p>What we are interested in is how these two simulations change together - by taking the difference between the two simulations (one for money, the other for carbon), we can measure how far the space is between them.</p>
<p>The chart below shows the data grouped by month, but this time only shows the delta between our two worlds:</p>
<p><img src="/assets/space-between-2023/monthly.png" alt="" /></p>
<center><figcaption>Monthly deltas from 2014 to end of 2022.</figcaption></center>
<p>The three deltas shown above are:</p>
<ul>
<li><strong>price delta</strong> - the difference between the optimize for money and optimize for carbon worlds in thousands of Australian dollars per month,</li>
<li><strong>carbon delta</strong> - the difference between the optimize for money and optimize for carbon worlds in term of tons of carbon savings per month,</li>
<li><strong>monthly carbon price</strong> - the ratio of our price to carbon deltas.</li>
</ul>
<h2 id="annual-carbon-price">Annual carbon price</h2>
<p>The final chart shows the delta between worlds results grouped by year:</p>
<p><img src="/assets/space-between-2023/annual.png" alt="" /></p>
<center><figcaption>Annual deltas from 2014 to end of 2022.</figcaption></center>
<p>We can observe a few things from the chart above:</p>
<ul>
<li>a carbon price of below <code class="language-plaintext highlighter-rouge">80 $/tC</code> would fully correct for the misalignment between price and carbon signals in all years except 2022,</li>
<li>2022 is an outlier due to both an increased price delta (meaning the electricity market was more valuable for batteries) and a lower carbon delta (due to cleaner electricity).</li>
</ul>
<h1 id="discussion">Discussion</h1>
<h2 id="exploring-carbon-prices">Exploring carbon prices</h2>
<p>A key result of this work is the estimation of the breakeven carbon intensity between our two simulated worlds.</p>
<p>A system where our deltas are <code class="language-plaintext highlighter-rouge">$500</code> and <code class="language-plaintext highlighter-rouge">50 tC</code> results in a carbon price of <code class="language-plaintext highlighter-rouge">10 $/tC</code>.</p>
<p>This carbon price implies that if we adjust our market by collecting this <code class="language-plaintext highlighter-rouge">$500</code> through a carbon price applied to all generation, we could incentivize lower carbon generation to be more competitive at the margin.</p>
<p>This carbon price is a break-even carbon price for the battery - it is what we would have to pay the market to offset the lost revenue of <code class="language-plaintext highlighter-rouge">$500</code>.</p>
<h2 id="more-output-metrics">More Output Metrics</h2>
<p>This study stops at the calculation of a carbon delta, which is decreasing over time. This means that even if the carbon price was increasing, the total cost may be decreasing. The total cost is the carbon delta multiplied by the breakeven carbon price.</p>
<h2 id="effect-of-efficiency--forecast-error-on-carbon-price">Effect of efficiency & forecast error on carbon price</h2>
<p>The optimization done in this work is with perfect foresight. Optimizing with perfect foresight allows us to put an upper limit on both money and carbon savings. In reality, a battery will be operated with imperfect foresight of future prices.</p>
<p>Because we are interested in the ratio between carbon & economic savings, taking the ratio of maximum carbon to maximum economic savings is hopefully useful. The assumption is that the relative dispatch error (in % lost carbon or money) is the same for both objectives.</p>
<h2 id="data-1">Data</h2>
<p>This study uses the 5 minute South Australia dispatch price and the 5 minute NEMDE data for a carbon signal.</p>
<p>Using different price and carbon signals will change the results of this study - this isn’t a fatal criticism but it should reinforce that this study is heavily dependent on the choice of data.</p>
<p>We can add to this the generic but always relevant criticism of anything empirical - you can’t use the past to predict the future.</p>
<h2 id="marginal-versus-average-carbon-intensity">Marginal versus average carbon intensity</h2>
<p>The intensity from the NEMDE data is a marginal intensity, supplied by the NEMDE solver as the slack variable for increasing demand.
By using this signal we are assuming that any actions we took would not change how the market is dispatched - this will be true up to a point (the size of the marginal bid).</p>
<p>The marginal carbon intensity is different from the <a href="https://adgefficiency.com/energy-basics-average-vs-marginal-carbon-emissions/">more commonly reported average carbon intensity</a>. It would be interesting to compare these results with different carbon signals.</p>
<p>It does introduce the question of which intensity is relevant for the accounting.</p>
<h2 id="battery-model-1">Battery model</h2>
<p>The battery model applies a constant roundtrip efficiency onto battery export - in reality efficiency is a non-linear function of state of charge, battery age and temperature.</p>
<p>This study uses a battery configuration of 2 MW power rating with 4 MWh of capacity - other batteries have different ratios of power to energy.</p>
<h2 id="single-value-stream">Single value stream</h2>
<p>Batteries often have access to many value streams, such as network charge savings or grid frequency services. This experiment only considers the arbitrage of wholesale electricity.</p>
<p>Including other value streams will change the size of the delta between our two worlds.</p>
<hr />
<p><strong>Thanks for reading!</strong></p>
<p>If you enjoyed the content of this post, check out <a href="https://adgefficiency.com/energy-py-linear-forecast-quality/">Measuring Forecast Quality using Linear Programming</a>, which uses a linear programming battery model to measure the quality of a forecast.</p>
<p>If you enjoyed the style of this post, check out <a href="https://adgefficiency.com/typical-year-forecasting-electricity-prices/">Typical Year Forecasting of Electricity Prices</a>, which shows how to create low-variance forecasts and estimates of energy project performance.</p>
<p>Supporting materials for this work are in <a href="https://github.com/ADGEfficiency/space-between-money-and-the-planet">adgefficiency/space-between-money-and-the-planet</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{green2023spacebetween,
title = "The Space Between Money and the Planet",
author = "Green, Adam Derek",
journal = "adgefficiency.github.io",
year = "2023",
url = "https://adgefficiency.com/space-between-money-and-the-planet/"
}
</code></pre></div></div>Adam Greenadam.green@adgefficiency.comThe opportunity cost for using batteries to reduce carbon emissions.Introducing energy-py-linear2023-01-30T00:00:00+00:002023-01-30T00:00:00+00:00https://adgefficiency.com/intro-energy-py-linear<p>This post introduces <a href="https://github.com/ADGEfficiency/energy-py-linear">energy-py-linear</a> - a Python library for optimizing energy assets using mixed integer linear programming (MILP).</p>
<h2 id="why-linear-programming">Why Linear Programming?</h2>
<p>Linear programming is a popular choice for solving many energy industry problems - many energy systems can be modelled as linear, making them suitable for optimization with linear solvers.</p>
<p>Linear programs have the property that if an optimal solution exists, one exists at a vertex of the feasible region - a corner formed by the constraints. This makes solving linear programs fast in practice. The optimization itself is also deterministic - unlike stochastic gradient descent, it doesn’t rely on randomness.</p>
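<p>A toy illustration of this vertex property (pure numpy, not using a linear solver): for a linear objective over a box-shaped feasible region, no interior point can beat the best vertex:</p>

```python
import itertools

import numpy as np

# maximize c @ x over the box 0 <= x_i <= 1 - a tiny linear program
c = np.array([3.0, -1.0])

# the vertices of the feasible region - the corners of the unit square
vertices = np.array(list(itertools.product([0.0, 1.0], repeat=2)))

# the best vertex achieves the optimum of this linear program
best = vertices[np.argmax(vertices @ c)]
print(best)

# random points inside the box never beat the best vertex
interior = np.random.rand(1000, 2)
assert (interior @ c).max() <= best @ c
```

<p>A real solver exploits this property rather than enumerating corners - the simplex method walks from vertex to vertex, improving the objective at each step.</p>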
<h2 id="what-can-energypylinear-do">What can <code class="language-plaintext highlighter-rouge">energypylinear</code> do?</h2>
<ol>
<li>optimize the dispatch of electric batteries, electric vehicle charging and gas fired CHP generators,</li>
<li>optimize for either price or carbon,</li>
<li>calculate the variance between two simulations.</li>
</ol>
<p>You can find the source code for <code class="language-plaintext highlighter-rouge">energypylinear</code> at <a href="https://github.com/ADGEfficiency/energy-py-linear">ADGEfficiency/energy-py-linear</a>.</p>Adam Greenadam.green@adgefficiency.comA Python library for optimizing energy systems using mixed integer linear programming.A Guide to Deep Learning Layers2023-01-23T00:00:00+00:002023-01-23T00:00:00+00:00https://adgefficiency.com/guide-deep-learning<h1 id="summary">Summary</h1>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Intuition</th>
<th>Inductive Bias</th>
<th>When To Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fully connected</td>
<td>Allow all possible connections</td>
<td>None</td>
<td>Data without structure (tabular data)</td>
</tr>
<tr>
<td>2D convolution</td>
<td>Recognizing spatial patterns</td>
<td>Local, spatial patterns</td>
<td>Data with spatial structure (images)</td>
</tr>
<tr>
<td>LSTM</td>
<td>Database</td>
<td>Sequences & memory</td>
<td>Never - use Attention</td>
</tr>
<tr>
<td>Attention</td>
<td>Focus on similarities</td>
<td>Similarity & limit information flow</td>
<td>Data with sequential structure</td>
</tr>
</tbody>
</table>
<h1 id="introduction">Introduction</h1>
<p><strong>This post is about four fundamental neural network layer architectures</strong> - the building blocks that machine learning engineers use to construct deep learning models.</p>
<p>The four layers are:</p>
<ol>
<li>the fully connected layer,</li>
<li>the 2D convolutional layer,</li>
<li>the LSTM layer,</li>
<li>the attention layer.</li>
</ol>
<p>For each layer we will look at:</p>
<ul>
<li><strong>how each layer works</strong>,</li>
<li>the <strong>intuition</strong> behind each layer,</li>
<li>the <strong>inductive bias</strong> of each layer,</li>
<li>what the <strong>important hyperparameters</strong> are for each layer,</li>
<li><strong>when to use</strong> each layer,</li>
<li><strong>how to program</strong> each layer in TensorFlow 2.0.</li>
</ul>
<p>All code examples are built using <code class="language-plaintext highlighter-rouge">tensorflow==2.2.0</code> using the Keras Functional API.</p>
<h2 id="background---what-is-inductive-bias">Background - what is inductive bias?</h2>
<p>A key term in this article is <strong>inductive bias</strong> - a useful term to sound clever and impress your friends.</p>
<p><strong>Inductive bias is the hard-coding of assumptions into the structure of a learning algorithm</strong>. These assumptions make the method <strong>more special purpose, less flexible but more useful</strong>. By hard coding in assumptions about the structure of the data & task, we can learn functions that we otherwise can’t.</p>
<p>Examples of inductive bias in machine learning include margin maximization (classes should be separated by as large a boundary as possible - used in Support Vector Machines) and nearest neighbours (samples close together in feature space are in the same class - used in the k-nearest neighbours algorithm).</p>
<p><strong>A bit of bias is good</strong> - this is a common lesson in machine learning (bias can be traded off for variance). It also holds in reinforcement learning, where unbiased approximations of a high-variance Monte Carlo return perform worse than biased, bootstrapped temporal difference methods.</p>
<p><br /></p>
<h1 id="1-the-fully-connected-layer">1. The Fully Connected Layer</h1>
<p><strong>The fully connected layer is the most general purpose deep learning layer</strong>.</p>
<p>Also known as a dense or feed-forward layer, this layer imposes the <strong>least amount of structure</strong> of any of our layers. It is found in almost all neural networks - if only to control the size & shape of the output layer.</p>
<h2 id="how-does-the-fully-connected-layer-work">How does the fully connected layer work?</h2>
<p>At the heart of the fully connected layer is the artificial neuron - a distant descendant of McCulloch & Pitts’ <em>Threshold Logic Unit</em> of 1943.</p>
<p><strong>The artificial neuron is inspired by the biological neurons in our brains</strong> - however an artificial neuron is a shallow approximation of the complexity of a biological neuron.</p>
<p>The artificial neuron is composed of three sequential steps:</p>
<ol>
<li>weighted linear combination of inputs,</li>
<li>sum across weighted inputs,</li>
<li>activation function.</li>
</ol>
<h3 id="1-weighted-linear-combination-of-inputs">1. Weighted linear combination of inputs</h3>
<p>The strength of the connection between nodes in different layers is controlled by weights - the shape of these weights depends on the number of nodes in the layers on either side. Each node has an additional parameter known as a bias, which can be used to shift the output of the node independently of its input.</p>
<p>The weights and biases are learnt - in modern machine learning, backpropagation is commonly used to find good values of these weights - good values being those that lead to good predictive accuracy of the network on unseen data.</p>
<h3 id="2-sum-across-all-weighted-inputs">2. Sum across all weighted inputs</h3>
<p>After applying the weight and bias, all of the inputs into the neuron are summed together into a single number.</p>
<h3 id="3-activation-function">3. Activation function</h3>
<p>This is then passed through an activation function. The most important activation functions are:</p>
<ul>
<li><strong>linear</strong> - output unchanged,</li>
<li><strong>ReLu</strong> - $0$ if the input is negative, otherwise the input is unchanged,</li>
<li><strong>Sigmoid</strong> - squashes the input to the range $(0, 1)$,</li>
<li><strong>Tanh</strong> - squashes the input to the range $(-1, 1)$.</li>
</ul>
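<p>These four activations can be written in a few lines of numpy:</p>

```python
import numpy as np


def linear(x):
    # output unchanged
    return x


def relu(x):
    # clip negative inputs to zero
    return np.maximum(0.0, x)


def sigmoid(x):
    # squash into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    # squash into (-1, 1)
    return np.tanh(x)


x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```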
<p>The output of the activation function is input to all neurons (also known as nodes or units) in the next layer.</p>
<p><strong>This is where the fully connected layer gets its name - each node is fully connected to the nodes in the layers before & after it</strong>.</p>
<center><img align="center" src="/assets/four-dl-arch/neuron.png" /></center>
<p align="center"><i>A single neuron with a ReLu activation function</i></p>
<p>For the first layer, each node gets its input from the data being fed into the network (each data point is connected to each node). For the last layer, the output is the prediction of the network.</p>
<center><img align="center" src="/assets/four-dl-arch/dense.png" /></center>
<p align="center"><i>The fully connected layer</i></p>
<h2 id="what-is-the-intuition--inductive-bias-of-a-fully-connected-layer">What is the intuition & inductive bias of a fully connected layer?</h2>
<p>The intuition behind all the connections in a fully connected layer is to put <strong>no restriction on information flow</strong>. It’s the intuition of having no intuition.</p>
<p>The fully connected layer imposes no structure and makes no assumptions about the data or task the network will perform. <strong>A neural network built of fully connected layers can be thought of as a blank canvas</strong> - impose no structure and let the network figure everything out.</p>
<h3 id="universal-approximation-except-in-practice">Universal Approximation (Except in Practice)</h3>
<p><strong>This lack of structure is what gives neural networks of fully connected layers (of sufficient depth & width) the ability to approximate any function</strong> - known as the Universal Approximation Theorem.</p>
<p>The ability to learn any function at first sounds attractive. Why do we need any other architecture if a fully connected layer can learn anything?</p>
<p><strong>Being able to learn in theory does not mean we can learn in practice</strong>. Actually finding the correct weights, using the data and learning algorithms (such as backpropagation) we have available may be impractical and unreachable.</p>
<p>The solution to these practical challenges is to use less specialized layers - layers that have assumptions about the data & task they are expected to perform. <strong>This specialization is their inductive bias</strong>.</p>
<h2 id="when-should-i-use-a-fully-connected-layer">When should I use a fully connected layer?</h2>
<p>A fully connected layer is the most general deep learning architecture - it imposes no constraints on the connectivity of each layer.</p>
<p><strong>Use it when your data has no structure that you can take advantage of</strong> - if your data is a flat array (common in tabular data problems), then a fully connected layer is a good choice. Most neural networks will have fully connected layers somewhere.</p>
<p>Fully connected layers are common in reinforcement learning when learning from a flat environment observation.</p>
<p>For example, a network with a single fully connected layer is used in the Trust Region Policy Optimization (TRPO) paper from 2015:</p>
<center><img align="center" width="50%" src="/assets/mistakes-data-sci/trpo.png" /></center>
<p align="center"><i>A fully connected layer being used to power the reinforcement learning algorithm TRPO</i></p>
<p>Fully connected layers are also common as the penultimate & final layers of convolutional neural networks performing classification. The number of units in the fully connected output layer will be equal to the number of classes, with a softmax activation function used to create a distribution over classes.</p>
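<p>A numpy sketch of the softmax used in that output layer - it turns arbitrary scores into a distribution over classes:</p>

```python
import numpy as np


def softmax(logits):
    # subtract the max for numerical stability - doesn't change the result
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# scores for a 3 class problem
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # one probability per class
print(probs.sum())  # always sums to 1.0
```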
<h2 id="what-hyperparameters-are-important-for-a-fully-connected-layer">What hyperparameters are important for a fully connected layer?</h2>
<p>The two hyperparameters you’ll often set in a fully connected layer are the:</p>
<ol>
<li><strong>number of nodes</strong>,</li>
<li><strong>activation function</strong>.</li>
</ol>
<p>A fully connected layer is defined by a number of nodes (also known as units), each with an activation function. While you could have a layer with different activation functions on different nodes, most of the time each node in a layer has the same activation function.</p>
<p>For hidden layers, the <strong>most common choice of activation function is the rectified-linear unit (the ReLu)</strong>. For the output layer, the correct activation function depends on what the network is predicting:</p>
<ul>
<li>regression, target can be positive or negative -> linear (no activation),</li>
<li>regression, target can be positive only -> ReLu,</li>
<li>classification -> Softmax,</li>
<li>control action, bound between -1 & 1 -> Tanh.</li>
</ul>
<h2 id="using-fully-connected-layers-with-the-keras-functional-api">Using fully connected layers with the Keras Functional API</h2>
<p>Below is an example of how to use a fully connected layer with the Keras functional API.</p>
<p>We are using input data shaped like an image, to show the flexibility of the fully connected layer - this requires us to use a <code class="language-plaintext highlighter-rouge">Flatten</code> layer later in the network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="c1"># the least random of all random seeds
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 samples, 32x32 with 3 channels
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">Flatten</span><span class="p">()(</span><span class="n">hidden</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[ 0.23494382, -0.40392348],
[ 0.10658629, -0.31808627],
[ 0.42371386, -0.46299127],
[ 0.34416917, -0.11493915]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p><br /></p>
<h1 id="2-the-2d-convolutional-layer">2. The 2D Convolutional Layer</h1>
<p><strong>If you had to pick one architecture as the most important in deep learning, it’s hard to look past convolution</strong> (see what I did there?).</p>
<p>The winner of the 2012 ImageNet competition, AlexNet, is seen by many as the start of modern deep learning. AlexNet was a deep convolutional neural network, trained on GPU to classify images.</p>
<p>An earlier landmark use of convolution is LeNet-5 in 1998, a 7-layer convolutional neural network developed by Yann LeCun to classify handwritten digits.</p>
<p><strong>The convolutional neural network is the original workhorse of the modern deep learning revolution</strong> - it can be used with text, audio, video and images.</p>
<p>Convolutional neural networks can be used to classify the contents of the image, recognize faces and create captions for images. They are also easy to parallelize on GPU - making them fast to train.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-convolutional-layers">What is the intuition and inductive bias of convolutional layers?</h2>
<p>Convolution itself is a mathematical operation, commonly used in signal processing. The 2D convolutional layer is inspired by our own visual cortex.</p>
<p>The history of using convolution in artificial neural networks goes back decades to the neocognitron, an architecture introduced by Kunihiko Fukushima in 1980, inspired by the work of Hubel & Wiesel.</p>
<p>Work by the neurophysiologists Hubel & Wiesel in the 1950s showed that individual neurons in the visual cortexes of mammals are activated by small regions of vision.</p>
<center><img align="center" width="50%" src="/assets/four-dl-arch/hubel.jpg" /></center>
<p align="center"><i>Hubel & Wiesel</i></p>
<p><strong>A good mental model for convolution is the process of sliding a filter over a signal, at each point checking to see how well the filter matches the signal</strong>.</p>
<p>This checking process is pattern recognition, and is the intuition behind convolution - looking for small, spatial patterns anywhere in a larger space. <strong>The convolution layer has inductive bias for recognizing local, spatial patterns</strong>.</p>
<h2 id="how-does-a-2d-convolution-layer-work">How does a 2D convolution layer work?</h2>
<p>A 2D convolutional layer is defined by the interaction between two components:</p>
<ol>
<li>a 3D image, with shape <code class="language-plaintext highlighter-rouge">(height, width, color channels)</code>,</li>
<li>a 2D filter, with shape <code class="language-plaintext highlighter-rouge">(height, width)</code>.</li>
</ol>
<p>The intuition of convolution is looking for patterns in a larger space.</p>
<p><strong>In a 2D convolutional layer, the patterns we are looking for are filters, and the larger space is an image</strong>.</p>
<h3 id="filters">Filters</h3>
<p><strong>A convolutional layer is defined by its filters</strong>. These filters are learnt - they are equivalent to the weights of a fully connected layer.</p>
<p>Filters in the first layers of a convolutional neural network detect simple features such as lines or edges. Deeper in the network, filters can detect more complex features that help the network perform its task.</p>
<p>To further understand how these filters work, let’s work with a small image and two filters. The basic operation in a convolutional neural network is to use these filters to detect patterns in the image, by performing element-wise multiplication and summing the result:</p>
<center><img align="center" width="75%" src="/assets/four-dl-arch/filters.png" /></center>
<p align="center"><i>Applying different filters to a small image</i></p>
<p><strong>Reusing the same filters over the entire image allows features to be detected in any part of the image - a property known as translation invariance</strong>. This property is ideal for classification - you want to detect a cat no matter where it occurs in the image.</p>
<p>For full-size images (often <code class="language-plaintext highlighter-rouge">32x32</code> or larger), this same basic operation is performed, with the filter passed over the entire image. The output of this operation acts as feature detection for the filters the network has learnt, producing a 2D feature map.</p>
<center><img align="center" src="/assets/four-dl-arch/conv.png" /></center>
<p align="center"><i>A filter producing a filter map by convolving over an image</i></p>
<p>The feature maps produced by each filter are concatenated, resulting in a 3D volume (the length of the third dimension being the number of filters).</p>
<p>The next layer then performs convolution over this new volume, using a new set of learned filters.</p>
<center><img align="center" width="75%" src="/assets/four-dl-arch/map.png" /></center>
<p align="center"><i>The feature maps of multiple filters are concatenated to produce a volume, which is passed to the next layer.</i></p>
<h2 id="2d-convolutional-neural-network-built-using-the-keras-functional-api">2D convolutional neural network built using the Keras Functional API</h2>
<p>Below is an example of how to use a 2D convolution layer with the Keras functional API:</p>
<ul>
<li>the <code class="language-plaintext highlighter-rouge">Flatten</code> layer before the dense layer, to flatten our volume produced by the 2D convolutional layer,</li>
<li>the <code class="language-plaintext highlighter-rouge">Dense</code> layer size of <code class="language-plaintext highlighter-rouge">8</code> - this controls how many classes our network can predict.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Flatten</span><span class="p">,</span> <span class="n">Conv2D</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 images, 32x32 with 3 color channels
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">Conv2D</span><span class="p">(</span><span class="n">filters</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">Flatten</span><span class="p">()(</span><span class="n">conv</span><span class="p">)</span>
<span class="n">feature_map</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)(</span><span class="n">flat</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.39803684, -0.08939186],
[-0.48165476, -0.28876644],
[-0.32680377, -0.24380796],
[-0.45394567, -0.28233868]], dtype=float32)>
"""</span>
</code></pre></div></div>
<h2 id="what-hyperparameters-are-important-for-a-convolutional-layer">What hyperparameters are important for a convolutional layer?</h2>
<p>The important hyperparameters in a convolutional layer are:</p>
<ul>
<li>the number of filters,</li>
<li>filter size,</li>
<li>activation function,</li>
<li>strides,</li>
<li>padding,</li>
<li>dilation rate.</li>
</ul>
<p>The <strong>number of filters</strong> determines how many patterns each layer can learn. It’s common to have the number of filters increasing with the depth of the network. Filter size is commonly set to <code class="language-plaintext highlighter-rouge">(3, 3)</code>, with a ReLu as the activation function.</p>
<p><strong>Strides can be used to skip</strong> steps in the convolution, resulting in smaller feature maps. Padding allows pixels on the edge of the image to act as if they were in the middle of an image. Dilation allows the filters to operate over a larger area of the image, while still producing feature maps of the same size.</p>
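<p>The effect of these hyperparameters on feature map size can be checked with the standard convolution output-size formula (as used by the major deep learning frameworks):</p>

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding=0, dilation=1):
    # standard formula for the spatial size of a convolution's output
    effective_kernel = dilation * (kernel_size - 1) + 1
    return math.floor((input_size + 2 * padding - effective_kernel) / stride) + 1


print(conv_output_size(32, 3))                         # 30 - no padding shrinks the map
print(conv_output_size(32, 3, padding=1))              # 32 - padding preserves the size
print(conv_output_size(32, 3, stride=2))               # 15 - strides downsample
print(conv_output_size(32, 3, padding=2, dilation=2))  # 32 - dilation widens the view
```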
<h2 id="when-should-i-use-a-convolutional-layer">When should I use a convolutional layer?</h2>
<p>Convolution works when your data has a spatial structure - for example, images have spatial structure in height & width. You can also get this structure from a 1D signal using techniques such as Fourier Transforms, and then perform convolution in the frequency domain.</p>
<p><strong>If you are working with images, convolution is king</strong>. While there is work applying attention-based models to computer vision, convolution’s similarity with our own visual cortex means it is likely to remain relevant for many years to come.</p>
<p>An example of using convolution occurs in DeepMind’s 2015 DQN work. The agent learns to take decisions using pixels - making convolution a strong choice:</p>
<p><img src="/assets/ml_energy/conv.png" alt="" /></p>
<p align="center"><i>Deep convolutional neural network used in the 2015 DeepMind DQN Atari work</i></p>
<p>So what other kinds of structure can data have, other than spatial? Many types of data have a sequential structure - motivating our next two layer architectures.</p>
<p><br /></p>
<h1 id="3-lstm-layer">3. LSTM Layer</h1>
<p>The third of our layers is the LSTM, or Long Short-Term Memory layer. The LSTM is recurrent and <strong>processes data as a sequence</strong>.</p>
<p>Recurrence allows a network to experience the temporal structure of data, such as words in a sentence, or time of day.</p>
<p>A normal neural network receives a single input tensor $x$ and generates a single output tensor $y$. A recurrent architecture differs from a non-recurrent neural network in two ways:</p>
<ol>
<li>both the input $x$ & output $y$ data is <strong>processed as a sequence of timesteps</strong>,</li>
<li>the network has the <strong>ability to remember</strong> information and pass it to the next timestep.</li>
</ol>
<p>The memory of a recurrent architecture is known as the <strong>hidden state</strong> $h$. What the network chooses to pass forward in the hidden state is learnt by the network.</p>
<center><img align="center" src="/assets/four-dl-arch/recurr.png" /></center>
<p align="center"><i>A recurrent neural network</i></p>
<h3 id="entering-the-timestep-dimension">Entering the timestep dimension</h3>
<p>Working with recurrent architectures requires being comfortable with the idea of a <strong>timestep dimension</strong> - knowing how to shape your data correctly is half the battle of working with recurrence.</p>
<p>Imagine we have input data $x$, that is a sequence of integers <code class="language-plaintext highlighter-rouge">[0, 0] -> [2, 20] -> [4, 40]</code>. If we were using a fully connected layer, we could present this data to the network as a flat array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># array([[ 0, 0, 2, 20, 4, 40, 6, 60, 8, 80]])
</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (1, 10)
</span></code></pre></div></div>
<p>Although the sequence is obvious to us, it’s not obvious to a fully connected layer.</p>
<p><strong>All a fully connected layer would see is a list of numbers - the sequential structure would need to be learnt by the network</strong>.</p>
<p>We can restructure our data $x$ to explicitly model this sequential structure, by adding a timestep dimension. <strong>The values in our data do not change - only the shape changes</strong>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">20</span><span class="p">)]).</span><span class="n">T</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
array([[[ 0, 0],
[ 2, 20],
[ 4, 40],
[ 6, 60],
[ 8, 80]]])
"""</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (1, 5, 2)
</span></code></pre></div></div>
<p><strong>Our data $x$ is now structured with three dimensions</strong> - <code class="language-plaintext highlighter-rouge">(batch, timesteps, features)</code>. A recurrent neural network will process the features one timestep at a time, experiencing the sequential structure of the data.</p>
<p>Now that we understand how to structure data to be used with a recurrent neural network, we can take a high-level look at how the LSTM layer works.</p>
<h2 id="how-does-an-lstm-layer-work">How does an LSTM layer work?</h2>
<p>The LSTM was first introduced in 1997 and has formed the backbone of modern sequence-based deep learning models, excelling at challenging tasks such as machine translation. For years the state of the art in machine translation was the seq2seq model, which is powered by the LSTM.</p>
<p>The LSTM is a specific type of recurrent neural network. <strong>The LSTM addresses a challenge that vanilla recurrent neural networks struggled with - the ability to think long term</strong>.</p>
<p>In a recurrent neural network all information passed to the next time step has to fit in a single channel, the hidden state $h$.</p>
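<p>As a rough sketch (the weights and sizes here are made up for illustration, not taken from any library), a single vanilla RNN step shows how everything carried forward must squeeze through the single hidden state $h$:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# one timestep of a vanilla RNN: all memory flows through h
n_features, n_hidden = 2, 8
W = rng.normal(size=(n_hidden, n_features))  # input weights
U = rng.normal(size=(n_hidden, n_hidden))    # recurrent weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h):
    # the only channel carried to the next timestep is h
    return np.tanh(W @ x_t + U @ h + b)

h = np.zeros(n_hidden)
for x_t in np.array([[0, 0], [2, 20], [4, 40]]):
    h = rnn_step(x_t, h)

print(h.shape)
# (8,)
```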
<p><strong>The LSTM addresses the long term memory problem by using two hidden states</strong>, known as the hidden state $h$ and the cell state $c$. Having two channels allows the LSTM to remember over both long and short time horizons.</p>
<p>Internally the LSTM makes use of three gates to control the flow of information:</p>
<ol>
<li>forget gate to determine what information to delete,</li>
<li>input gate to determine what to remember,</li>
<li>output gate to determine what to predict.</li>
</ol>
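<p>A minimal NumPy sketch of one LSTM timestep makes the role of the three gates concrete - the weights and sizes here are illustrative only:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# one LSTM timestep, with illustrative sizes: 2 features, 8 units
n_in, n_units = 2, 8
Wf, Wi, Wo, Wc = (rng.normal(size=(n_units, n_in + n_units)) for _ in range(4))
bf = bi = bo = bc = np.zeros(n_units)

def lstm_step(x_t, h, c):
    z = np.concatenate([x_t, h])
    f = sigmoid(Wf @ z + bf)              # forget gate: what to delete from c
    i = sigmoid(Wi @ z + bi)              # input gate: what to remember
    o = sigmoid(Wo @ z + bo)              # output gate: what to expose as h
    c = f * c + i * np.tanh(Wc @ z + bc)  # long term cell state
    h = o * np.tanh(c)                    # short term hidden state
    return h, c

h, c = np.zeros(n_units), np.zeros(n_units)
h, c = lstm_step(np.array([2.0, 20.0]), h, c)
print(h.shape, c.shape)
# (8,) (8,)
```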
<p>One important architecture that uses LSTMs is seq2seq. The source sentence is fed through an encoder LSTM to generate a fixed length context vector. A second decoder LSTM takes this context vector and generates the target sentence.</p>
<center><img align="center" src="/assets/four-dl-arch/seq2seq.png" /></center>
<p align="center"><i>The seq2seq model</i></p>
<p>For a deeper look at the internals of the LSTM, take a look at the excellent <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a> from colah’s blog.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-an-lstm">What is the intuition and inductive bias of an LSTM?</h2>
<p>A good intuitive model for the LSTM layer is to think about it like a database. <strong>The output, input and forget gates allow the LSTM to work like a database</strong> - matching the <code class="language-plaintext highlighter-rouge">GET</code>, <code class="language-plaintext highlighter-rouge">POST</code> & <code class="language-plaintext highlighter-rouge">DELETE</code> of a REST API, or the create, read and delete operations of a CRUD application.</p>
<p>The forget gate acts like a <code class="language-plaintext highlighter-rouge">DELETE</code>, allowing the LSTM to remove information that isn’t useful. The input gate acts like a <code class="language-plaintext highlighter-rouge">POST</code>, where the LSTM can choose information to remember. The output gate acts like a <code class="language-plaintext highlighter-rouge">GET</code>, where the LSTM chooses what to send back to a user request for information.</p>
<p>A recurrent neural network has an inductive bias for processing data as a sequence, and for storing a memory. The LSTM adds to this a bias for maintaining separate long term and short term memory channels.</p>
<h2 id="using-an-lstm-layer-with-the-keras-functional-api">Using an LSTM layer with the Keras Functional API</h2>
<p>Below is an example of how to use an LSTM layer with the Keras functional API:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">LSTM</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># dataset of 4 samples, 3 timesteps, 32 features
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.06428523, 0.3131591 ],
[-0.04120642, 0.3528567 ],
[-0.04273851, 0.37192333],
[ 0.03797218, 0.33612275]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p>You’ll notice we only get one output for each of our four samples - where are the other two timesteps? To get these, we need to use <code class="language-plaintext highlighter-rouge">return_sequences=True</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">out</span><span class="p">)</span>
<span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="s">"""
<tf.Tensor: shape=(4, 3, 2), dtype=float32, numpy=
array([[[-0.08234972, 0.12292314],
[-0.05217044, 0.19100665],
[-0.06428523, 0.3131591 ]],
[[ 0.0381453 , 0.26402596],
[ 0.04725918, 0.34620702],
[-0.04120642, 0.3528567 ]],
[[-0.21114576, 0.08922277],
[-0.02972354, 0.24037611],
[-0.04273851, 0.37192333]],
[[-0.06888272, -0.01702049],
[ 0.0117887 , 0.10608622],
[ 0.03797218, 0.33612275]]], dtype=float32)>
"""</span>
</code></pre></div></div>
<p>It’s also common to want to access the hidden states of the LSTM - this can be done using the argument <code class="language-plaintext highlighter-rouge">return_state=True</code>.</p>
<p>We now get back three tensors - the output of the network, the LSTM hidden state and the LSTM cell state. The shape of the hidden states is equal to the number of units in the LSTM:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">lstm</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span> <span class="o">=</span> <span class="n">LSTM</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">return_state</span><span class="o">=</span><span class="bp">True</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">)(</span><span class="n">lstm</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">out</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span><span class="p">])</span>
<span class="n">out</span><span class="p">,</span> <span class="n">hstate</span><span class="p">,</span> <span class="n">cstate</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">hstate</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (4, 8)
</span>
<span class="k">print</span><span class="p">(</span><span class="n">cstate</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># (4, 8)
</span></code></pre></div></div>
<p>If you wanted to access the hidden states at each timestep, then you can combine these two and use both <code class="language-plaintext highlighter-rouge">return_sequences=True</code> and <code class="language-plaintext highlighter-rouge">return_state=True</code>.</p>
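<p>A minimal sketch of combining both flags, with input shapes mirroring the earlier examples - note that the final element of the returned sequence is equal to the final hidden state:</p>

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM

tf.random.set_seed(42)
x = np.random.rand(4, 3, 32)

inp = Input(shape=x.shape[1:])
# hidden state at every timestep, plus the final hidden & cell states
seq, hstate, cstate = LSTM(8, return_sequences=True, return_state=True)(inp)
mdl = Model(inputs=inp, outputs=[seq, hstate, cstate])
seq, hstate, cstate = mdl(x)

print(seq.shape)     # (4, 3, 8) - hidden state for each of the 3 timesteps
print(hstate.shape)  # (4, 8) - final hidden state, equal to seq[:, -1]
print(cstate.shape)  # (4, 8) - final cell state
```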
<h2 id="what-hyperparameters-are-important-for-an-lstm-layer">What hyperparameters are important for an LSTM layer?</h2>
<p>For an LSTM layer, the main hyperparameter is the number of units, which determines both the capacity of the layer and the size of the hidden state.</p>
<p>While not a hyperparameter, it can be useful to include gradient clipping when working with LSTMs, to deal with exploding gradients that can occur from the backpropagation through time. It is also common to use lower learning rates to help manage gradients.</p>
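<p>As an illustrative sketch (the values here are placeholders, not recommendations), gradient clipping can be enabled directly on a Keras optimizer via <code class="language-plaintext highlighter-rouge">clipnorm</code>, alongside a lower learning rate:</p>

```python
import tensorflow as tf

# clip the global gradient norm to 1.0, with a conservative learning rate
opt = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
```

The optimizer can then be passed to <code class="language-plaintext highlighter-rouge">model.compile</code> as usual.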
<h2 id="when-should-i-use-an-lstm-layer">When should I use an LSTM layer?</h2>
<p>In 2023, the answer to this is never. If you have the kind of sequential data that suits an LSTM, you should look at using attention instead.</p>
<p>When working with sequence data, an LSTM (or its close cousin the GRU) used to be the best choice. <strong>One major downside of the LSTM is that it is slow to train</strong>, as <strong>the error signal must be backpropagated through time</strong>. Backpropagating through an LSTM cannot be parallelized.</p>
<p>One useful feature of the LSTM is the learnt hidden state. This can be used by other models as a compressed representation of the future - such as in the <a href="https://adgefficiency.com/world-models/">2017 World Models paper</a>.</p>
<p><br /></p>
<h1 id="4-attention-layer">4. Attention Layer</h1>
<p>Attention is the youngest of our four layers.</p>
<p><strong>Since its introduction in 2015, attention has revolutionized natural language processing</strong>. Attention powers some of the most breathtaking achievements in deep learning, such as the GPT-X series of language models.</p>
<p>First used in combination with the LSTM based seq2seq model, attention powers the Transformer - a neural network architecture that forms the backbone of modern language models.</p>
<p><strong>Attention acts as a sequence model without recurrence</strong> - by avoiding the need to do backpropagation through time, attention can be parallelized on a GPU, which makes it fast to train.</p>
<h2 id="what-is-the-intuition-and-inductive-bias-of-attention-layers">What is the intuition and inductive bias of attention layers?</h2>
<p>Attention is a simple and powerful idea - when processing a sequence, we should choose which parts of the sequence to take information from. The intuition is simple - <strong>some parts of a sequence are more important than others</strong>.</p>
<p>Take the example of machine translation, to translate the German sentence <code class="language-plaintext highlighter-rouge">Ich bin eine Maschine</code> into the English <code class="language-plaintext highlighter-rouge">I am a machine</code>.</p>
<p>When predicting the last word in the translation <code class="language-plaintext highlighter-rouge">machine</code>, all of our attention should be placed on the last word of the source sentence <code class="language-plaintext highlighter-rouge">Maschine</code>. There is no point looking at earlier words in the source sequence when translating this token.</p>
<p>If we take a more complex example of translating the German <code class="language-plaintext highlighter-rouge">Ich habe ein bisschen Deutsch gelernt</code> into the English <code class="language-plaintext highlighter-rouge">I have learnt a little German</code>. When predicting the third token of our English sentence (<code class="language-plaintext highlighter-rouge">learnt</code>), attention should be placed on the last token of the German sentence (<code class="language-plaintext highlighter-rouge">gelernt</code>).</p>
<center><img align="center" src="/assets/four-dl-arch/trans.png" /></center>
<p>So what inductive bias does our attention layer give us? <strong>One inductive bias of attention is alignment based on similarity</strong> - the attention layer chooses where to look based on how similar things are.</p>
<p><strong>Another inductive bias of attention is to limit & prioritize information flow</strong>. As we will see below, the use of a softmax forces an attention layer to make tradeoffs about information flow - more weight in one place means less in another.</p>
<p>There is no such restriction in a fully connected layer, where increasing one weight does not affect another. A fully connected layer allows information to flow between all nodes in subsequent layers, and could in theory learn the same patterns an attention layer does. We know by now, however, that what is possible in theory does not always occur in practice.</p>
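<p>This tradeoff is easy to demonstrate numerically - because softmax weights must sum to one, raising one alignment score necessarily lowers the other attention weights:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([1.0, 1.0, 1.0])
w_before = softmax(scores)  # equal scores -> equal attention weights

scores[0] += 2.0            # raise one alignment score...
w_after = softmax(scores)   # ...and the other weights must fall

print(w_before, w_after)
```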
<h2 id="how-does-an-attention-layer-work">How does an attention layer work?</h2>
<p>The attention layer receives <strong>three inputs</strong>:</p>
<ol>
<li><strong>query</strong> = what we are looking for,</li>
<li><strong>key</strong> = what we compare the query with,</li>
<li><strong>value</strong> = what we place attention over.</li>
</ol>
<p>The attention layer can be thought of as <strong>three mechanisms in sequence</strong>:</p>
<ol>
<li><strong>alignment</strong> (or similarity) of a query and keys</li>
<li><strong>softmax</strong> to convert the alignment into a probability distribution</li>
<li><strong>selecting keys</strong> based on the alignment</li>
</ol>
<center><img align="center" src="/assets/four-dl-arch/attention.png" /></center>
<p align="center"><i>The three steps in an attention layer - alignment, softmax & key selection</i></p>
<p>Different attention layers (such as Additive Attention or Dot-Product Attention) use different mechanisms in the alignment step. The softmax & key selection steps are common to all attention layers.</p>
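<p>The three steps can be sketched in a few lines of NumPy, using a dot-product for the alignment step - all the shapes here are illustrative:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(42)
query = rng.normal(size=(4,))     # what we are looking for
keys = rng.normal(size=(5, 4))    # what we compare the query with
values = rng.normal(size=(5, 3))  # what we place attention over

scores = keys @ query             # 1. alignment via similarity
weights = softmax(scores)         # 2. softmax -> probability distribution
output = weights @ values         # 3. weighted selection of values

print(output.shape)
# (3,)
```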
<h3 id="query-key-and-value">Query, key and value</h3>
<p>In the same way that understanding the time-step dimension is a key step in understanding recurrent neural networks, understanding what the query, key & value mean is foundational in attention.</p>
<p>A good analogy is with the Python dictionary. Let’s start with a simple example, where we:</p>
<ul>
<li><strong>look up a query</strong> of <code class="language-plaintext highlighter-rouge">dog</code></li>
<li>to <strong>match with keys</strong> of <code class="language-plaintext highlighter-rouge">dog</code> or <code class="language-plaintext highlighter-rouge">cat</code> with values of <code class="language-plaintext highlighter-rouge">1</code> or <code class="language-plaintext highlighter-rouge">2</code> respectively</li>
<li>and <strong>select the value</strong> of <code class="language-plaintext highlighter-rouge">2</code> based on this lookup of <code class="language-plaintext highlighter-rouge">dog</code></li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="s">'dog'</span>
<span class="c1"># keys = 'cat', 'dog', values = 1, 2
</span><span class="n">database</span> <span class="o">=</span> <span class="p">{</span><span class="s">'cat'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'dog'</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span>
<span class="n">database</span><span class="p">[</span><span class="n">query</span><span class="p">]</span>
<span class="c1"># 2
</span></code></pre></div></div>
<p>In the above example, we find an exact match for our query <code class="language-plaintext highlighter-rouge">'dog'</code>. However, in a neural network, <strong>we are not working with strings - we are working with tensors</strong>. Our query, keys and values are all tensors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">]</span>
<span class="c1"># keys = [0, 0], [0, 1] values = [0], [1]
</span><span class="n">database</span> <span class="o">=</span> <span class="p">{[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]:</span> <span class="p">[</span><span class="mi">1</span><span class="p">]}</span>
</code></pre></div></div>
<p>Now we don’t have an exact match for our query - <strong>instead of an exact match, we can calculate a similarity</strong> (i.e. an alignment) between our query and keys, and return the closest value:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">database</span><span class="p">.</span><span class="n">similarity</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="c1"># [1]
</span></code></pre></div></div>
<p>Small technicality - often the keys are set equal to the values. This simply means that the quantity we are doing the similarity comparison with is also the quantity we will place attention over.</p>
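<p>To make this concrete, here is a hypothetical NumPy sketch of the <code class="language-plaintext highlighter-rouge">database.similarity</code> idea - a dot-product measures similarity, and a softmax blends the values by that similarity:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

query = np.array([0.0, 0.9])
keys = np.array([[0.0, 0.0], [0.0, 1.0]])
values = np.array([[0.0], [1.0]])

weights = softmax(keys @ query)  # similarity of the query to each key
result = weights @ values        # values blended by similarity
print(weights, result)
```

Because the query <code class="language-plaintext highlighter-rouge">[0, 0.9]</code> is closer to the key <code class="language-plaintext highlighter-rouge">[0, 1]</code>, most of the weight falls on its value <code class="language-plaintext highlighter-rouge">[1]</code>.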
<h2 id="attention-mechanisms">Attention mechanisms</h2>
<p>By now we know that an attention layer involves three steps:</p>
<ol>
<li><strong>alignment</strong> based on similarity,</li>
<li><strong>softmax</strong> to create attention weights,</li>
<li><strong>choosing values</strong> based on attention.</li>
</ol>
<p>The second & third steps are common to all attention layers - <strong>the differences all occur in the first step - how the alignment (similarity) is computed</strong>.</p>
<p>We will briefly look at two popular mechanisms - Additive Attention and Dot-Product Attention. For a more detailed look at these mechanisms, have a look at the excellent <a href="https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html">Attention? Attention!</a> by Lilian Weng.</p>
<h3 id="additive-attention">Additive Attention</h3>
<p>The first use of attention (known as Bahdanau or Additive Attention) addressed one of the limitations of the seq2seq model - the use of a fixed length context vector.</p>
<p>As explained in the LSTM section, the basic process in a seq2seq model is to encode the source sentence into a fixed length context vector. The issue is that all of the information from the encoder must pass through this fixed length context vector - information from the entire source sequence is squeezed through it between the encoder & decoder.</p>
<p>In Bahdanau et al. 2015, Additive Attention is used to learn an alignment between all the encoder hidden states and the decoder hidden state. As the sequence is processed, the output of this alignment is used in the decoder to predict the next token.</p>
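<p>The additive alignment score is $v_a^T \tanh(W_a s + U_a h)$ for a decoder state $s$ and encoder state $h$ - below is a NumPy sketch with illustrative sizes:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# illustrative sizes: decoder state of 8, 5 encoder states of 16
d_dec, d_enc, d_att = 8, 16, 32
Wa = rng.normal(size=(d_att, d_dec))
Ua = rng.normal(size=(d_att, d_enc))
va = rng.normal(size=(d_att,))

s = rng.normal(size=(d_dec,))    # current decoder hidden state
h = rng.normal(size=(5, d_enc))  # all encoder hidden states

# additive (Bahdanau) alignment: score = va . tanh(Wa s + Ua h)
scores = np.tanh((Wa @ s) + h @ Ua.T) @ va
weights = softmax(scores)        # attention over the 5 encoder states
context = weights @ h            # context vector fed to the decoder
print(weights.shape, context.shape)
# (5,) (16,)
```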
<h2 id="dot-product-attention">Dot-Product Attention</h2>
<p>A second type of attention is Dot-Product Attention - the alignment mechanism used in the Transformer. Instead of addition, Dot-Product Attention uses the dot-product (computed via matrix multiplication) to measure similarity between the query and the keys.</p>
<p>The dot-product acts like a similarity measure between the query & the keys - below is a small program that plots both the dot-product and the cosine similarity for random data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.spatial.distance</span> <span class="kn">import</span> <span class="n">cosine</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="n">data</span><span class="p">[</span><span class="s">'cosine'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">cosine</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>  <span class="c1"># scipy's cosine() is a distance, so convert to similarity</span>
<span class="n">data</span><span class="p">[</span><span class="s">'dot'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'cosine'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s">'dot'</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'cosine'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'dot-product'</span><span class="p">)</span>
</code></pre></div></div>
<center><img align="center" width="80%" src="/assets/four-dl-arch/cosine-dot-product.png" /></center>
<p align="center"><i>The relationship between the cosine similarity and the dot-product of random vectors</i></p>
<h2 id="implementing-a-single-attention-head-with-the-keras-functional-api">Implementing a Single Attention Head with the Keras Functional API</h2>
<p>Dot-Product Attention is important as it forms part of the Transformer. As you can see in the figure below, the Transformer uses multiple heads of Scaled Dot-Product Attention.</p>
<center><img align="center" width="40%" src="/assets/four-dl-arch/head.png" /></center>
<p align="center"><i>The multi-head attention layer used in the Transformer</i></p>
<p>The code below demonstrates the mechanics for a single head without scaling - see
<a href="https://www.tensorflow.org/tutorials/text/transformer">Transformer Model for Language Understanding</a> for a full implementation of a multi-head attention layer & Transformer in Tensorflow 2.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="n">qry</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">q_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">k_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">v_in</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">capacity</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">q_in</span><span class="p">)</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">k_in</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">)(</span><span class="n">v_in</span><span class="p">)</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">transpose_b</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">attention</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">attention</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="n">mdl</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">q_in</span><span class="p">,</span> <span class="n">k_in</span><span class="p">,</span> <span class="n">v_in</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">score</span><span class="p">,</span> <span class="n">attention</span><span class="p">,</span> <span class="n">output</span><span class="p">])</span>
<span class="n">sc</span><span class="p">,</span> <span class="n">attn</span><span class="p">,</span> <span class="n">out</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">([</span><span class="n">qry</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'query shape </span><span class="si">{</span><span class="n">qry</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'score shape </span><span class="si">{</span><span class="n">sc</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'attention shape </span><span class="si">{</span><span class="n">attn</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'output shape </span><span class="si">{</span><span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="s">"""
query shape (4, 16, 32)
score shape (4, 16, 1)
attention shape (4, 16, 1)
output shape (4, 16, 4)
"""</span>
</code></pre></div></div>
<p>This architecture also works with a different length query (now length <code class="language-plaintext highlighter-rouge">8</code> rather than <code class="language-plaintext highlighter-rouge">16</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">qry</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">sc</span><span class="p">,</span> <span class="n">attn</span><span class="p">,</span> <span class="n">out</span> <span class="o">=</span> <span class="n">mdl</span><span class="p">([</span><span class="n">qry</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'query shape </span><span class="si">{</span><span class="n">qry</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'score shape </span><span class="si">{</span><span class="n">sc</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'attention shape </span><span class="si">{</span><span class="n">attn</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'output shape </span><span class="si">{</span><span class="n">out</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="s">"""
query shape (4, 8, 32)
score shape (4, 8, 1)
attention shape (4, 8, 1)
output shape (4, 8, 4)
"""</span>
</code></pre></div></div>
<h2 id="what-hyperparameters-are-important-in-an-attention-layer">What hyperparameters are important in an attention layer?</h2>
<p>When using attention heads as shown above, hyperparameters to consider are:</p>
<ul>
<li>size of the linear layers used to transform the query, values & keys</li>
<li>the type of attention mechanism (such as additive or dot-product)</li>
<li>how to scale the alignment scores before the softmax (commonly by the square root of the key dimension)</li>
</ul>
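<p>As a minimal numpy sketch (separate from the Keras model above) of dot-product attention with the square-root scaling applied before the softmax:</p>

```python
import numpy as np


def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    # scale the alignment scores by the square root of the key dimension
    d_k = k.shape[-1]
    score = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    attention = softmax(score, axis=-1)
    return attention @ v


# same shapes as the example above - batch of 4, query length 16
q = np.random.rand(4, 16, 32)
k = np.random.rand(4, 1, 32)
v = np.random.rand(4, 1, 32)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 16, 32)
```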
<h2 id="when-should-i-use-an-attention-layer">When should I use an attention layer?</h2>
<p>Attention layers should be considered for <strong>any sequence problem</strong>. Unlike recurrent neural networks, they can be easily parallelized, making training fast. Fast training means either cheaper training, or more training for the same amount of compute.</p>
<p>The Transformer is a sequence model without recurrence (it doesn’t use an LSTM), allowing it to be trained without backpropagation through time.</p>
<p>One additional benefit of an attention layer is being able to use the alignment scores for interpretability - similar to how we can use the hidden state in an LSTM as a representation of the sequence.</p>
<p><br /></p>
<h1 id="summary-1">Summary</h1>
<p>I hope you enjoyed this post and found it useful! Below is a short table summarizing the article:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Intuition</th>
<th>Inductive Bias</th>
<th>When To Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fully connected</td>
<td>Allow all possible connections</td>
<td>None</td>
<td>Data without structure (tabular data)</td>
</tr>
<tr>
<td>2D convolution</td>
<td>Recognizing spatial patterns</td>
<td>Local, spatial patterns</td>
<td>Data with spatial structure (images)</td>
</tr>
<tr>
<td>LSTM</td>
<td>Database</td>
<td>Sequences & memory</td>
<td>Never - use Attention</td>
</tr>
<tr>
<td>Attention</td>
<td>Focus on similarities</td>
<td>Similarity & limit information flow</td>
<td>Data with sequential structure</td>
</tr>
</tbody>
</table>
<hr />
<p><strong>Thanks for reading!</strong></p>
<p>If you enjoyed this post, check out <a href="https://adgefficiency.com/ai-ml-dl/">Artificial Intelligence, Machine Learning and Deep Learning</a>.</p>Adam Greenadam.green@adgefficiency.comExplaining the fully connected, convolutional, LSTM and attention deep learning layer architectures.A Hackers Guide to AEMO & NEM Data2022-12-10T00:00:00+00:002022-12-10T00:00:00+00:00https://adgefficiency.com/hackers-aemo<p>This is a short guide to the electricity grid & market data supplied by the Australian Energy Market Operator (AEMO) for the Australian National Electricity Market (NEM).</p>
<p>The NEM is Australia’s electricity grid covering Queensland, New South Wales, Victoria, South Australia, and Tasmania.</p>
<h1 id="participant-infomation--carbon-intensities">Participant Information & Carbon Intensities</h1>
<p>Market participant information in the NEM is given in the <a href="https://www.aemo.com.au/-/media/Files/Electricity/NEM/Participant_Information/NEM-Registration-and-Exemption-List.xls">NEM Registration and Exemption List</a>:</p>
<p><img src="/assets/hacker_aemo/nem-reg.png" alt="" /></p>
<p>The carbon intensities for generators are given in the <a href="http://www.nemweb.com.au/Reports/CURRENT/CDEII/CO2EII_AVAILABLE_GENERATORS.CSV">Available Generators CDEII file</a>:</p>
<p><img src="/assets/hacker_aemo/nem-carbon.png" alt="" /></p>
<p>Both of these files are linked by a Dispatchable Unit Identifier (DUID), which identifies each generating unit.</p>
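<p>A sketch of joining the two files on DUID with pandas - the rows and column names here are illustrative stand-ins, so check the actual file headers before relying on them:</p>

```python
import pandas as pd

# illustrative stand-ins for the two AEMO files - check the real headers
registration = pd.DataFrame({
    'DUID': ['BW03', 'BW04'],
    'Participant': ['AGL', 'AGL'],
})
carbon = pd.DataFrame({
    'DUID': ['BW03', 'BW04'],
    'CO2E_EMISSIONS_FACTOR': [0.93, 0.94],
})

# the DUID links participant information to carbon intensity
joined = registration.merge(carbon, on='DUID', how='inner')
print(joined)
```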
<h1 id="interval-data">Interval Data</h1>
<p>Interval data for the NEM is provided by two sources: the NEM Dispatch Engine (<a href="http://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/">NEMDE</a>) and the Market Management System Data Model (<a href="http://nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/">MMSDM</a>).</p>
<h2 id="nemde">NEMDE</h2>
<p>The NEMDE dataset provides information about how the grid is dispatched and prices are set (including information about the marginal generator) in the <code class="language-plaintext highlighter-rouge">NemPriceSetter</code> XML files.</p>
<p>Data for each day is provided in a single ZIP file (<a href="https://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/2022/NEMDE_2022_01/NEMDE_Market_Data/NEMDE_Files/NemPriceSetter_20220101_xml.zip">NemPriceSetter_20220101_xml.zip</a>), which contains many XML files:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># NemPriceSetter_20220101_xml/NEMPriceSetter_2022010100100.xml
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"LBBG1"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"6"</span> <span class="na">Increase=</span><span class="s">"1"</span> <span class="na">RRNBandPrice=</span><span class="s">"23.7"</span> <span class="na">BandCost=</span><span class="s">"23.7"</span> <span class="nt">/></span>
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"BW04"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"1"</span> <span class="na">Increase=</span><span class="s">"-0.47368"</span> <span class="na">RRNBandPrice=</span><span class="s">"1"</span> <span class="na">BandCost=</span><span class="s">"-0.473684"</span> <span class="nt">/></span>
<span class="nt"><PriceSetting</span> <span class="na">PeriodID=</span><span class="s">"2022-01-01T04:05:00+10:00"</span> <span class="na">RegionID=</span><span class="s">"NSW1"</span> <span class="na">Market=</span><span class="s">"Energy"</span> <span class="na">Price=</span><span class="s">"87.69011"</span> <span class="na">Unit=</span><span class="s">"BW03"</span> <span class="na">DispatchedMarket=</span><span class="s">"R5RE"</span> <span class="na">BandNo=</span><span class="s">"1"</span> <span class="na">Increase=</span><span class="s">"-0.52632"</span> <span class="na">RRNBandPrice=</span><span class="s">"1"</span> <span class="na">BandCost=</span><span class="s">"-0.526316"</span> <span class="nt">/></span>
</code></pre></div></div>
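<p>These attributes can be pulled out with Python's standard library - a sketch using two of the elements above, wrapped here in a dummy root element (the real files have their own document structure):</p>

```python
import xml.etree.ElementTree as ET

# two PriceSetting elements from the file above, wrapped in a dummy root
xml = """<root>
<PriceSetting PeriodID="2022-01-01T04:05:00+10:00" RegionID="NSW1" Market="Energy" Price="87.69011" Unit="LBBG1" DispatchedMarket="R5RE" BandNo="6" Increase="1" RRNBandPrice="23.7" BandCost="23.7" />
<PriceSetting PeriodID="2022-01-01T04:05:00+10:00" RegionID="NSW1" Market="Energy" Price="87.69011" Unit="BW04" DispatchedMarket="R5RE" BandNo="1" Increase="-0.47368" RRNBandPrice="1" BandCost="-0.473684" />
</root>"""

# each element's attributes become one dict - easy to load into a DataFrame
rows = [el.attrib for el in ET.fromstring(xml).iter('PriceSetting')]
print(rows[0]['Unit'], rows[0]['Price'])  # LBBG1 87.69011
```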
<h2 id="mmsdm">MMSDM</h2>
<p>The MMSDM provides both actual data and forecasts for a range of variables - including prices, demand and electricity flows.</p>
<p>Data in the MMSDM is supplied from three different, overlapping sources:</p>
<ul>
<li><a href="http://www.nemweb.com.au/REPORTS/CURRENT/">CURRENT</a> - last 24 hours,</li>
<li><a href="http://www.nemweb.com.au/REPORTS/ARCHIVE/">ARCHIVE</a> - last 13 months,</li>
<li><a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/">MMSDM</a> - from 2009 until the end of last month.</li>
</ul>
<p>Some report names can be different across sources - for example <code class="language-plaintext highlighter-rouge">DISPATCH_SCADA</code> versus <code class="language-plaintext highlighter-rouge">UNIT_SCADA</code>.</p>
<h2 id="price-structure">Price Structure</h2>
<p>The settlement price in the NEM is known as the <strong>trading price</strong> - it is the price that matters for what generators get paid and what customers pay.</p>
<p>Historically (before October 2021) it was settled on a 30 minute basis, as the average of the six 5 minute <strong>dispatch prices</strong> in the same interval. Since October 2021 the NEM has settled on a 5 minute basis, with the trading price equal to the dispatch price.</p>
<h2 id="aemo-timestamping">AEMO Timestamping</h2>
<p><strong>AEMO timestamps intervals with the time at the end of the interval</strong>. This means that <code class="language-plaintext highlighter-rouge">01/01/2018 14:00</code> refers to the time period <code class="language-plaintext highlighter-rouge">01/01/2018 13:30 - 01/01/2018 14:00</code>. This is true for columns like <code class="language-plaintext highlighter-rouge">SETTLEMENTDATE</code>, which refer to an interval. Columns like <code class="language-plaintext highlighter-rouge">LASTCHANGED</code>, which refer to a single instant in time, are not affected.</p>
<p>I prefer shifting the AEMO time stamp backwards by one step of the index frequency (i.e. 5 minutes). This allows the following to be true:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dispatch_prices</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:30'</span><span class="p">:</span> <span class="s">'01/01/2018 14:00'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="n">trading_price</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:30'</span><span class="p">]</span>
</code></pre></div></div>
<p>The shifting also allows easier alignment with external data sources such as weather, which is usually stamped with the timestamp at the beginning of the interval.</p>
<p>If the AEMO timestamp is not shifted, then the following is true:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dispatch_prices</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 13:35'</span><span class="p">:</span> <span class="s">'01/01/2018 14:05'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="o">==</span> <span class="n">trading_price</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'01/01/2018 14:00'</span><span class="p">]</span>
</code></pre></div></div>
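<p>A sketch of this shift in pandas, using made-up dispatch prices - after shifting backwards by 5 minutes, resampling to 30 minutes reproduces the historical trading price:</p>

```python
import pandas as pd

# six 5 minute dispatch prices, stamped by AEMO with the interval end
idx = pd.date_range('2018-01-01 13:35', periods=6, freq='5min')
dispatch_prices = pd.Series([100.0, 110, 90, 105, 95, 100], index=idx)

# shift backwards one step so each stamp marks the interval start
dispatch_prices.index = dispatch_prices.index - pd.Timedelta(minutes=5)

# the 30 minute trading price is the average of the six dispatch prices
trading_price = dispatch_prices.resample('30min').mean()
print(trading_price.loc['2018-01-01 13:30'])  # 100.0
```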
<h2 id="useful-mmsdm-reports">Useful MMSDM Reports</h2>
<p>All examples below are for <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/">MMSDM May 2018</a>:</p>
<p><img src="/assets/hacker_aemo/mmsdm.png" alt="" /></p>
<h3 id="actual-data">Actual Data</h3>
<ul>
<li>trading price (30 & 5 min electricity price) - TRADINGPRICE - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_TRADINGPRICE_201805010000.zip">PUBLIC_DVD_TRADINGPRICE_201805010000.zip</a>,</li>
<li>dispatch price (5 min electricity price) - DISPATCHPRICE - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHPRICE_201805010000.zip">PUBLIC_DVD_DISPATCHPRICE_201805010000.zip</a>,</li>
<li>generation of market participants - UNIT_SCADA - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_201805010000.zip">PUBLIC_DVD_DISPATCH_UNIT_SCADA_201805010000.zip</a>,</li>
<li>market participant bid volumes - BIDPEROFFER - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_BIDPEROFFER_201805010000.zip">PUBLIC_DVD_BIDPEROFFER_201805010000.zip</a>,</li>
<li>market participant bid prices - BIDDAYOFFER - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_BIDDAYOFFER_201805010000.zip">PUBLIC_DVD_BIDDAYOFFER_201805010000.zip</a>,</li>
<li>demand - DISPATCHREGIONSUM - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHREGIONSUM_201805010000.zip">PUBLIC_DVD_DISPATCHREGIONSUM_201805010000.zip</a>,</li>
<li>interconnectors - INTERCONNECTORRES - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCHINTERCONNECTORRES_201805010000.zip">PUBLIC_DVD_DISPATCHINTERCONNECTORRES_201805010000.zip</a>.</li>
</ul>
<h3 id="forecasts">Forecasts</h3>
<ul>
<li>trading price forecast - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/PREDISP_ALL_DATA/PUBLIC_DVD_PREDISPATCHPRICE_201805010000.zip">PUBLIC_DVD_PREDISPATCHPRICE_201805010000.zip</a>,</li>
<li>dispatch price forecast - <a href="http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2018/MMSDM_2018_05/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_P5MIN_REGIONSOLUTION_201805010000.zip">PUBLIC_DVD_P5MIN_REGIONSOLUTION_201805010000.zip</a>.</li>
</ul>
<h1 id="ecosystem">Ecosystem</h1>
<p>A major benefit of the large & open dataset shared by AEMO is the ecosystem tools built on top of it.</p>
<h2 id="nem-data"><a href="https://github.com/ADGEfficiency/nem-data">nem-data</a></h2>
<p>A simple CLI for downloading NEMDE & MMSDM data - created & maintained by yours truly:</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip <span class="nb">install </span>nem-data
<span class="gp">$</span><span class="w"> </span>nemdata <span class="nt">--table</span> trading-price <span class="nt">--start</span> 2020-01 <span class="nt">--end</span> 2020-12
</code></pre></div></div>
<h2 id="nemosis"><a href="https://github.com/UNSW-CEEM/NEMOSIS">NEMOSIS</a></h2>
<p>A Python package for downloading historical data published by the Australian Energy Market Operator (AEMO):</p>
<div class="language-shell-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>pip <span class="nb">install </span>nemosis
</code></pre></div></div>
<p>Use in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nemosis</span> <span class="kn">import</span> <span class="n">dynamic_data_compiler</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="s">'2017/01/01 00:00:00'</span>
<span class="n">end_time</span> <span class="o">=</span> <span class="s">'2017/01/01 00:05:00'</span>
<span class="n">table</span> <span class="o">=</span> <span class="s">'DISPATCHPRICE'</span>
<span class="n">raw_data_cache</span> <span class="o">=</span> <span class="s">'C:/Users/your_data_storage'</span>
<span class="n">price_data</span> <span class="o">=</span> <span class="n">dynamic_data_compiler</span><span class="p">(</span><span class="n">start_time</span><span class="p">,</span> <span class="n">end_time</span><span class="p">,</span> <span class="n">table</span><span class="p">,</span> <span class="n">raw_data_cache</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="aemo-dashboard---interactive-map"><a href="https://www.aemo.com.au/Electricity/National-Electricity-Market-NEM/Data-dashboard">AEMO Dashboard</a> - <a href="http://www.aemo.com.au/aemo/apps/visualisations/map.html">interactive map</a></h2>
<p><img src="/assets/hacker_aemo/aemo_dashboard.png" alt="" /></p>
<h2 id="electricity-map"><a href="https://www.electricitymap.org/">Electricity Map</a></h2>
<p><img src="/assets/hacker_aemo/elect_map.png" alt="" /></p>
<h2 id="aremi"><a href="https://nationalmap.gov.au/renewables/">AREMI</a></h2>
<p><img src="/assets/hacker_aemo/aremi.png" alt="" /></p>
<h2 id="nem-log"><a href="http://nemlog.com.au/">NEM Log</a></h2>
<p><img src="/assets/hacker_aemo/nemlog.png" alt="" /></p>
<h2 id="open-nem"><a href="https://opennem.org.au/#/all-regions">Open NEM</a></h2>
<p><img src="/assets/hacker_aemo/opennem.png" alt="" /></p>
<h2 id="nem-sight"><a href="http://analytics.com.au/energy-analysis/nemsight-trading-tool/">NEM Sight</a></h2>
<p><img src="/assets/hacker_aemo/nemsight.png" alt="" /></p>
<h2 id="gas--coal-watch"><a href="https://cdn.knightlab.com/libs/timeline3/latest/embed/index.html?source=1k0rmFKexrYUBbHSb2opLO2y-f3lGx2vOUsx8uIFygro&font=Default&lang=en&start_at_end=true&initial_zoom=2&height=650">Gas & Coal Watch</a></h2>
<p><img src="/assets/hacker_aemo/gas_coal_watch.png" alt="" /></p>
<h1 id="further-reading">Further Reading</h1>
<ul>
<li><a href="https://www.aemo.com.au/Electricity/National-Electricity-Market-NEM">NEM on the AEMO website</a>,</li>
<li><a href="https://energy.unimelb.edu.au/news-and-events/news/winds-of-change-an-analysis-of-recent-changes-in-the-south-australian-electricity-market">Winds of change: An analysis of recent changes in the South Australian electricity market - University of Melbourne</a>,</li>
<li><a href="https://eprints.qut.edu.au/98895/">Li, Zili (2016) Topics in deregulated electricity markets. PhD thesis, Queensland University of Technology</a>,</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3126673">Dungey et. al (2018) Strategic Bidding of Electric Power Generating Companies: Evidence from the Australian National Energy Market</a>.</li>
</ul>
<hr />
<p>Thanks for reading!</p>Adam Greenadam.green@adgefficiency.comA simple guide to data provided by AEMO for Australia's National Electricity Market (NEM).Typical Year Forecasting of Electricity Prices2022-12-04T00:00:00+00:002022-12-04T00:00:00+00:00https://adgefficiency.com/typical-year-forecasting-electricity-prices<p><strong>Energy prices are volatile</strong> - the price of gas, oil and electricity can all change significantly year on year. Yet the energy industry <strong>ignores this year on year volatility</strong> when modelling investment decisions in energy projects.</p>
<p>This exposes projects to a significant source of <strong>hidden error</strong> in the form of variance in financial model results, leading to the wrong projects being built.</p>
<p>This post introduces a simple solution to this problem in the form of a <strong>typical year forecast</strong>.</p>
<p>You can find supporting materials for this work at <a href="https://github.com/ADGEfficiency/typical-year-forecasting-electricity-prices">adgefficiency/typical-year-forecasting-electricity-prices</a>.</p>
<h1 id="what-is-a-typical-year-forecast">What is a Typical Year Forecast?</h1>
<p>A <strong>typical year forecast</strong> uses historical data to create a <strong>single, synthetic year of data</strong>.</p>
<p>This single year forecast is suitable for use in <strong>business case modelling of energy projects</strong> - it’s not suitable for short term dispatch of energy assets.</p>
<p>A typical year forecast has the following advantages:</p>
<ul>
<li><strong>simple to create</strong> - no machine learning, gradients or iterative calculations,</li>
<li><strong>interpretable</strong> - easy to understand why one sample is selected over others,</li>
<li><strong>realistic</strong> - the forecast is made from real historical data,</li>
<li><strong>domain flexible</strong> - can be used with any time series,</li>
<li><strong>statistically flexible</strong> - can use a range of statistics to define what typical means.</li>
</ul>
<p>A typical year forecast has the following disadvantages:</p>
<ul>
<li><strong>data quantity</strong> - requires at least 2 years of historical data,</li>
<li><strong>domain knowledge</strong> - requires selection & weighting of statistics.</li>
</ul>
<p>An example of a typical year forecast is a <strong>typical meteorological year</strong> (TMY) forecast, used to create a dataset of a typical year of weather. TMY forecasts are commonly used in modelling solar generation or building energy use.</p>
<p>The idea & inspiration for this post came from using the <a href="https://solcast.com/tmy">TMY forecast produced by Solcast</a> - thanks Solcast for the inspiration!</p>
<h1 id="the-problem-with-the-standard-industry-approach">The Problem with the Standard Industry Approach</h1>
<p>Estimating the economic performance (simple payback, IRR, NPV or rate of return on capital) of an investment in an energy project requires combining two models - <strong>a technical model and a financial model</strong>.</p>
<p>Commonly the technical model will model a single year in isolation, and is used as an input to the financial model.</p>
<p>The financial model covers <strong>multiple years</strong> (to capture economic return over time), using the technical results as the basis for the first year, with financial inputs (such as prices) forecast forward from that single year.</p>
<p>In the absence of forecasted energy prices across the future project lifetime, <strong>energy prices are often modelled in a similar way to the technical model</strong> - taking a single reference year of prices and forecasting them forward with assumptions of inflation.</p>
<p>A simple example of how a technical & financial model combine is given below:</p>
<ul>
<li>a technical model outputs annual savings of <code class="language-plaintext highlighter-rouge">150 MWh</code> of electricity,</li>
<li>we assume electricity prices at <code class="language-plaintext highlighter-rouge">100 $/MWh</code></li>
<li>capital investment is estimated at <code class="language-plaintext highlighter-rouge">$ 25,000</code>.</li>
</ul>
<p>The technical inputs & price assumptions are then forecast forward (here without inflation) to calculate cumulative savings:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: right">capex</th>
<th style="text-align: right">savings_mwh</th>
<th style="text-align: right">price</th>
<th style="text-align: right">savings_$</th>
<th style="text-align: right">cumulative_savings_$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">0</td>
<td style="text-align: right">25000</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">-10000</td>
</tr>
<tr>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">5000</td>
</tr>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">20000</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: right">0</td>
<td style="text-align: right">150</td>
<td style="text-align: right">100</td>
<td style="text-align: right">15000</td>
<td style="text-align: right">35000</td>
</tr>
</tbody>
</table>
<p>It’s not common to see both the project capex and savings in the same year (usually you need to build something before it gives a saving) - for this simple example please forgive this!</p>
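<p>A minimal sketch of this financial model in Python, reproducing the cumulative savings column above:</p>

```python
from itertools import accumulate

# inputs from the example above
capex = 25_000
savings_mwh = 150
price = 100  # $/MWh, no inflation assumed

# capex is paid in year 0, alongside the first year of savings
annual_saving = savings_mwh * price
cashflows = [annual_saving - capex] + [annual_saving] * 3
cumulative = list(accumulate(cashflows))
print(cumulative)  # [-10000, 5000, 20000, 35000]
```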
<h2 id="why-using-the-most-recent-prices-is-wrong">Why Using The Most Recent Prices is Wrong</h2>
<p>Choosing the reference year for prices is commonly done by:</p>
<ul>
<li>taking the most recent prices,</li>
<li>taking the most recent full calendar year of prices,</li>
<li>taking the prices that align with the technical model.</li>
</ul>
<p>If we were setting up our model in November 2022 with a technical model based on 2019 data, the <strong>standard industry approach</strong> would likely be one of the following:</p>
<ul>
<li>the <strong>most recent prices</strong> - October 2021 to September 2022,</li>
<li>the <strong>most recent calendar year</strong> - January 2021 to December 2021,</li>
<li>prices that <strong>align with the technical data</strong> - January 2019 to December 2019.</li>
</ul>
<p>Below we will demonstrate why all of these commonly used methodologies <strong>introduce a large source of error</strong>.</p>
<h2 id="error-of-using-recent-prices">Error of Using Recent Prices</h2>
<p>In our example above, we assumed prices at <code class="language-plaintext highlighter-rouge">100 $/MWh</code>. The figure below uses the same financial model with the actual annual average electricity prices for South Australia:</p>
<p><img src="/assets/typical-year/f1.png" alt="Project savings versus annual average electricity prices." /></p>
<p><strong>Look at the variance of these results!</strong> Around half of our projects lose money, with the other half being profitable.</p>
<p>This is the variance error that the standard industry approaches hide - normally we only see a single estimate, without the spread across different years of price data.</p>
<p>This variance in project performance is only occurring based on <em>when we do our modelling</em> - not based on the fundamental, underlying economics of the project.</p>
<p><strong>We can do better!</strong></p>
<h1 id="creating-a-typical-year-forecast">Creating a Typical Year Forecast</h1>
<p>Creating a typical year forecast requires defining what typical means.</p>
<p>For these forecasts we will <strong>define typical as similarity</strong> - our typical year forecast will be made of <em>samples of data that are most similar to all the other data</em>.</p>
<p>We can <strong>quantify similarity by defining an error metric</strong> - the error between <strong>statistics measured across all our data and statistics measured across a candidate sample</strong>. The samples that minimize this error will be selected and used in our forecast.</p>
<p>For our first typical year forecast, we will create a forecast based on a single statistic - <strong>the average price within a month</strong>.</p>
<p>The basic idea is as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Creating a Typical Year Forecast based on the Mean with 5 Years of Historical Data.
</span>
<span class="c1"># Iterate across each month in a year (12 months in total).
</span><span class="k">for</span> <span class="n">each</span> <span class="n">month</span> <span class="ow">in</span> <span class="n">a</span> <span class="n">year</span> <span class="p">(</span><span class="n">Jan</span><span class="p">,</span> <span class="n">Feb</span> <span class="p">...</span> <span class="n">Nov</span><span class="p">,</span> <span class="n">Dec</span><span class="p">)</span>
<span class="c1"># Calculate one long term statistic across all 5 years for this one month.
</span> <span class="n">long_term_mean</span> <span class="o">=</span> <span class="n">historical_data</span><span class="p">[</span><span class="n">month</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># Iterate across our historical data, selecting this one month,
</span> <span class="c1"># 5 months across 5 years, all the same month.
</span> <span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="n">historical_data</span>
<span class="n">sample_mean</span> <span class="o">=</span> <span class="n">year</span><span class="p">[</span><span class="n">month</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="c1"># Calculate the error of this month versus the long term statistic.
</span> <span class="n">sample_error</span> <span class="o">=</span> <span class="n">absolute</span><span class="p">(</span><span class="n">sample_mean</span> <span class="o">-</span> <span class="n">long_term_mean</span><span class="p">)</span>
<span class="c1"># Select the sample with the lowest sample error,
</span> <span class="c1"># this is the historical month we will use in our typical year forecast.
</span> <span class="n">selected_sample</span> <span class="o">=</span> <span class="n">argmin</span><span class="p">(</span><span class="n">sample_errors</span><span class="p">)</span>
</code></pre></div></div>
<p>After following this procedure, we will select 12 monthly samples - one for each month in a year, creating our typical year forecast.</p>
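<p>A runnable sketch of the pseudocode above, using pandas - the price data here is synthetic and the variable names are illustrative, not taken from the original analysis:</p>

```python
import numpy as np
import pandas as pd

# Synthetic 5 years of daily prices - stands in for real historical data.
idx = pd.date_range("2017-01-01", "2021-12-31", freq="D")
prices = pd.Series(np.random.default_rng(42).uniform(0, 200, len(idx)), index=idx)

typical = {}
for month in range(1, 13):
    # All historical data for this calendar month, across every year.
    candidates = prices[prices.index.month == month]
    # One long term statistic across all years for this month.
    long_term_mean = candidates.mean()
    # Error of each year's month versus the long term statistic.
    errors = (candidates.groupby(candidates.index.year).mean() - long_term_mean).abs()
    # Select the year whose month is most similar to the long term average.
    typical[month] = errors.idxmin()

print(typical)  # maps month number -> selected historical year
```

<p>The 12 selected months can then be concatenated into the typical year forecast.</p>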
<h2 id="typical-year-forecast-for-south-australian-electricity-prices">Typical Year Forecast for South Australian Electricity Prices</h2>
<p>To further demonstrate the idea, we will first limit ourselves to <strong>forecasting a single month</strong> - January, for electricity prices in South Australia, using 10 years of historical data.</p>
<p>Let’s first start by <strong>calculating our long term statistic</strong> - the average price in January across the entire dataset, which is <code class="language-plaintext highlighter-rouge">85.449 $/MWh</code>.</p>
<p>We can then look at what the average price was in each January and calculate the <strong>error versus the long term statistic</strong>.</p>
<p>This leads us to selecting January 2017 as our typical month of electricity prices:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: left">month</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">error-mean</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">January</td>
<td style="text-align: right">25.6153</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">59.8337</td>
</tr>
<tr>
<td style="text-align: right">2013</td>
<td style="text-align: left">January</td>
<td style="text-align: right">59.1246</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">26.3244</td>
</tr>
<tr>
<td style="text-align: right">2014</td>
<td style="text-align: left">January</td>
<td style="text-align: right">88.8675</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">3.41845</td>
</tr>
<tr>
<td style="text-align: right">2015</td>
<td style="text-align: left">January</td>
<td style="text-align: right">34.68</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">50.769</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">January</td>
<td style="text-align: right">50.2573</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">35.1917</td>
</tr>
<tr>
<td style="text-align: right">2017</td>
<td style="text-align: left">January</td>
<td style="text-align: right">84.2589</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right"><strong>1.19009</strong></td>
</tr>
<tr>
<td style="text-align: right">2018</td>
<td style="text-align: left">January</td>
<td style="text-align: right">158.757</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">73.3081</td>
</tr>
<tr>
<td style="text-align: right">2019</td>
<td style="text-align: left">January</td>
<td style="text-align: right">241.025</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">155.576</td>
</tr>
<tr>
<td style="text-align: right">2020</td>
<td style="text-align: left">January</td>
<td style="text-align: right">83.2037</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">2.24526</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">January</td>
<td style="text-align: right">28.7008</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">56.7482</td>
</tr>
</tbody>
</table>
<p>We can then repeat the procedure above to forecast the remaining 11 months of the year, ending up with 12 months that make up our typical year forecast:</p>
<table>
<thead>
<tr>
<th style="text-align: right">year</th>
<th style="text-align: left">month</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">error-mean</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2017</td>
<td style="text-align: left">January</td>
<td style="text-align: right">84.2589</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">1.19009</td>
</tr>
<tr>
<td style="text-align: right">2020</td>
<td style="text-align: left">February</td>
<td style="text-align: right">64.1771</td>
<td style="text-align: right">71.2239</td>
<td style="text-align: right">7.04685</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">March</td>
<td style="text-align: right">68.7727</td>
<td style="text-align: right">66.6858</td>
<td style="text-align: right">2.08692</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">April</td>
<td style="text-align: right">52.1361</td>
<td style="text-align: right">64.1214</td>
<td style="text-align: right">11.9854</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">May</td>
<td style="text-align: right">70.6976</td>
<td style="text-align: right">70.1316</td>
<td style="text-align: right">0.565976</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">June</td>
<td style="text-align: right">84.3886</td>
<td style="text-align: right">81.6753</td>
<td style="text-align: right">2.71335</td>
</tr>
<tr>
<td style="text-align: right">2021</td>
<td style="text-align: left">July</td>
<td style="text-align: right">91.1873</td>
<td style="text-align: right">94.7737</td>
<td style="text-align: right">3.58638</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">August</td>
<td style="text-align: right">66.2397</td>
<td style="text-align: right">64.8625</td>
<td style="text-align: right">1.37717</td>
</tr>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">September</td>
<td style="text-align: right">53.7977</td>
<td style="text-align: right">54.7594</td>
<td style="text-align: right">0.961707</td>
</tr>
<tr>
<td style="text-align: right">2012</td>
<td style="text-align: left">October</td>
<td style="text-align: right">50.9616</td>
<td style="text-align: right">52.3186</td>
<td style="text-align: right">1.35705</td>
</tr>
<tr>
<td style="text-align: right">2016</td>
<td style="text-align: left">November</td>
<td style="text-align: right">61.8883</td>
<td style="text-align: right">57.3279</td>
<td style="text-align: right">4.56045</td>
</tr>
<tr>
<td style="text-align: right">2015</td>
<td style="text-align: left">December</td>
<td style="text-align: right">66.8321</td>
<td style="text-align: right">67.2765</td>
<td style="text-align: right">0.444369</td>
</tr>
</tbody>
</table>
<p>Our typical year forecast, in all its light blue glory:</p>
<p><img src="/assets/typical-year/f2.png" alt="Typical year forecast using the mean as a statistic." /></p>
<p>We can compare this typical year forecast to actual historical prices - for the years where we have sampled our typical month from, our forecast directly overlaps the historical data:</p>
<p><img src="/assets/typical-year/f3.png" alt="Comparing our typical year forecast using the mean as a statistic to historical data." /></p>
<h2 id="extending-the-forecast-with-more-statistics">Extending the Forecast With More Statistics</h2>
<p>Above we only considered the mean when selecting a month. The mean is a measurement of the <em>central tendency</em> of a distribution - selecting months by the mean gives our forecast a similar central point to the long term average.</p>
<p>For some energy models, <strong>the variance is more important than the average</strong>.</p>
<p>The variance is how <em>spread out</em> prices are - it’s important for batteries operating in wholesale arbitrage, as this spread puts an upper limit on how profitable shifting electricity between intervals can be.</p>
<p>Our procedure for creating a typical year forecast based on <strong>both the mean and the variance</strong> is similar to only considering the mean.</p>
<p>We instead calculate two additional statistics (the long term standard deviation and the sample standard deviation), and include them in our sample error:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Creating a typical year forecast based on the mean & standard deviation
</span>
<span class="c1"># Iterate across each month in a year.
</span><span class="k">for</span> <span class="n">month</span> <span class="ow">in</span> <span class="p">(</span><span class="n">Jan</span><span class="p">,</span> <span class="n">Feb</span> <span class="p">...</span> <span class="n">Nov</span><span class="p">,</span> <span class="n">Dec</span><span class="p">):</span>
<span class="c1"># Calculate two statistics - long term mean & standard deviation.
</span> <span class="n">long_term_mean</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">long_term_std</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">std</span><span class="p">()</span>
<span class="c1"># Iterate across historical data & calculate sample errors,
</span> <span class="c1"># using both long term statistics
</span> <span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="p">(</span><span class="n">historical</span> <span class="n">data</span><span class="p">):</span>
<span class="n">sample_mean</span> <span class="o">=</span> <span class="n">month</span><span class="p">.</span><span class="n">year</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">sample_std</span> <span class="o">=</span> <span class="n">month</span><span class="p">.</span><span class="n">year</span><span class="p">.</span><span class="n">std</span><span class="p">()</span>
<span class="n">sample_error</span> <span class="o">=</span> <span class="n">absolute</span><span class="p">(</span><span class="n">long_term_mean</span> <span class="o">-</span> <span class="n">sample_mean</span><span class="p">)</span> <span class="o">+</span> <span class="n">absolute</span><span class="p">(</span><span class="n">long_term_std</span> <span class="o">-</span> <span class="n">sample_std</span><span class="p">)</span>
<span class="c1"># Select sample that minimizes error.
</span> <span class="n">selected_sample</span> <span class="o">=</span> <span class="n">argmin</span><span class="p">(</span><span class="n">sample_errors</span><span class="p">)</span>
</code></pre></div></div>
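<p>The two-statistic version can be sketched the same way - again with synthetic data and illustrative names; here the mean and standard deviation errors are simply summed, unweighted:</p>

```python
import numpy as np
import pandas as pd

# Synthetic 5 years of daily prices - stands in for real historical data.
idx = pd.date_range("2017-01-01", "2021-12-31", freq="D")
prices = pd.Series(np.random.default_rng(0).uniform(0, 200, len(idx)), index=idx)

typical = {}
for month in range(1, 13):
    candidates = prices[prices.index.month == month]
    # Two long term statistics for this month.
    long_term_mean, long_term_std = candidates.mean(), candidates.std()
    by_year = candidates.groupby(candidates.index.year)
    # Sample error combines the absolute error of both statistics.
    errors = (by_year.mean() - long_term_mean).abs() + (by_year.std() - long_term_std).abs()
    typical[month] = errors.idxmin()
```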
<p>Taking this approach again, we end up with our typical year forecast - different from our previous forecast where we only used the mean:</p>
<table>
<thead>
<tr>
<th style="text-align: left">month</th>
<th style="text-align: right">year</th>
<th style="text-align: right">price-mean</th>
<th style="text-align: right">long-term-mean</th>
<th style="text-align: right">price-std</th>
<th style="text-align: right">long-term-std</th>
<th style="text-align: right">error</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">January</td>
<td style="text-align: right">2020</td>
<td style="text-align: right">83.2037</td>
<td style="text-align: right">85.449</td>
<td style="text-align: right">519.785</td>
<td style="text-align: right">504.705</td>
<td style="text-align: right">17.3251</td>
</tr>
<tr>
<td style="text-align: left">February</td>
<td style="text-align: right">2018</td>
<td style="text-align: right">109.17</td>
<td style="text-align: right">71.2239</td>
<td style="text-align: right">290.873</td>
<td style="text-align: right">300.955</td>
<td style="text-align: right">48.0282</td>
</tr>
<tr>
<td style="text-align: left">March</td>
<td style="text-align: right">2020</td>
<td style="text-align: right">46.9517</td>
<td style="text-align: right">66.6858</td>
<td style="text-align: right">225.829</td>
<td style="text-align: right">271.301</td>
<td style="text-align: right">65.2057</td>
</tr>
<tr>
<td style="text-align: left">April</td>
<td style="text-align: right">2015</td>
<td style="text-align: right">39.9493</td>
<td style="text-align: right">64.1214</td>
<td style="text-align: right">100.387</td>
<td style="text-align: right">99.2508</td>
<td style="text-align: right">25.3085</td>
</tr>
<tr>
<td style="text-align: left">May</td>
<td style="text-align: right">2016</td>
<td style="text-align: right">70.6976</td>
<td style="text-align: right">70.1316</td>
<td style="text-align: right">132.686</td>
<td style="text-align: right">133.63</td>
<td style="text-align: right">1.5091</td>
</tr>
<tr>
<td style="text-align: left">June</td>
<td style="text-align: right">2021</td>
<td style="text-align: right">84.3886</td>
<td style="text-align: right">81.6753</td>
<td style="text-align: right">96.1186</td>
<td style="text-align: right">130.305</td>
<td style="text-align: right">36.8999</td>
</tr>
<tr>
<td style="text-align: left">July</td>
<td style="text-align: right">2015</td>
<td style="text-align: right">73.5053</td>
<td style="text-align: right">94.7737</td>
<td style="text-align: right">226.191</td>
<td style="text-align: right">236.491</td>
<td style="text-align: right">31.5684</td>
</tr>
<tr>
<td style="text-align: left">August</td>
<td style="text-align: right">2013</td>
<td style="text-align: right">71.2364</td>
<td style="text-align: right">64.8625</td>
<td style="text-align: right">88.1036</td>
<td style="text-align: right">103.648</td>
<td style="text-align: right">21.9185</td>
</tr>
<tr>
<td style="text-align: left">September</td>
<td style="text-align: right">2012</td>
<td style="text-align: right">53.7977</td>
<td style="text-align: right">54.7594</td>
<td style="text-align: right">62.1015</td>
<td style="text-align: right">75.617</td>
<td style="text-align: right">14.4772</td>
</tr>
<tr>
<td style="text-align: left">October</td>
<td style="text-align: right">2019</td>
<td style="text-align: right">67.3398</td>
<td style="text-align: right">52.3186</td>
<td style="text-align: right">92.2279</td>
<td style="text-align: right">108.001</td>
<td style="text-align: right">30.7947</td>
</tr>
<tr>
<td style="text-align: left">November</td>
<td style="text-align: right">2019</td>
<td style="text-align: right">50.8623</td>
<td style="text-align: right">57.3279</td>
<td style="text-align: right">88.3317</td>
<td style="text-align: right">109.014</td>
<td style="text-align: right">27.1474</td>
</tr>
<tr>
<td style="text-align: left">December</td>
<td style="text-align: right">2013</td>
<td style="text-align: right">79.5734</td>
<td style="text-align: right">67.2765</td>
<td style="text-align: right">372.848</td>
<td style="text-align: right">318.756</td>
<td style="text-align: right">66.3892</td>
</tr>
</tbody>
</table>
<p>We can compare our two typical year forecasts directly:</p>
<p><img src="/assets/typical-year/f4.png" alt="Comparing typical year forecasts using the mean versus the mean and standard deviation." /></p>
<p>Typical year forecasting based on both the mean and the variance selects months with higher prices - including more of the tasty price spikes that make Australia’s National Electricity Market (NEM) so interesting for battery storage.</p>
<h1 id="evaluating-the-typical-year-forecast">Evaluating the Typical Year Forecast</h1>
<p>Let’s return to our original motivating example, with an additional estimate of our project’s cumulative savings using our typical year forecast based on the mean (shown as 2052 in green):</p>
<p><img src="/assets/typical-year/f5.png" alt="Typical year forecast using the mean as a statistic." /></p>
<p><strong>How great is that!</strong></p>
<p>Our typical year forecast does a <strong>fantastic job of cutting through the variance</strong> - modelling our project right in the middle of the high variance estimates we get when taking the traditional, industry standard approaches of using historical price data.</p>
<p>No longer are we slaves to the cruel master of time (well, perhaps we still are) - as the years go by, our estimation of project economics will stay stable and consistent, rather than varying wildly based on when we are doing our modelling.</p>
<p>As new price data becomes available, our typical year forecast will change (due to the long term statistics changing, or more recent data being more typical), but the variance from these changes will be minor compared to the massive year on year swings we get with the standard industry approaches.</p>
<h1 id="discussion">Discussion</h1>
<p>Above we have seen how great our typical year forecast is at reducing the variance of our estimates of project performance - let’s now discuss some challenges and potential extensions to this simple typical year forecasting method.</p>
<h2 id="challenges">Challenges</h2>
<h3 id="data-quantity">Data Quantity</h3>
<p>This methodology requires multiple years of data - if we only have access to a single year, this method is not appropriate.</p>
<h3 id="alignment">Alignment</h3>
<p>One problem that arises when concatenating interval data from different time periods is alignment at the intersection - the sample below from the typical year forecast produced above shows the issue: our forecast jumps from a Tuesday at the end of January 2017 to a Saturday at the start of February 2020:</p>
<table>
<thead>
<tr>
<th style="text-align: left">forecast</th>
<th style="text-align: left">original-timestamps</th>
<th style="text-align: right">price</th>
<th style="text-align: right">day-of-week-forecast</th>
<th style="text-align: right">day-of-week-original</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">2052-01-31 23:50:00</td>
<td style="text-align: left">2017-01-31 23:50:00</td>
<td style="text-align: right">39.52</td>
<td style="text-align: right">2</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: left">2052-01-31 23:55:00</td>
<td style="text-align: left">2017-01-31 23:55:00</td>
<td style="text-align: right">39.52</td>
<td style="text-align: right">2</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: left">2052-02-01 00:00:00</td>
<td style="text-align: left">2020-02-01 00:00:00</td>
<td style="text-align: right">299.2</td>
<td style="text-align: right">3</td>
<td style="text-align: right">5</td>
</tr>
<tr>
<td style="text-align: left">2052-02-01 00:05:00</td>
<td style="text-align: left">2020-02-01 00:05:00</td>
<td style="text-align: right">299.2</td>
<td style="text-align: right">3</td>
<td style="text-align: right">5</td>
</tr>
</tbody>
</table>
<p>This misalignment leads to an incorrect number of weekdays or weekend days in a year - important because energy demand and prices have strong weekly seasonality.</p>
<p>This alignment problem also occurs when you don’t use a typical year forecast - for example if you use price data from 2022 with technical data from 2010.</p>
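<p>The stitch-point misalignment in the table above can be checked directly - a small sketch using the two timestamps where the January and February samples meet:</p>

```python
import pandas as pd

# Last interval of the January sample & first interval of the February sample.
end_of_january = pd.Timestamp("2017-01-31 23:55:00")
start_of_february = pd.Timestamp("2020-02-01 00:00:00")

# pandas counts Monday=0 ... Sunday=6.
print(end_of_january.dayofweek)     # 1 - a Tuesday
print(start_of_february.dayofweek)  # 5 - a Saturday
```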
<h3 id="domain-expertise">Domain Expertise</h3>
<p>Domain expertise is required to set up a typical year forecast - primarily in defining the appropriate statistics.</p>
<p>Using multiple statistics can also require weighting - for example, if the standard deviation is orders of magnitude larger than the mean, we may want to weight the mean more heavily.</p>
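<p>One simple way to weight statistics that live on different scales is to normalize each error by its long term statistic before summing - a sketch of one possible choice, not the only one:</p>

```python
def sample_error(sample_mean, sample_std, long_term_mean, long_term_std):
    """Relative errors, so neither statistic dominates due to its scale."""
    mean_error = abs(sample_mean - long_term_mean) / long_term_mean
    std_error = abs(sample_std - long_term_std) / long_term_std
    return mean_error + std_error

# A sample whose mean & standard deviation both sit 10% away from the
# long term statistics scores 0.1 + 0.1 = 0.2:
print(sample_error(110, 550, 100, 500))
```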
<h2 id="extensions--improvements">Extensions & Improvements</h2>
<h3 id="higher-frequency-sampling">Higher Frequency Sampling</h3>
<p>In the examples above we have selected samples on a monthly basis - it is possible to instead select samples on a different frequency, such as week of the year (52 weeks) or day of the year (365 days).</p>
<h3 id="more-statistics">More Statistics</h3>
<p>One advantage of this methodology is the flexibility in the statistics we choose - unlike a loss function for a neural network, they do not need to be differentiable.</p>
<p>For example, we could use statistics like:</p>
<ul>
<li>mean, median, mode,</li>
<li>number of time periods above a threshold price,</li>
<li>number of negative prices.</li>
</ul>
<p>This is an exciting feature of typical year forecasting - the <strong>flexibility and simplicity of using any statistic</strong> that aligns your technical and financial models with your business goals.</p>
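<p>Because the statistics never need to be differentiable, any function that maps prices to a number will do - a sketch with some illustrative statistics (the 300 $/MWh spike threshold is an arbitrary choice):</p>

```python
import numpy as np

# Any callable mapping an array of prices to a number can act as a statistic.
statistics = {
    "mean": np.mean,
    "median": np.median,
    "n_spikes": lambda p: np.sum(np.asarray(p) > 300.0),
    "n_negative": lambda p: np.sum(np.asarray(p) < 0.0),
}

prices = [50.0, -10.0, 80.0, 450.0, 60.0]
print({name: float(stat(prices)) for name, stat in statistics.items()})
# {'mean': 126.0, 'median': 60.0, 'n_spikes': 1.0, 'n_negative': 1.0}
```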
<h1 id="summary">Summary</h1>
<p>In this post we introduced <em>typical year forecasting</em> - a flexible, powerful forecasting method suitable for use in energy project business case modelling.</p>
<p>Typical year forecasts address a <strong>hidden flaw in the price assumptions commonly used in industry</strong> - the large errors introduced by using recent price data.</p>
<p>A typical year forecast addresses these issues by <strong>selecting historical price data that is most similar to all the historical data</strong>.</p>
<p>Typical year forecasts have the following advantages:</p>
<ul>
<li><strong>simple to create</strong> - no machine learning, gradients or iterative calculations,</li>
<li><strong>interpretable</strong> - easy to understand why one sample is selected over others,</li>
<li><strong>realistic</strong> - the forecast is made from actual historical data,</li>
<li><strong>domain flexible</strong> - can be used with any time series (not just electricity prices),</li>
<li><strong>statistically flexible</strong> - can use a range of statistics to define what typical means.</li>
</ul>
<p>A typical year forecast has the following disadvantages:</p>
<ul>
<li><strong>data quantity</strong> - requires at least 2 years of historical data,</li>
<li><strong>domain knowledge</strong> - requires selecting & weighting statistics based on problem understanding.</li>
</ul>
<p>Further extensions on the methods shown above include:</p>
<ul>
<li><strong>higher frequency sampling</strong> on a weekly or daily basis,</li>
<li>using a <strong>variety of statistics</strong> to define similarity, such as the number of price spikes or the number of negative prices.</li>
</ul>
<hr />
<p>Thanks for reading!</p>
<p>If you enjoyed this post, make sure to check out <a href="https://adgefficiency.com/energy-py-linear-forecast-quality/">Measuring Forecast Quality using Linear Programming</a>.</p>
<p>You can find the materials to reproduce this analysis at <a href="https://github.com/ADGEfficiency/typical-year-forecasting-electricity-prices">adgefficiency/typical-year-electricity-price-forecasting</a>.</p>
<h1 id="jevons-paradox">Jevon’s Paradox</h1>
<p><em>2022-09-04 - <a href="https://adgefficiency.com/jevons-paradox">adgefficiency.com/jevons-paradox</a></em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>created: 2017-10-30, updated: 2022-09-04
</code></pre></div></div>
<p>It’s intuitive that improving energy efficiency will reduce energy use. <strong>Unfortunately, it’s not that simple</strong>.</p>
<h2 id="the-coal-question">The Coal Question</h2>
<p>In the 1865 book <em>The Coal Question</em> W. Stanley Jevons points out that efficiency improvements in the production of iron occurred at the same time as increases in the total amount of coal used to produce iron.</p>
<p><strong>The improved efficiency of coal use did not reduce coal consumption - instead coal consumption increased</strong>.</p>
<p>This is Jevons Paradox - improving the efficiency of resource use leads to increases in resource consumption. LED lighting is a modern example, with cheap, high efficiency LED lights now covering the planet. <strong>This is an inconvenient truth for energy efficiency</strong>.</p>
<h2 id="thinking-in-second--third-order-effects">Thinking In Second & Third Order Effects</h2>
<p>It’s not that efficiency doesn’t work - improving efficiency means there will be less primary energy per unit of utility.</p>
<p>It’s what happens afterwards that is the problem - the efficiency gains can be cancelled out by second and third order effects. Let’s look at some of the effects of improving the efficiency of gas-fired heating:</p>
<ul>
<li><strong>first order effect</strong> - less gas is required to supply the same amount of heat. This effect is positive - we don’t burn as much gas to provide the same amount of energy.</li>
<li><strong>second order effect</strong> - we now get more heat for the same amount of money. We spend the same amount, we get more heat - but no carbon saving. We can afford to heat bigger homes for the same amount of gas.</li>
<li><strong>third order effect</strong> - increased efficiency means consumers pay less for gas - meaning this money can be spent elsewhere. What does the economy do with this saved money?</li>
</ul>
<p>If the efficiency saving is spent on taking a long haul holiday, we could actually see an increase in global carbon emissions. <strong>We improve the efficiency of supplying heat but overall as a civilization we burn more carbon</strong>. Alternatively if the saving is spent on building cleaner energy generation then even increases in utility could lead to a carbon saving.</p>
<p><strong>It’s very difficult to understand what effect Jevons Paradox has across different consumers, economies and technologies</strong>. Measuring the first order effects of energy efficiency projects is notoriously difficult - let alone any second or third order effects.</p>
<h2 id="is-energy-efficiency-still-worthwhile">Is Energy Efficiency Still Worthwhile?</h2>
<p>Energy efficiency drives economic progress - this makes it worth doing. Yet for those concerned with decarbonization, energy efficiency may not be as effective as expected.</p>
<p>Jevons Paradox does not always apply - the negative second or third order effects of energy efficiency can be smaller or larger than the efficiency saving. There is also huge value in increasing adoption of advanced technology, such as the additional light we get from LED lights.</p>
<p>In some cases, however, focusing on making sure energy comes from clean primary sources is a safer bet than trying to use dirty energy more efficiently - you may just end up using more dirty energy as a result.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox - Wikipedia</a></li>
<li><a href="http://bigthink.com/politeia/the-energy-efficiency-paradox">The Energy Efficiency Paradox</a></li>
<li><a href="http://www.nakedcapitalism.com/2011/10/energy-efficiency-doesn%e2%80%99t-work.html">Energy Efficiency Doesn’t Work</a></li>
<li><a href="http://reason.com/archives/2012/10/31/the-paradox-of-energy-efficiency">The Paradox of Energy Efficiency</a></li>
</ul>
<hr />
<p>Thanks for reading!</p>