
Deep Learning with Integer Activations

Simon Ramstedt, 2018-02-10

A couple of weeks ago I got to present Temporally Efficient Deep Learning with Spikes by O’Connor et al., 2018 in a reading group at the lab. I liked the modular way in which it presents its method. It has little boxes like this

that describe the stateful modules that make up the algorithms. Here, I want to look in detail at the mathematical assumptions that have to be made for the method to be valid. While most (but not all) of the math here can also be found scattered throughout the paper, I am trying to present it in a more linear, proof-like manner.

Why spiking neural networks?

Spiking neural networks are interesting in two ways. 1) The brain uses spikes and we want to understand how it works. 2) Spikes are binary and therefore are cheaper to communicate and store and also have the potential to reduce the costly weight multiplication in neural networks to a cheap sum of integer weights.

In the brain, each neuron has on average 7000 connections to other neurons. Therefore it makes sense to trade computation and bandwidth in the connections for computation on the neuron level, i.e. spend computation on encoding the activations.

To recap, a typical, non-spiking neural activation is computed as

$$x' = h(z) \quad \text{with} \quad z = w\,x = \left(\sum_i w_{ij}\,x_i\right)_j$$

where $x$ are the activations of the previous layer (or the network inputs) and $h$ is a non-linear function, e.g. $h(z) = \max(z, 0)$.

Usually floating point numbers are used to represent weights and activations. Floating point numbers are divided into an exponent and a mantissa. To multiply two of them we have to integer-add the exponents and integer-multiply the mantissas. This is implemented in hardware, but it still requires a lot of chip space and energy. However, if $x$ were binary (i.e. a spike), we could compute $z$ with a sparse sum. We could even use integers to represent $w$, and then the multiplication would be as simple and computationally cheap as it gets.
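As a toy illustration of why binary activations make this cheap, here is a small NumPy sketch (the shapes and values are made up): with a binary $x$, the multiply-accumulate collapses into selecting and summing weight rows.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(-3, 4, size=(5, 4))   # integer weights, shape (inputs, outputs)
x = np.array([1, 0, 1, 1, 0])          # binary "spike" activations

# dense multiply-accumulate (what we'd do with float activations)
z_dense = x @ w

# with binary x, z is just a sum over the rows where x spiked -- no multiplies
z_sparse = w[x == 1].sum(axis=0)

assert np.array_equal(z_dense, z_sparse)
```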

In their paper, O’Connor et al., 2018 introduce an encoding scheme that uses “integer spikes” in the forward pass, the backward pass, and the weight updates, without loss in accuracy compared to non-spiking networks.


One thing about the paper I found somewhat misleading is that the authors call their method spiking, when they actually use “integer spikes”. Using integers instead of binary values to communicate activations is still much cheaper than floats but it requires integer multiplications to compute the inner product with the weights. So a more appropriate name would have been “Temporally Efficient Deep Learning with Integer Activations”. Nevertheless the paper is very insightful and it would probably be possible to tweak the method in certain ways to allow it to work with only binary spikes.


Below we see the dataflow from one neuron to another neuron in a standard neural network. In the next section we will focus on the axon part, i.e., communicating the activations and trying to find a bandwidth saving encoding.

$$\cdots \underbrace{\overset{z}{⟶} h}_\text{neuron a} \ \boxed{\underbrace{\overset{x}{⟶}}_\text{axon}} \ \underbrace{w}_\text{synapses} \ \underbrace{\overset{z}{⟶} h}_\text{neuron b} \overset{x}{⟶} \cdots$$

Predictive Coding $\quad x → \text{enc} → a → \text{dec} → \hat x$

In predictive coding the sender and receiver share a model for the temporal evolution of the signal between them. Instead of communicating the original signal, only the model error is communicated and therefore only the model error is affected by channel noise which results in a higher signal-to-noise ratio.

Predictive coding is usually not used for neuron-to-neuron communication because the channel is not noisy (we usually use float32 to communicate the activations). Since we want to save bandwidth however, we will have to quantize the signal and therefore introduce quantization noise (see next section).

The neuron-to-neuron communication in a standard neural network without predictive coding can be framed as predictive coding with the model $x_t = 0 + a_t$ and the error $a_t = x_t$ that has to be communicated. Another very simple model would be to assume the signal stays constant, i.e. $x_t = x_{t-1} + a_t$; then we would only transmit activation changes.
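The constant-signal model is easy to sketch in plain Python (the signal values are made up): only the changes are transmitted, and the receiver integrates them back up.

```python
# "constant signal" model: only changes a_t = x_t - x_{t-1} are transmitted
x = [0.0, 0.0, 1.0, 1.0, 1.0, 0.5]                        # made-up signal
a = [x[0]] + [x[t] - x[t - 1] for t in range(1, len(x))]  # mostly zeros

# the receiver integrates the changes back up
x_hat, acc = [], 0.0
for a_t in a:
    acc += a_t
    x_hat.append(acc)

assert x_hat == x
```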

O’Connor et al. use a similar decaying model:

$$x_t = \frac{k_d}{k_p + k_d} x_{t-1} + \frac{1}{k_p + k_d} a_t$$

Note that the error $a_t$ is scaled by the factor $\frac{1}{k_p + k_d}$. We can rewrite the model equation as an encoder-decoder pair:


$$\begin{aligned} \text{enc:} & \quad a_t = k_p x_t + k_d (x_t - x_{t-1}) \\ \text{dec:} & \quad \hat x_t = x_t = \frac{a_t + k_d x_{t-1}}{k_p + k_d} \end{aligned}$$


We can also unroll the $x_{t-1}$ in this expression (useful for later):

$$\begin{aligned} x_t &= \frac{a_t}{k_p + k_d} + x_{t-1} \frac{k_d}{k_p + k_d} \\ &= \frac{a_t}{k_p + k_d} + \left(\frac{a_{t-1}}{k_p + k_d} + x_{t-2} \frac{k_d}{k_p + k_d}\right) \frac{k_d}{k_p + k_d} \\ &= \frac{a_t}{k_p + k_d} + \frac{k_d\, a_{t-1}}{(k_p + k_d)^2} + x_{t-2} \left(\frac{k_d}{k_p + k_d}\right)^2 \\ &= \frac{1}{k_p + k_d} \sum_{i=0}^{t} \left(\frac{k_d}{k_p + k_d}\right)^{t-i} a_i \end{aligned}$$
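The encoder-decoder pair is short enough to sketch directly in NumPy (the function names and the values of $k_p$, $k_d$ are my own arbitrary choices). Without quantization, the round-trip is lossless:

```python
import numpy as np

kp, kd = 0.05, 1.0   # arbitrary demo values

def encode(x):
    # a_t = kp * x_t + kd * (x_t - x_{t-1}), with x_{-1} = 0
    x_prev = np.concatenate([[0.0], x[:-1]])
    return kp * x + kd * (x - x_prev)

def decode(a):
    # x_hat_t = (a_t + kd * x_hat_{t-1}) / (kp + kd)
    x_hat, prev = np.empty_like(a), 0.0
    for t, a_t in enumerate(a):
        prev = (a_t + kd * prev) / (kp + kd)
        x_hat[t] = prev
    return x_hat

x = np.sin(np.linspace(0, 4, 100))
# without quantization, the encoder-decoder pair reconstructs x exactly
assert np.allclose(decode(encode(x)), x)
```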

Sigma-Delta modulation $\quad a → Q → s → Q^{-1} → \hat a$

Sigma-Delta modulation is a quantization scheme and a form of noise shaping for converting high bit-count, low frequency signals into low bit-count, high frequency signals. Let’s look at how that works:

Because quantization $s = \operatorname{round}(a)$ loses information, we store the “leftover” $\phi = a - s$ and add it at the next timestep: $s = \operatorname{round}(\phi + a)$.


So, starting with $\phi_0 = 0$, we have

$$\begin{aligned} & s_t = \operatorname{round}(\phi_t + a_t) \\ & \phi_{t+1} = (\phi_t + a_t) - s_t \end{aligned}$$
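This quantizer fits in a few lines of Python (the helper name `sigma_delta` is my own). For a constant signal in $[0, 1]$ the spikes come out binary, and their running mean recovers the signal:

```python
import numpy as np

def sigma_delta(a):
    # quantize a real-valued stream into integer "spikes",
    # carrying the rounding leftover phi to the next timestep
    s = np.empty_like(a, dtype=int)
    phi = 0.0
    for t, a_t in enumerate(a):
        s[t] = round(phi + a_t)
        phi = (phi + a_t) - s[t]
    return s

s = sigma_delta(np.full(1000, 0.3))   # constant signal a_t = 0.3
assert set(s.tolist()) <= {0, 1}      # binary spikes since a_t is in [0, 1]
assert abs(s.mean() - 0.3) < 0.01     # the running mean recovers a_t
```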


Note: In general we round to the nearest integer. To ensure that we get binary spikes, i.e. $s \in \{0, 1\}$, we need $\phi_t + a_t \in [-0.5, 1.5]$, and because $\phi_t \in [-0.5, 0.5]$ we want $a_t \in [0, 1]$, which we can ensure by increasing the temporal resolution and tweaking $k_p$ and $k_d$ (see previous section).

But how can we reconstruct $a$ from this? To get a relation between $s$ and $a$ we can unroll the expression for $\phi_{t+1}$ for $n$ steps:

$$\phi_{t+1} = (\phi_{t-1} + a_{t-1} - s_{t-1}) + a_t - s_t = \dots = \phi_{t-n+1} + \sum_{i=t-n+1}^{t} a_i - \sum_{i=t-n+1}^{t} s_i$$

This gives us a relation between $\sum a$ and $\sum s$, which is a good starting point.

$$\sum_{i=t-n+1}^{t} a_i = \sum_{i=t-n+1}^{t} s_i + \phi_{t+1} - \phi_{t-n+1}$$

To get $a_t$ we have to assume $a_t = \text{constant}$ over a series of timesteps $\{t-n+1, \dots, t\}$. Then we can write

$$a_t = \tfrac{1}{n} \sum_{i=t-n+1}^{t} a_i = \tfrac{1}{n} \left( \sum_{i=t-n+1}^{t} s_i + \phi_{t+1} - \phi_{t-n+1} \right) = \underbrace{\tfrac{1}{n} \sum_{i=t-n+1}^{t} s_i}_\text{we can access} + \underbrace{\tfrac{\phi_{t+1} - \phi_{t-n+1}}{n}}_\text{error term}$$

For $n \to \infty$ we therefore have $a_t = \lim_{n \to \infty} \frac{1}{n} \sum_{i=t-n+1}^{t} s_i$. Since $\phi_t \in [-0.5, 0.5]$ and $\mathbb{E}[\phi_t] = 0$, we can assume the error term is small even for small $n$. The scale of the sum, on the other hand, is (up to the error term) proportional to $a$. That means the signal-to-noise ratio of the reconstruction depends heavily on which scaling constant $\frac{1}{k_p + k_d}$ we use for $a$.

Furthermore, the requirement for $a_t$ to be constant across many timesteps is not a real limitation. We can just increase the time resolution and increase $n$ proportionally to make the error term small. So if $x$ changes too quickly, we can just make our timesteps smaller.

To “decode” the quantization we therefore have to average the quantized signal. Conveniently the decoding scheme from the previous section already does this implicitly (approximately):

$$\begin{aligned} {\color{blue} x_t} &= c \sum_{i=0}^{t} \left(\frac{k_d}{k_p + k_d}\right)^{t-i} a_i \approx c \sum_{i=t-n}^{t} a_i = c \sum_{i=t-n}^{t} \frac{1}{n} \sum_{j=t-n}^{t} s_j \\ &= c \sum_{i=t-n}^{t} s_i \approx c \sum_{i=0}^{t} \left(\frac{k_d}{k_p + k_d}\right)^{t-i} s_i =: {\color{orangered}{\hat x_t}} \end{aligned}$$

Therefore we don’t need a decoder $Q^{-1}$ for the quantization, so we end up with the following pipeline.

$$\overset{\color{blue} x}{⟶} \text{enc} ⟶ \text{Q} \overset{s}{⟶} \text{dec} \overset{\color{orangered}{\hat x}}{⟶}$$

Below we can see what the combined signals look like for different encoding parameters.
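Since those plots aren’t reproduced here, a rough NumPy sketch of the full enc → Q → dec pipeline gives a feel for it (the function names and parameter values are my own choices): the reconstruction tracks a slowly varying signal, with quantization noise bounded by roughly $\frac{1}{k_p + k_d}$.

```python
import numpy as np

kp, kd = 0.05, 10.0   # arbitrary demo values

def enc(x):
    # a_t = kp * x_t + kd * (x_t - x_{t-1}), with x_{-1} = 0
    x_prev = np.concatenate([[0.0], x[:-1]])
    return kp * x + kd * (x - x_prev)

def quantize(a):
    # sigma-delta: round, carry the leftover phi to the next step
    s, phi = np.empty_like(a), 0.0
    for t, a_t in enumerate(a):
        s[t] = np.round(phi + a_t)
        phi = phi + a_t - s[t]
    return s

def dec(s):
    # x_hat_t = (s_t + kd * x_hat_{t-1}) / (kp + kd)
    x_hat, prev = np.empty_like(s), 0.0
    for t, s_t in enumerate(s):
        prev = (s_t + kd * prev) / (kp + kd)
        x_hat[t] = prev
    return x_hat

x = 0.5 + 0.4 * np.sin(np.linspace(0, 6, 2000))   # slowly varying signal
x_hat = dec(quantize(enc(x)))
# reconstruction error stays within ~1/(kp + kd) of the true signal
assert np.abs(x - x_hat).max() < 0.15
```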

Integer weight multiplication

So far we have established more efficient communication between the neurons, but we have not yet incorporated the weight multiplication.

$$\cdots \underbrace{⟶ h \overset{\color{blue} x}{→} \text{enc} → \text{Q}}_\text{neuron a} \ \underbrace{\overset{s}{⟶}}_\text{axon} \ \underbrace{\text{dec} \overset{\color{orangered}{\hat x}}{→} w}_\text{synapses} \ \underbrace{\overset{z}{⟶} h \cdots}_\text{neuron b} \quad \text{(not what we want)}$$

So we have

$$z_t = w_t\, \hat x_t = w_t\, c \sum_{i=0}^{t} \left(\frac{k_d}{k_p + k_d}\right)^{t-i} s_i$$

Considering that $\hat x_t$ is just a weighted sum, if we assume $w_t = \text{constant}$, we can pull it inside the sum:

$$z_t ≈ c \sum_{i=0}^{t} \left(\frac{k_d}{k_p + k_d}\right)^{t-i} s_i w_t =: \color{red}{\hat z_t}$$

Because $s_i$ is integer, we have achieved our goal of replacing the floating point multiplication with a cheaper sparse integer multiplication! The approximation error we make with the assumption $w_t = \text{constant}$ depends on how fast the terms inside the sum decay, i.e. how large $k_p$ is. Below is the final pipeline and a plot of the reconstruction for different $k_p$.
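A small NumPy sketch of pulling $w$ inside the decaying sum (the names, weights, and spike streams are made-up placeholders): the recursive update on $\hat z$ gives the same result as decoding $\hat x$ first and then multiplying, as long as $w$ stays constant.

```python
import numpy as np

kp, kd = 0.05, 10.0                    # arbitrary demo values
c, r = 1 / (kp + kd), kd / (kp + kd)

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 2))        # fixed weights, shape (inputs, outputs)
s = rng.integers(0, 2, size=(50, 3))   # stream of (binary) spikes per input

# explicit version: decode x_hat first, then float-multiply by w every step
x_hat, z_slow = np.zeros(3), []
for s_t in s:
    x_hat = r * x_hat + c * s_t
    z_slow.append(x_hat @ w)

# spiking version: pull w inside the decaying sum; the only operation
# involving s_t is the sparse integer product s_t @ w
z_hat, z_fast = np.zeros(2), []
for s_t in s:
    z_hat = r * z_hat + c * (s_t @ w)
    z_fast.append(z_hat.copy())

assert np.allclose(z_slow, z_fast)     # identical while w stays constant
```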


$$\cdots \underbrace{⟶ h \overset{\color{blue} x}{→} \text{enc} → \text{Q}}_\text{neuron a} \ \underbrace{\overset{s}{⟶}}_\text{axon} \ \underbrace{\color{orange}{w_t}}_\text{synapses} \ \underbrace{⟶ \text{dec} \overset{\color{red}{\hat z}}{→} h \cdots}_\text{neuron b} \quad (\checkmark)$$


Learning the weights

To learn the weights we can apply the same coding scheme for backpropagation (by making the same assumptions). The symmetric backward pass through the transposed weights is not really biologically plausible but there is orthogonal work on biologically plausible backpropagation.

$$\cdots \underbrace{⟶ h \overset{\color{blue} x}{→} \text{enc} → \text{Q}}_\text{neuron a} \ \underbrace{\overset{\bar x = s}{⟶}}_\text{axon} \ \underbrace{w_t}_\text{synapses} \ \underbrace{⟶ \text{dec} \overset{\hat z}{→} h \cdots}_\text{neuron b} \tag{forward}$$

$$\cdots \underbrace{⟵ h' ← \text{dec}}_\text{neuron a} \ \underbrace{⟵}_\text{axon} \ \underbrace{w^T}_\text{synapses} \ \underbrace{\overset{\bar e}{⟵} \text{Q} ← \text{enc} \overset{e}{←} h'}_\text{neuron b} \overset{∇_{x'} L}{⟵} \cdots \tag{backward}$$

This leads to an efficient backward pass, but in order to update the weights with gradient descent we need to compute the outer product between the activations and the next pre-activation gradients: $∇_w L = x ⊗ e$ (where $e = ∇_z L$), neither of which we have access to.

The simplest solution would be to decode $\hat x = \text{dec}(s)$ and $\hat e = \text{dec}(\bar e)$ before the outer product.

$$\widehat{∇_w L}_\text{recon} = \hat x ⊗ \hat e$$

However, then we still have an expensive floating point multiplication. Instead we can use the fact that the result of the decoder $s → \text{dec} → \hat x$

$$\text{dec:} \quad \hat x_t = \frac{s_t + k_d \hat x_{t-1}}{k_p + k_d}$$

decays exponentially in the absence of spikes (i.e. $s_t = 0$). Therefore we can calculate the sum over time between two spikes (pre-synaptic or post-synaptic) analytically as a geometric series. It is fine to sum the gradients over time and apply them as an update later, because that is what SGD does anyway.

$$\begin{aligned} \sum_{i=t-n}^{t} \hat x_i \hat e_i &= \sum_{i=t-n}^{t} \left(\tfrac{k_d}{k_p + k_d}\right)^{2(i-(t-n))} \hat x_{t-n}\, \hat e_{t-n} \\ &= \hat x_{t-n} \hat e_{t-n} \sum_{j=0}^{n} \Big( \underbrace{\left(\tfrac{k_d}{k_p + k_d}\right)^{2}}_{=r} \Big)^{j} \\ &= \hat x_{t-n} \hat e_{t-n}\, \tfrac{1 - r^{n+1}}{1-r} \end{aligned}$$

Here, $t-n$ is the time at which the last spike occurred. If another spike occurs for either $\bar x$ or $\bar e$, we just add that sum to the corresponding weight (multiplied by the learning rate). This is called “past updates” in the paper.
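The geometric-series shortcut is easy to verify numerically; a tiny Python sketch (all values made up for illustration):

```python
kp, kd = 0.05, 2.0                # arbitrary demo values
r = (kd / (kp + kd)) ** 2         # per-step decay of the product x_hat * e_hat

# n+1 quiet steps after the last spike, summed explicitly ...
x0, e0, n = 0.8, -0.3, 7          # made-up values at the last spike time
explicit = sum(x0 * e0 * r**j for j in range(n + 1))

# ... equal the closed-form geometric series used for the "past updates"
closed_form = x0 * e0 * (1 - r ** (n + 1)) / (1 - r)
assert abs(explicit - closed_form) < 1e-12
```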

Summary

We looked at the forward pass, backward pass and weight updates from the paper and justified every step mathematically (most of which can also be found in the paper). This revealed the assumptions that had to be made and the requirements on the hyperparameters, as well as possible extension points for the method.