Deep Feed-Forward Neural Network
An extension of single-layer perceptrons, where we can have an [see page 3, arbitrary] number of intermediate layers between the input and output layers. We colloquially refer to all layers sitting between the input and output layers as hidden-layers.
Training
Training a multi-layer perceptron is much the same as training a single-layer perceptron, except that when we alter the weights for one layer, say \( n-2 \), we must consider how that change affects the error for layers \( n-1 \) and \( n \), because:
A [see page 10, change] in output for a neuron in layer \( n-2 \) (due to altering the weights) leads to a change in input for all neurons in layer \( n-1 \), which alters that layer's outputs, affecting layer \( n \) and so on. Each of these alterations has some effect on the error, which we need to account for.
For this training we separate our neural network into 3 sections:
- The first-layer is the input-layer, where we just load the input sample
- Any other-layers before the final-layer are hidden-layers
- The final-layer for a perceptron is the output-layer
We update our potential and output equations from before to be:
\begin{align*}
h_i^{(k)} &= \sum_{j} w_{ij}^{(k)} x_{j}^{(k-1)} \\
x_i^{(k)} &= f(h_i^{(k)})
\end{align*}
for some layer \( k \). Note that \( h_i^{(k)} \) is defined recursively: each layer depends on the output of the previous layer.
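As a concrete reference, here is a minimal sketch (not from the original notes) of this recursive forward pass in Python/NumPy; the layer sizes, the random weight initialisation, and the choice of \( \tanh \) as \( f \) are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, f=np.tanh):
    """Propagate one input sample through each layer in turn.

    weights[k] is the matrix mapping layer k's output to the next layer's
    potential; f is the activation function applied element-wise.
    Returns the potentials h and outputs x of every layer, which the
    backward pass will need later.
    """
    hs, xs = [], [x]              # x^(0) is just the input sample
    for W in weights:
        h = W @ xs[-1]            # h_i^(k) = sum_j w_ij^(k) x_j^(k-1)
        hs.append(h)
        xs.append(f(h))           # x_i^(k) = f(h_i^(k))
    return hs, xs

# Illustrative 3-2-1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
hs, xs = forward(np.array([0.5, -1.0, 2.0]), weights)
```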
The definition of our error function remains relatively unchanged, as it depends only on the output of our network and the expected output \( t_{\mu,i} \):
\begin{align} E = \frac{1}{2N} \sum_{\mu} \sum_{i} \left(t_{\mu,i} - x_{\mu,i}^{(n)}\right)^2 \label{eq:ef} \end{align}
Note: recall the final layer in our network is \( n \), therefore \( x^{(n)} = x^{\text{out}} \). This is essentially the squared difference of each output-layer neuron from its expected value, summed and averaged over the \( N \) samples.
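Continuing the sketch above (still an illustrative assumption, not the notes' own code), the error \( E \) defined above is just this averaged sum of squares over a batch:

```python
def error(outputs, targets):
    """E = 1/(2N) * sum_mu sum_i (t_mu,i - x_mu,i^(n))^2."""
    outputs = np.asarray(outputs)   # shape (N, output_dim): x^(n) for each sample mu
    targets = np.asarray(targets)   # shape (N, output_dim): t for each sample mu
    N = len(outputs)
    return np.sum((targets - outputs) ** 2) / (2 * N)
```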
Now we must define our back-propagation update rule for \( \eqref{eq:ef} \) using the chain-rule formula:
\begin{align} \label{eq:up-eq} \frac{\partial E}{\partial w_{jl}^{(n-1)}} = \sum_{\mu} \frac{\partial E}{\partial x_{\mu, j}^{(n-1)}} \frac{\partial x_{\mu, j}^{(n-1)}}{\partial w_{jl}^{(n-1)}} \end{align}
Separating this out, we get essentially the same form as before: we still want to find how the error changes with a change in weight. Where this differs from the previous form is that it is parameterised in terms of the output of the previous layer \( n-1 \) (the input to the \( n^{\text{th}} \) layer). The left section of the chain-expansion in \( \eqref{eq:up-eq} \) specifies how the error changes as we change the hidden-layer outputs (the inputs to the final layer). The right section specifies how those hidden-layer outputs change with the weights.
We resolve the [see page 11, left] section of \( \eqref{eq:up-eq} \) to be:
\begin{align} \label{eq:up-eq/left} \frac{\partial E}{\partial x_{\mu, j}^{(n-1)}} &= - \frac{1}{N} \sum_{i} (t_{\mu,i} - x_{\mu,i}^{(n)}) \frac{\partial x_{\mu,i}^{(n)}}{\partial x_{\mu,j}^{(n-1)}} \\
&= - \frac{1}{N} \sum_{i} (t_{\mu,i} - x_{\mu,i}^{(n)}) f'(h_{\mu,i}^{(n)}) w_{ij}^{(n)}
\end{align}
The [see page 13, right] section is resolved using another application of the chain-rule:
\begin{align} \frac{\partial x_{\mu, j}^{(n-1)}}{\partial w_{jl}^{(n-1)}} &= \frac{\partial f(h_{\mu,j}^{(n-1)})}{\partial h_{\mu,j}^{(n-1)}} \frac{\partial h_{\mu,j}^{(n-1)}}{\partial w_{jl}^{(n-1)}} \\
&= f'(h_{\mu,j}^{(n-1)}) x_{\mu,l}^{(n-2)} \label{eq:up-eq/right}
\end{align}
We can substitute \( \eqref{eq:up-eq/left} \) and \( \eqref{eq:up-eq/right} \) back into \( \eqref{eq:up-eq} \) to find the final update-rule:
\begin{align} \frac{\partial E}{\partial w_{jl}^{(n-1)}} = - \frac{1}{N} \sum_{\mu} \left( \sum_{i} (t_{\mu,i} - x_{\mu,i}^{(n)}) f'(h_{\mu,i}^{(n)}) w_{ij}^{(n)} \right) f'(h_{\mu,j}^{(n-1)}) x_{\mu,l}^{(n-2)} \end{align}
And the resulting delta-rule:
\begin{align} \Delta{w_{jl}^{(n-1)}} &= - \eta \frac{\partial E}{\partial w_{jl}^{(n-1)}} \\
&= - \eta \left( - \frac{1}{N} \sum_{\mu} \sum_{i} (t_{\mu,i} - x_{\mu,i}^{(n)}) f'(h_{\mu,i}^{(n)}) w_{ij}^{(n)} f'(h_{\mu,j}^{(n-1)}) x_{\mu,l}^{(n-2)} \right) \\
&= \frac{\eta}{N} \sum_{\mu} \sum_{i} \delta_{\mu,i}^{(n)} w_{ij}^{(n)} f'(h_{\mu,j}^{(n-1)}) x_{\mu,l}^{(n-2)}
\end{align}
Where:
\begin{align*} \delta_{\mu,i}^{(n)} &= \text{the local gradient for an output-layer neuron \( i \)} \\
&= (t_{\mu,i} - x_{\mu,i}^{(n)}) f'(h_{\mu,i}^{(n)})
\end{align*}
\begin{align*} \delta_{\mu,j}^{(n-1)} &= \text{the local gradient for a hidden-layer neuron \( j \)} \\
&= f'(h_{\mu,j}^{(n-1)}) \sum_{i} \delta_{\mu,i}^{(n)} w_{ij}^{(n)}
\end{align*}
Note: using the local-gradient for a hidden layer we can [see page 15, write] a nicer delta-rule that applies to hidden-layers only: \[ \Delta{w_{jl}^{(k-1)}} = \frac{\eta}{N} \sum_{\mu} \delta_{\mu,j}^{(k-1)} x_{\mu,l}^{(k-2)} \] The delta-rule for an output-layer neuron stays the same as for a single-layer perceptron.
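As an illustration only (continuing the earlier Python/NumPy sketch, with \( \tanh \) and its derivative as assumptions rather than anything prescribed by the notes), the local-gradient form translates into a short backward pass: compute \( \delta^{(n)} \) at the output layer, propagate \( \delta^{(k-1)} = f'(h^{(k-1)}) \sum_i \delta_{i}^{(k)} w_{ij}^{(k)} \) backwards, then take each layer's weight update as the outer product of its local gradient with the previous layer's output, averaged over the batch.

```python
def backward(hs, xs, target, weights, f_prime=lambda h: 1.0 - np.tanh(h) ** 2):
    """Return one sample's gradient contribution for every weight matrix.

    hs, xs come from forward(); deltas[k] holds the local gradients of the
    layer fed by weights[k].
    """
    n = len(weights)
    deltas = [None] * n
    # Output layer: delta^(n) = (t - x^(n)) f'(h^(n))
    deltas[-1] = (target - xs[-1]) * f_prime(hs[-1])
    # Hidden layers: delta^(k-1) = f'(h^(k-1)) sum_i delta_i^(k) w_ij^(k)
    for k in range(n - 2, -1, -1):
        deltas[k] = f_prime(hs[k]) * (weights[k + 1].T @ deltas[k + 1])
    # Delta w_jl is proportional to delta_j times x_l of the layer below.
    return [np.outer(deltas[k], xs[k]) for k in range(n)]

# One delta-rule step over a (made-up) batch: Delta w = eta/N sum_mu delta x.
eta = 0.1
batch = [(np.array([0.5, -1.0, 2.0]), np.array([1.0]))]
grads = [np.zeros_like(W) for W in weights]
for sample, target in batch:
    hs, xs = forward(sample, weights)
    for g, dW in zip(grads, backward(hs, xs, target, weights)):
        g += dW
weights = [W + (eta / len(batch)) * g for W, g in zip(weights, grads)]
```

Because the output-layer \( \delta \) already contains the \( (t - x) \) factor, the update is added rather than subtracted, matching the sign of the delta-rule above.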
Walkthrough
That was a dense wall of math, so an interactive example should be more informative. See the [see page 7, walk-through] of a single pass through this kind of network, including error calculation and weight updates.
TODO: Custom walkthrough.
Deep vs. Shallow
A deep neural network is one with many layers, as opposed to a shallow one:

Network | Description |
---|---|
Deep | One with many layers |
Shallow | One with a single, very wide layer |
In principle one large shallow layer would be sufficient for a network; in practice, however, many more layers are better.
This is especially true when [see page 13, using] the ReLU activation function with deep networks.
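For reference, here is a minimal sketch of ReLU and its derivative (illustrative, in the same NumPy style as above); because \( f'(h) = 1 \) for any positive potential, repeated multiplication by \( f' \) during back-propagation does not shrink the gradients the way a saturating activation such as \( \tanh \) can.

```python
def relu(h):
    """ReLU activation: max(0, h), element-wise."""
    return np.maximum(0.0, h)

def relu_prime(h):
    """Derivative of ReLU: 1 where the potential is positive, 0 elsewhere."""
    return (h > 0).astype(float)
```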
[see page 17, Take Home Message]
- Adding hidden-layers allows higher-dimensional features to be extracted, giving better classification (or regression).
- The mechanism of passing gradients back to each neuron using the same synaptic weights is not considered biologically plausible.
- As we add more layers, repeated applications of \( f' \) can lead to very small changes in weights (vanishing gradients).