Brain Dump

Artificial Neurons

Tags
adaptive-intelligence

A mathematical model of the neuron.

Excitation

The excitation equation has the same form as that of a regular (biological) neuron; only some of the quantities are named differently.

\begin{align}
h_i &= \sum_{j} w_{ij} \times x_j \\
y_i &= f(h_i)
\end{align}
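As a sketch of the element form above, the following Python snippet computes \( h_i \) and \( y_i \) for a single neuron. The weight and input values, and the choice of the sigmoid as \( f \), are illustrative assumptions rather than values from the text.

```python
import numpy as np

def sigmoid(h):
    """One possible choice of activation f; any non-linearity could be used."""
    return 1.0 / (1.0 + np.exp(-h))

# Illustrative weights w_ij for one output neuron i and its inputs x_j.
w_i = np.array([0.2, -0.5, 0.1, 0.8])
x   = np.array([1.0,  0.3, 0.7, 0.2])

h_i = np.sum(w_i * x)   # h_i = sum_j w_ij * x_j (multiply-accumulate)
y_i = sigmoid(h_i)      # y_i = f(h_i)

print(h_i, y_i)
```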

Matrix Notation

Observe that we have a collection of input neurons which we enumerate with the \( j \) variable. We multiply each input by the corresponding connection weight and sum the results. This multiply-accumulate operation is essentially a matrix multiplication, so the whole layer can be written as \( y = f(Wx) \).

The above element notation can be written equivalently for an entire layer of input neurons using matrix notation:

\begin{align}
h &=
    \begin{pmatrix}
        w_{11} & w_{12} & w_{13} & w_{14} \\
        w_{21} & w_{22} & w_{23} & w_{24} \\
        w_{31} & w_{32} & w_{33} & w_{34} \\
        w_{41} & w_{42} & w_{43} & w_{44}
    \end{pmatrix}
    \times
    \begin{pmatrix}
        x_1 \\
        x_2 \\
        x_3 \\
        x_4
    \end{pmatrix} \\
&=
    \begin{pmatrix}
        h_1 \\
        h_2 \\
        h_3 \\
        h_4
    \end{pmatrix}
\end{align}

Here we have an input layer of 4 neurons propagating to an output layer of 4 neurons. Each row of the weight matrix holds the connection weights into one output neuron, so the number of rows equals the number of output-layer neurons and the number of columns equals the number of input-layer neurons.

Once we've propagated through the network we also have to apply the activation function to get \( y_i \).

\begin{align*}
y &=
    \begin{pmatrix}
        f(h_1) \\
        f(h_2) \\
        f(h_3) \\
        f(h_4)
    \end{pmatrix} \\
&= \ldots
\end{align*}
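A minimal sketch of the same computation in matrix form, assuming a 4-neuron input layer, a 4-neuron output layer, random illustrative weights, and (again as an assumption) the sigmoid as the activation \( f \):

```python
import numpy as np

def f(h):
    """Element-wise activation; the sigmoid is assumed here for illustration."""
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)

W = rng.normal(size=(4, 4))   # row i holds the weights into output neuron i
x = rng.normal(size=4)        # input-layer activations x_1 .. x_4

h = W @ x                     # h = Wx: one multiply-accumulate per output neuron
y = f(h)                      # apply the activation element-wise to get y

print(h.shape, y.shape)       # both (4,)
```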

Transformation Functions

The transformation function takes the output potential of a neuron and transforms it non-linearly to approximate the excitation of a true neuron.

Each of the following excitation functions uses an implied bias value, i.e. a minimum threshold. For any activation function \( f(x) \) we actually evaluate \( f(x - \Theta) \).
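The snippet below sketches the implied-threshold idea: the activation is evaluated at \( x - \Theta \) rather than \( x \). The particular threshold value and the sigmoid choice are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def activate(x, theta):
    """Apply f(x - theta): the neuron only responds strongly above the threshold."""
    return sigmoid(x - theta)

theta = 2.0                   # illustrative threshold
print(activate(1.0, theta))   # below the threshold -> output well under 0.5
print(activate(3.0, theta))   # above the threshold -> output over 0.5
```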

Sigmoid

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \] Non-linear, continuous (finite gradient), and saturates for extreme values (low gradient for inputs of large magnitude).
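A short sketch of the sigmoid and its saturation behaviour; the derivative identity \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) is standard and is used here to show the gradient shrinking toward zero for large \( |x| \).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x)): largest at x = 0, tiny for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))   # gradient saturates toward 0 at the extremes
```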

Hyperbolic Tangent

\[ \tanh(x) \] Similar to the sigmoid, but uses a bias input weight for training the threshold.
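A sketch of the bias-input-weight idea mentioned above, assuming it refers to the common trick of appending a constant-1 input whose weight acts as the learned threshold; the weight values are illustrative.

```python
import numpy as np

# Illustrative weights; the last entry is the bias weight paired with a constant 1 input.
w = np.array([0.4, -0.7, 0.3, 0.9, -0.5])

def excite_tanh(x):
    """Append a constant 1 so the bias/threshold is just another trainable weight."""
    x_aug = np.append(x, 1.0)
    return np.tanh(w @ x_aug)

print(excite_tanh(np.array([1.0, 0.2, -0.3, 0.5])))
```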

ReLU

\[ \max(0, x) \] Constant gradient for inputs greater than 0.

With ReLU the gradients don't vanish as much as with sigmoid or tanh, so we generally prefer it for deeper networks.

The main issue with ReLU is that it has no gradient for inputs below 0.
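A minimal sketch of ReLU and its (sub)gradient, illustrating the constant slope above 0 and the zero slope below 0:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 1 for x > 0 and 0 for x < 0 (taken as 0 at x = 0 by convention)."""
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(xs))  # [0. 0. 0. 1. 1.]
```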

Dead and Leaky ReLUs

A dead ReLU outputs 0 for any input (typically because it has learned a large negative bias term). Such a neuron takes no part in discriminating inputs for classification and is also unlikely to recover, because the gradient of a ReLU at this point is also 0. We update the weights by moving along the negative gradient, but if the gradient is 0 we never update the weights.

Note: Both the sigmoid and hyperbolic-tangent functions also suffer from this issue, but because they always have some gradient (albeit a small one) they can recover in the long term.

A leaky ReLU is a ReLU that, instead of having a 0 gradient for inputs below 0, has a small gradient there.
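The sketch below contrasts the two cases: with a plain ReLU, a unit whose pre-activation is always negative gets zero gradient and so never updates, while a leaky ReLU still passes a small gradient through (the slope of 0.01 below zero is an assumed, commonly used value).

```python
import numpy as np

def relu_grad(h):
    return (h > 0).astype(float)

def leaky_relu(h, alpha=0.01):
    """alpha is the small slope used below 0; 0.01 is an illustrative choice."""
    return np.where(h > 0, h, alpha * h)

def leaky_relu_grad(h, alpha=0.01):
    return np.where(h > 0, 1.0, alpha)

h = np.array([-3.0, -1.0])                 # a "dead" unit: pre-activation always negative
print(relu_grad(h))                        # [0. 0.] -> weight updates are zero, unit cannot recover
print(leaky_relu(h), leaky_relu_grad(h))   # small negative outputs; gradient 0.01 keeps learning alive
```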