Brain Dump

Oja's Rule

Tags
adaptive-intelligence

A Hebbian rule which uses a [see page 5, homeostatic] term to stabilise the change in weights ([see page 13, decreasing] the weights in relation to the strength of the output).

\begin{align} \Delta{w_{ij}} = \alpha v_i^{\text{post}} v_j^{\text{pre}} - \alpha w_{ij} (v_i^{\text{post}})^2 \end{align}

Note: We can [see page 4, rewrite] Oja's rule using the equation for the output of a neuron to get: \[ \Delta{w_{j}} = \alpha y (x_j - w_j y) \] We also need to [see page 5, update] the equation for the average weight update.
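As a minimal sketch of this single-output form (assuming a linear neuron \( y = \sum_j w_j x_j \); NumPy and the function name are my own, not from the source):

```python
import numpy as np

def oja_step(w, x, alpha=0.01):
    """One update of Oja's rule for a single linear output neuron.

    Implements dw_j = alpha * y * (x_j - w_j * y) with y = w . x.
    """
    y = w @ x                          # neuron output under the linear model
    return w + alpha * y * (x - w * y)

# usage: w = oja_step(w, x) for each input pattern x
```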

Neuron Correlation

As we've already established, the minimal Hebbian rule detects correlations in the input. We can show that the homeostatic term in Oja's rule also depends on these correlations, meaning Oja's rule as a whole also detects correlations in the input.

For this we divide Oja's rule into two separate components:

\begin{align} \Delta{w_{ij}} &= \alpha v_i^{\text{post}} v_j^{\text{pre}} - \alpha w_{ij} (v_i^{\text{post}})^2 \\
&= \Delta{w_{ij}^{+}} + \Delta{w_{ij}^{-}} \end{align}

Where:

\begin{align} \Delta{w_{ij}^{+}} &= \alpha v_i^{\text{post}} v_j^{\text{pre}} \label{eq:oja/plus} \\
\Delta{w_{ij}^{-}} &= - \alpha w_{ij} (v_i^{\text{post}})^2 \label{eq:oja/minus} \end{align}

Note: We've already proven that \( \Delta{w_{ij}^{+}} \) depends on the correlation of the inputs, so our goal is to show that \( \langle \Delta{w_{ij}^{-}} \rangle \) also depends on the correlation of the inputs.

By [see page 8, substituting] the neuron-model equation into eq:oja/minus and then [see page 9, expanding] the square term we get:

\begin{align*} \Delta{w_{ij}^{-}} &= - \alpha w_{ij} \left( \sum_{k} w_{ik} v_{k}^{\text{pre}} \right)^2 \\
&= - \alpha w_{ij} \left( \sum_{k} w_{ik} v_{k}^{\text{pre}} \right) \left( \sum_{l} w_{il} v_{l}^{\text{pre}} \right) \\
&= - \alpha w_{ij} \sum_{k} \sum_{l} w_{ik} v_{k}^{\text{pre}} v_{l}^{\text{pre}} w_{il} \end{align*}

We can then average over all \( P \) input patterns in a pass through the data to get:

\begin{align*} \langle \Delta{w_{ij}^{-}} \rangle &= \frac{1}{P} \sum_{\mu = 1}^{P} \Delta{w_{ij}^{-\mu}} \\
&= - \frac{1}{P} \sum_{\mu = 1}^{P} \alpha w_{ij} \sum_{k} \sum_{l} w_{ik} v_{k}^{\mu} v_{l}^{\mu} w_{il} \\
&= - \alpha w_{ij} \sum_{k} \sum_{l} w_{ik} \left( \frac{1}{P} \sum_{\mu=1}^{P} v_{k}^{\mu} v_{l}^{\mu} \right) w_{il} \\
&= - \alpha w_{ij} \sum_{k} \sum_{l} w_{ik} \langle v_k v_l \rangle w_{il} \end{align*}

Recall that \( \langle v_k v_l \rangle \) has the same form as the correlation of the inputs \( k, l \). Therefore we can conclude that because \( \Delta{w_{ij}^{+}} \) and \( \Delta{w_{ij}^{-}} \) both depend on the correlation of the inputs, Oja's rule also detects correlations in the input.
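A quick numerical sanity check of this identity, as a sketch with NumPy and made-up input patterns (with a single output neuron the index \( i \) is dropped):

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, alpha = 500, 4, 0.1
V = rng.normal(size=(P, N))          # v_k^mu: P input patterns, N inputs
w = rng.normal(size=N)               # weights onto the single output neuron

# Average of the per-pattern homeostatic term -alpha * w_j * (sum_k w_k v_k)^2
per_pattern = -alpha * np.outer((V @ w) ** 2, w)   # shape (P, N)
lhs = per_pattern.mean(axis=0)

# The same quantity written with the correlation matrix <v_k v_l>
C = (V.T @ V) / P
rhs = -alpha * w * (w @ C @ w)

assert np.allclose(lhs, rhs)
```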

In [see page 12, Practice]

With uncorrelated (e.g. random) input, Oja's rule quickly decays the weights down to a very small value (because there is no correlation driving a change in weights, just a rapid amount of firing and misfiring over a short period of time). With correlated input we still observe a decay, but it is much smaller than with uncorrelated input.
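One way to see this numerically (a sketch under my own assumed input statistics, not the simulation from the source): mix correlated inputs with purely random ones and watch which weights survive training.

```python
import numpy as np

rng = np.random.default_rng(2)
P, alpha = 4000, 0.005

# Inputs 0-3 share a common source (correlated); inputs 4-7 are pure noise (uncorrelated).
s = rng.normal(size=(P, 1))
V = np.hstack([s + 0.1 * rng.normal(size=(P, 4)), rng.normal(size=(P, 4))])

w = rng.normal(size=8) * 0.5
for _ in range(10):                    # repeated passes over the patterns
    for x in V:
        y = w @ x
        w += alpha * y * (x - w * y)   # Oja's rule, single output neuron

# Weights onto the correlated inputs stay large in magnitude (about 0.5 each,
# up to an overall sign); weights onto the uncorrelated inputs decay towards zero.
print(np.round(w, 2))
```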

Properties

Oja's rule converges to a weight vector with the following [see page 3, properties] (see the numerical sketch after this list):

  1. It is an eigenvector of the correlation matrix.
  2. It has length equal to 1.
  3. It is the eigenvector with the largest eigenvalue.
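A numerical sketch of these properties (the input statistics, learning rate, and number of passes below are my own choices, not values from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated inputs: a common source plus weaker independent noise per channel
P, N, alpha = 5000, 5, 0.002
V = rng.normal(size=(P, 1)) @ rng.normal(size=(1, N)) + 0.3 * rng.normal(size=(P, N))

C = (V.T @ V) / P                      # correlation matrix <v_k v_l>
w = rng.normal(size=N)
for _ in range(50):                    # repeated passes over the data
    for x in V:
        y = w @ x
        w += alpha * y * (x - w * y)

eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue

print(np.linalg.norm(w))               # property 2: length ~ 1
print(abs(w @ top))                    # ~ 1 means w is (up to sign) the top eigenvector
                                       # (properties 1 and 3)
```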

Proof of Properties

Oja's rule converges when \( \langle \Delta{w_{ij}} \rangle = 0 \):

\begin{align} 0 &= \alpha \left( \sum_{k} w_k \langle v_k v_j \rangle - w_j \sum_{k,n} w_k \langle v_k v_n \rangle w_n \right) \\
\sum_{k} w_k \langle v_k v_j \rangle &= w_j \sum_{k,n} w_k \langle v_k v_n \rangle w_n \label{eq:oja-convergence} \end{align}

Observe that in [see page 7, matrix-notation] we can rearrange the left and right components of equation eq:oja-convergence into eq:oja-convergence/left and eq:oja-convergence/right.

\begin{align} \sum_{k} w_k \langle v_k v_j \rangle &= wC \label{eq:oja-convergence/left} \\
w_j \sum_{k,n} w_k \langle v_k v_n \rangle w_n = w (wCw^T) &= w \lambda \label{eq:oja-convergence/right} \end{align}

where \( w \) is the weight vector, \( C \) is the correlation matrix, and \( \lambda = wCw^T \) is a scalar calculated from the weight vector and the correlation matrix.

Note: In practice, when calculating the average change in weights \( \langle \Delta{w_{ij}} \rangle \), we assume the weights are only updated after every sample has been sent through the network. This is why we can use the same weight vector across the whole average, even though in general the weights would change at every iteration.
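In code, that batch-style average might look like the following sketch (the function name and data layout are assumptions on my part):

```python
import numpy as np

def average_oja_update(w, V, alpha=0.01):
    """<dw> over one pass: the weights are held fixed while the per-pattern
    Oja updates are averaged; the result is only applied afterwards."""
    Y = V @ w                                        # outputs for every pattern, fixed w
    dW = alpha * Y[:, None] * (V - np.outer(Y, w))   # per-pattern dw = alpha*y*(x - w*y)
    return dW.mean(axis=0)

# usage: w = w + average_oja_update(w, V)
```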

Now let's [see page 9, consider] the original formula we'd like to solve once again:

\begin{align} \sum_{k} w_k \langle v_k v_j \rangle &= w_j \sum_{k,n} w_k \langle v_k v_n \rangle w_n \\
wC &= w \lambda \label{eq:oja-eigen} \end{align}

Observe that this has exactly the same form as an eigenvector relation. If we then multiply each side of eq:oja-eigen on the right by \( w^T \):

\begin{align} (wC) w^T &= (w \lambda) w^T \\
\lambda &= |w|^2 \lambda \label{eq:oja-eigen/id} \\
1 &= |w|^2 \\
1 &= |w| \end{align}

We find that the length of the eigenvector \( w \) is 1.

Note: In step eq:oja-eigen/id we substitute \( wCw^T = \lambda \) on the left, and on the right we move \( \lambda \) in front of \( w w^T = |w|^2 \) because it is a scalar.

Proof of Stability

Oja's rule's [see page 9, magnitude] is also stable.

TODO: Copy over proof of [see page 9, dynamical-stability] of Oja's rule. TODO: Cleanup

[see page 11, Here] we divide our weight vector into two eigenvectors (equivalently, this can be seen as taking the weight eigenvector, adding an offset (the \( b \) term), and then subtracting that offset to get back to our original weight vector). The goal is to show that the \( b \) term is always decreasing, bringing us back to our original weight vector.

Note: Each output neuron has the dynamical-stability property, meaning that in a simple feedforward network with N output neurons they will all converge to the same principal component. We need some interaction between output neurons to avoid this.

Multiple Output Neurons

With [see page 4, multiple] output neurons, the weights will each converge to the same weight vector, meaning all the output neurons learn the same weights and show the same behaviour. There needs to be some sort of interaction between output neurons, such as with competitive learning, to prevent them all learning the same behaviour.

Because of this, the notes above act as if we have only one output neuron and thus drop the extraneous index (writing \( w_k \) for the weight between an input neuron \( k \) and the single output neuron \( i \)).
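A quick illustration of this problem, as a sketch under assumed input statistics: two output neurons trained independently with Oja's rule on the same correlated input end up with (essentially) the same weight vector.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, alpha = 2000, 6, 0.005
V = rng.normal(size=(P, 1)) @ rng.normal(size=(1, N)) + 0.2 * rng.normal(size=(P, N))

W = rng.normal(size=(2, N))            # two output neurons, independent weights
for _ in range(20):
    for x in V:
        y = W @ x                      # each output neuron applies Oja's rule on its own
        W += alpha * (np.outer(y, x) - (y ** 2)[:, None] * W)
    # note: no interaction between the two rows of W

# Cosine similarity of the two weight vectors: ~1 means they are parallel (up to sign).
print(abs(W[0] @ W[1]) / (np.linalg.norm(W[0]) * np.linalg.norm(W[1])))
```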