Reinforcement Neural Network
A class of reinforcement learning algorithms that uses neural networks.
Formulation
Formulated by [see page 6, encoding] our state in the pre-synaptic neurons and the actions we can take as the post-synaptic neurons. We can then define \[ Q(s,a) = w_a x \] as the [see page 4, rate of postsynaptic return], where \( w_a \) is the weight vector connecting the current state to the action neuron \( a \) and \( x \) is the pre-synaptic activity vector encoding the state.
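As a concrete illustration, here is a minimal NumPy sketch of this formulation, assuming the state is encoded as a one-hot pre-synaptic activity vector \( x \) and the weights are stored as a matrix `W` with one row \( w_a \) per action; the names and sizes are illustrative assumptions, not from the slides.

```python
import numpy as np

n_states, n_actions = 5, 3                              # assumed sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_actions, n_states))   # row a of W is w_a

def encode_state(s, n_states=n_states):
    """Pre-synaptic activity: a one-hot encoding of state s."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def q_value(W, x, a):
    """Q(s, a) = w_a x, the activity of post-synaptic action neuron a."""
    return W[a] @ x

x = encode_state(2)
print([q_value(W, x, a) for a in range(n_actions)])
```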
Update Rules
With this formulation we [see page 8, define] our error function as: \[ E(a) = \frac{1}{2} (r - Q(s,a))^2 \] where \( r \) is the reward received.
The [see page 9, derivative] of this function is:
\begin{align} \frac{\partial E(a)}{\partial w_a} &= -(r - Q(s,a)) \frac{\partial Q(s,a)}{\partial w_a} \\
&= -(r - Q(s,a)) \frac{\partial (w_a x)}{\partial w_a} \\
&= -(r - Q(s,a)) x^T \label{eq:err-grad}
\end{align}
This gradient can be used to define the weight update rule:
\begin{align} \Delta{w_a} &= - \eta \frac{\partial E(a)}{\partial w_a} \\
&= \eta (r - Q(s,a)) x^T \label{eq:weight-delta}
\end{align}
Addendum: Taking an action
In this formulation we only want to consider the contribution from the action we take and disregard the actions we didn't take. To do so we [see page 10, define] a new one-hot vector with one entry per available action:
\begin{align} \label{eq:choice} y_a = \begin{cases} 1 & a = \text{action taken}\\ 0 & \text{otherwise} \end{cases} \end{align}
Now we can substitute \eqref{eq:choice} back into \eqref{eq:weight-delta} to get a weight update equation that only updates the weights of the chosen action: \[ \Delta w = \eta (r - Q(s,a)) \, y x^T \]
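Equivalently, the masked update can be computed as a single outer product with the one-hot action vector \( y \); a sketch under the same assumed NumPy setup:

```python
def update_all_weights(W, x, action, r, eta=0.1):
    """Delta w = eta * (r - Q(s, a)) * y x^T, where y is one-hot in the taken action."""
    y = np.zeros(W.shape[0])
    y[action] = 1.0
    delta = r - W[action] @ x              # r - Q(s, a)
    W += eta * delta * np.outer(y, x)      # only the chosen action's row is non-zero
    return W
```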
Relation to Hebbian Learning
Observe that the delta-rule for this network can be [see page 11, rephrased] as: \[ \Delta w = \eta \, (\text{reward} - \text{expected reward}) \times \text{post-synaptic activity} \times \text{pre-synaptic activity} \] where \( \text{reward} - \text{expected reward} \) is the global reward signal of the network.
This learning rule [see page 14, matches] the form of a Hebbian learning rule, with two neurons (state -> action) that fire together reinforcing their connection. Note, however, that this rule lacks an error signal and instead relies on a globally available (independent of \( a \)) reward signal.
We [see page 14, require] both pre-synaptic and post-synaptic activity, and that activity is conditioned on the reward.
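Written as a generic three-factor, reward-modulated Hebbian update, the rule is just pre-synaptic activity times post-synaptic activity times a global modulatory signal; a sketch (the function name and shapes are assumptions continuing the setup above):

```python
def three_factor_update(W, pre, post, global_signal, eta=0.1):
    """Reward-modulated Hebbian update:
    Delta w = eta * global_signal * post * pre (outer product)."""
    return W + eta * global_signal * np.outer(post, pre)

# The delta-rule above is the special case
#   pre = x, post = y (one-hot action), global_signal = r - Q(s, a).
```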
Comparison to Unsupervised Learning
See [see page 15, comparison] to unsupervised learning with receptive fields.
With unsupervised learning we exploit statistical correlations: weights increase at correlated locations, allowing us to identify certain correlations in the input.
With reinforcement learning we instead learn the new behaviour directly: the weights change in ways that maximise reward and avoid low- or negative-reward behaviour.