SARSA Algorithm
A reinforcement learning algorithm for neural networks which takes future rewards into account.
Note: SARSA is an abbreviation for State->Action->Reward->State->Action.
Think two steps ahead to win one step in advance.
TODO: Mention why future rewards are needed: with the classic formulation above, the reward from an action could be 0, leading to \( \Delta{Q} = 0 \), meaning \( Q \) will never change if the initial \( Q \) values are 0.
Known as an on-policy algorithm because of step 3 in the formulation below: the next action \( a' \) is chosen by the same policy that is being learned.
Formulation
The SARSA [see page 8, algorithm] considers, for each action, the reward from the subsequent action as well. Steps:
1. Being at state \( s \), take action \( a \) based on the policy.
2. Observe the reward, \( r \), from the chosen action and the next state \( s' \).
3. From state \( s' \), take action \( a' \) based on the policy.
4. Update the Q-values using:
\begin{align} \label{eq:delta-q} \Delta Q(s,a) = \eta [r - (Q(s,a) - \gamma Q(s', a'))] \end{align}
5. Change \( s \) to \( s' \) and \( a \) to \( a' \).
6. Go back to step 1 and repeat until the end of the experiment.
You can see this algorithm as [see page 8, pseudocode].
If we define the bracketed term of eq:delta-q as:
\begin{align} \delta = r - (Q(s,a) - \gamma Q(s', a')) \end{align}
We can specify our weight update equation as:
\begin{align} \Delta w_{ij} = \eta \delta y_i x_j \end{align}
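To make the loop concrete, here is a minimal tabular sketch of the steps above in Python. The environment interface (`env.reset()`, `env.step(action)`) and the ε-greedy policy are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch; assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    Q = np.zeros((n_states, n_actions))

    def policy(state):
        # epsilon-greedy over the current Q estimates (an assumed policy choice)
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)                       # step 1: choose a in s from the policy
        done = False
        while not done:
            s_next, r, done = env.step(a)   # step 2: observe r and s'
            a_next = policy(s_next)         # step 3: choose a' in s' from the same policy
            delta = r - (Q[s, a] - gamma * Q[s_next, a_next])  # bracketed term of eq:delta-q
            Q[s, a] += eta * delta          # step 4: update Q(s, a)
            s, a = s_next, a_next           # step 5: shift s <- s', a <- a'
    return Q
```

Because step 3 uses the same policy that generated the behaviour, this is the on-policy update mentioned at the top.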
Discount Factor
\( \gamma \), \( 0 < \gamma \leq 1 \), is the discount factor: essentially a tuning parameter used to specify how we value future rewards relative to current rewards. Consider for example: is £10.00 now better than £10.00 in an hour? If the two are worth the same to us then there is no reason to wait, so a discount factor \( \gamma < 1 \) captures this preference for earlier rewards.
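As a small worked example (illustrative numbers, not from the lecture), with \( \gamma = 0.9 \) the same £10.00 is valued differently depending on when it arrives:
\begin{align*} \text{now: } \gamma^0 \times 10 = 10, \qquad \text{one step later: } \gamma^1 \times 10 = 9 \end{align*}
so the agent prefers the earlier reward whenever \( \gamma < 1 \).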
Eligibility Trace
The SARSA algorithm only looks one state-action pair ahead, so it has a runtime behaviour of \( \mathcal{O}(N_{\text{states}} \times N_{\text{actions}}) \).
Eligibility traces try to improve on this by also rewarding intermediate actions in the updates of a neural network, by [see page 8, recording] each intermediate action in a trace matrix. This matrix is decayed as newer actions are taken, so that the newest actions have the largest trace and the oldest have the smallest. When we update the Q-values for the network we scale the error by the eligibility trace of each state-action pair.
We use \( \lambda \) to control the rate of decay for the trace. If it is too large, unhelpful actions will be rewarded; if it is too small, beneficial actions may not be adequately rewarded.
We use the same general formulation as before, except now we also:
If action \( a \) is taken in state \( s \):
\[ \text{trace}(s, a) \longleftarrow \text{trace}(s, a) + 1 \]
For all state-action pairs:
\begin{align*}
\text{trace}(s, a) &\longleftarrow \gamma \lambda \, \text{trace}(s, a) \\
Q(s,a) &\longleftarrow Q(s,a) + \eta \delta \, \text{trace}(s,a)
\end{align*}
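A minimal sketch of one SARSA(\( \lambda \)) step in Python, assuming a tabular `Q` and `trace` array of the same shape and a sampled transition \( (s, a, r, s', a') \), mirroring the update order in the listing above:

```python
import numpy as np

def sarsa_lambda_step(Q, trace, s, a, r, s_next, a_next,
                      eta=0.1, gamma=0.9, lam=0.8):
    """One SARSA(lambda) update over tabular Q and trace arrays."""
    delta = r - (Q[s, a] - gamma * Q[s_next, a_next])  # same delta as eq:delta-q
    trace[s, a] += 1          # the action just taken gets the largest trace
    trace *= gamma * lam      # decay every trace so older actions matter less
    Q += eta * delta * trace  # update every Q-value in proportion to its trace
    return Q, trace
```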
Addendum: Deriving eq:delta-q Through The Markov Property
eq:delta-q was stated above in the algorithm for SARSA; in this section we'll actually derive it. To do so we'll express several properties of our system mathematically, including:
- Markov Property.
Policy as Probability: the probability we take a given action in a given state.
\begin{align} \label{eq:policy-prob} \pi (s,a) = P(a_t = a | s_t = s) \end{align}
Transition Probabilities: the probability of ending up in state \( s' \) given we're currently in state \( s \) and take action \( a \).
\begin{align} \label{eq:transition-prob} P^a_{ss'} = P({s}_{t+1} = s' | s_t = s, a_t = a) \end{align}
Expected Rewards: The reward we expect to get for taking a given transition.
\begin{align} \label{eq:expected-reward} R^a_{ss'} = E({r}_{t+1} | s_t = s, a_t = a, {s}_{t+1} = s') \end{align}
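These three quantities can be written down directly as arrays, which makes the sums in the derivation below easy to sanity-check. A small sketch with arbitrary toy sizes and random values (purely illustrative):

```python
import numpy as np

n_states, n_actions = 3, 2

# pi[s, a]    = P(a_t = a | s_t = s)                         (eq:policy-prob)
# P[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a)           (eq:transition-prob)
# R[s, a, s'] = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')  (eq:expected-reward)
pi = np.full((n_states, n_actions), 1.0 / n_actions)                   # uniform policy
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions)) # random transitions
R = np.random.randn(n_states, n_actions, n_states)                     # random expected rewards

# Rows of pi and each P[s, a] must be probability distributions.
assert np.allclose(pi.sum(axis=1), 1.0)
assert np.allclose(P.sum(axis=2), 1.0)
```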
We define the [see page 7, bellman equation] as the sum of all the discounted future rewards (considering only the rewards from state/action transitions that lead to the maximum overall reward):
\begin{align} \label{eq:bellman} R_t = {r}_{t+1} + \gamma {r}_{t+2} + {\gamma}^{2} {r}_{t+3} + \ldots = \sum_{k=0}^{\infty} {\gamma}^{k} {r}_{t+k+1} \end{align}
In eq:bellman \({r}_{t+2}\) is the reward from the next state after the current state and is assumed to be the highest reward we could've received for that state.
Using eq:bellman we can define our Q-values as the expected total future reward beginning from the current state and taking a known action.
\begin{align} \label{eq:bellman-q-value} {Q}^{\pi}(s,a) = E_\pi ({R}_{t} | s_t = s, a_t = a) \end{align}
Using the linearity of expectation we can split eq:bellman-q-value into [see page 8, separate] expectation sums:
\begin{align}
{Q}^{\pi}(s,a) &= E_\pi ({r}_{t+1} + \gamma {r}_{t+2} + {\gamma}^{2} {r}_{t+3} + \ldots | s_t = s, a_t = a) \\
&= E_\pi ({r}_{t+1} | s_t = s, a_t = a) + \gamma E_\pi ({r}_{t+2} + \gamma {r}_{t+3} + \ldots | s_t = s, a_t = a) \\
&
\begin{aligned}
&= \sum_{s'} {P}^{a}_{ss'} E_\pi ({r}_{t+1} | s_t = s, a_t = a, {s}_{t+1} = s') \\
&\quad + \sum_{s'} \gamma {P}^{a}_{ss'} E_\pi ({r}_{t+2} + \gamma {r}_{t+3} + \ldots | s_t = s, a_t = a, {s}_{t+1} = s')
\end{aligned} \\
&
\label{eq:bellman-q-value-split}
\begin{aligned}
&= \sum_{s'} {P}^{a}_{ss'} {R}^{a}_{ss'} \\
&\quad + \sum_{s'} \gamma {P}^{a}_{ss'} E_\pi ({r}_{t+2} + \gamma {r}_{t+3} + \ldots | s_t = s, a_t = a, {s}_{t+1} = s')
\end{aligned}
\end{align}
In eq:bellman-q-value-split we've introduced the expected-rewards formula from eq:expected-reward above.
Now we can further [see page 10, divide] the latter half of eq:bellman-q-value-split through the Markov Property, so that the remaining expectation depends solely on the state we end up in, \( s' \), and the action \( a' \) taken from it.
\begin{align}
{Q}^{\pi}(s,a)
&
\begin{aligned}
&= \sum_{s'} {P}^{a}_{ss'} {R}^{a}_{ss'} \\
&\quad + \sum_{s'} \gamma {P}^{a}_{ss'} E_\pi ({r}_{t+2} + \gamma {r}_{t+3} + \ldots | s_t = s, a_t = a, {s}_{t+1} = s')
\end{aligned} \\
&
\begin{aligned}
&= \sum_{s'} {P}^{a}_{ss'} {R}^{a}_{ss'} \\
&\quad + \sum_{s'} \sum_{a'} \gamma {P}^{a}_{ss'} \pi(s', a') E_{\pi} ({r}_{t+2} + \gamma {r}_{t+3} + \ldots | {s}_{t+1} = s', {a}_{t+1} = a')
\end{aligned} \\
&
\label{eq:bellman-q-value-final}
\begin{aligned}
&= \sum_{s'} {P}^{a}_{ss'} {R}^{a}_{ss'} \\
&\quad + \sum_{s'} \sum_{a'} \gamma {P}^{a}_{ss'} \pi(s', a') {Q}^{\pi}(s', a')
\end{aligned}
\end{align}
eq:bellman-q-value-final [see page 12, relates] the Q-values for the current state to the Q-values for the future state. One final alteration we can make is to substitute the optimal greedy policy in place of \( \pi \) to get what is known as [see page 13, Bellman's optimality equation].
\begin{align}
\label{eq:bellman-optimality}
{Q}^{*}(s,a) &= \sum_{s'} {P}^{a}_{ss'} {R}^{a}_{ss'} + \sum_{s'} \gamma {P}^{a}_{ss'} \max_{a'} {Q}^{*}(s', a') \\
&= \sum_{s'} {P}^{a}_{ss'} \left( {R}^{a}_{ss'} + \gamma \max_{a'} {Q}^{*}(s', a') \right)
\end{align}
Recall that the expectation \( E \) we use in eq:bellman-optimality is simply an average of the rewards received by taking a given action in a given state. This means eq:bellman-optimality can be [see page 18, re-structured] as an exponential moving average to get our update rule.
\begin{align}
{Q}_{n+1}(s,a) &= (1 - \eta) {Q}_{n}(s,a) + \eta (r + \gamma \max_{a'} Q_n(s', a')) \\
{Q}_{n+1}(s,a) - Q_n(s,a) &= \eta (r - Q_n(s,a) + \gamma \max_{a'} Q_n(s', a')) \\
\Delta Q(s,a) &= \eta (r - Q_n(s,a) + \gamma \max_{a'} Q_n(s', a'))
\end{align}
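A minimal sketch of this final update rule in Python for a sampled transition. Note the greedy \( \max_{a'} \) target here is the form derived above (Bellman optimality), whereas the on-policy eq:delta-q replaces \( \max_{a'} Q_n(s', a') \) with \( Q_n(s', a') \) for the action the policy actually chose:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    """Exponential-moving-average form of the Bellman optimality backup."""
    target = r + gamma * np.max(Q[s_next])        # r + gamma * max_a' Q_n(s', a')
    Q[s, a] = (1 - eta) * Q[s, a] + eta * target  # Q_{n+1} = (1 - eta) Q_n + eta * target
    return Q
```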
Deep Q-Learning
An extension of the SARSA learning algorithm which uses deep neural networks to map more complex inputs to the expected rewards for taking certain actions.
Formulation
We have a [see page 7, new] error function and weight update rule:
\begin{align*} E(w) = \frac{1}{2} ((r + \gamma Q(s', a', w)) - Q(s, a, w))^2 \end{align*}
\begin{align*} \Delta{{w}_{ij}^k} = - \eta \frac{\partial E}{\partial {w}_{ij}^k} \end{align*}
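A minimal sketch of this update with a linear function approximator in numpy, standing in for the deep network; `features(s, a)` is a hypothetical feature map, and (as is usual for this kind of semi-gradient update) the target \( r + \gamma Q(s', a', w) \) is treated as a constant when differentiating:

```python
import numpy as np

def deep_q_step(w, features, s, a, r, s_next, a_next, eta=0.01, gamma=0.9):
    """One semi-gradient step for Q(s, a, w) = w . features(s, a)."""
    q_sa = w @ features(s, a)
    target = r + gamma * (w @ features(s_next, a_next))  # held fixed during the gradient
    error = target - q_sa
    # E(w) = 0.5 * error^2, so -dE/dw = error * dQ(s,a,w)/dw = error * features(s, a)
    w = w + eta * error * features(s, a)
    return w
```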
TODO: Remaining [see page 9, calculations].
TODO: Copy in the SARSA algorithm demonstration from Lab unicom3240.
Lecture from April 20 2021, beginning from roughly 00:50. The idea you should note is that over time (iterations) the Q-values for earlier state transitions in the [see page 1, exercise] sheet gradually update to match the optimum Q-values.
Also note:
- \( O(n) \) steps to learn the Q-values for \( n \) states.
- Mention the reference to multiple policies: one behaviour policy and one target policy (used to try to learn the optimal policy).
- See [see page 7, speed of learning].