Q-values
We [see page 4, define] the \( Q \)-value: \[ Q^{*}(s,a) = \sum_{s'} P^a_{s \longrightarrow s'} R^a_{s \longrightarrow s'} \] as the true expected reward of taking an action \( a \) in a state \( s \), where:
- \( P^a_{s \longrightarrow s'} \) is the probability of moving from state \( s \) to \( s' \).
- \( R^a_{s \longrightarrow s'} \) is the reward of moving from state \( s \) to \( s' \).
For each known action \( a \), this is essentially a summation over all possible successor states \( s' \) of the initial state \( s \), weighted by how likely each transition is. If we fix the starting state \( s \) (such as when deciding which action to take) we can use the Q-value as a measure of the expected reward of each action. See [see page 5, example].
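As an illustration, here is a minimal sketch of this computation, assuming the transition probabilities and rewards are stored as nested dictionaries (the states, actions, and numbers below are invented for the example):

```python
# Hypothetical known model: P[s][a][s'] = transition probability,
# R[s][a][s'] = reward for the transition s --a--> s'.
P = {
    "s0": {"left":  {"s1": 0.8, "s2": 0.2},
           "right": {"s1": 0.1, "s2": 0.9}},
}
R = {
    "s0": {"left":  {"s1": 1.0, "s2": -1.0},
           "right": {"s1": 0.0, "s2": 2.0}},
}

def q_star(s, a):
    """True expected reward of taking action a in state s:
    Q*(s, a) = sum over s' of P[s][a][s'] * R[s][a][s']."""
    return sum(P[s][a][s_next] * R[s][a][s_next] for s_next in P[s][a])

# With the starting state fixed, compare the Q-value of each action.
for action in P["s0"]:
    print(action, q_star("s0", action))
# left  -> 0.8*1.0 + 0.2*(-1.0) = 0.6
# right -> 0.1*0.0 + 0.9*2.0    = 1.8
```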
Approximate Q-Values
In [see page 7, practice] we rarely have an exact measure of the true rewards and transition probabilities of a system. Therefore we instead use an estimate of the expected reward of taking an action \( a \) in state \( s \): \( Q(s, a) \).
We learn \( Q(s, a) \) by repeating action \( a \) from state \( s \) multiple times and averaging the rewards received:
\begin{align*}
Q_K(s,a) &= \frac{1}{K} \sum_{t=1}^{K} r_t = \frac{r_1 + \ldots + r_K}{K} \\
Q_{K+1}(s,a) &= \frac{1}{K+1} \sum_{t=1}^{K+1} r_t = \frac{r_1 + \ldots + r_K + r_{K+1}}{K+1}
\end{align*}
which can equivalently be written as the incremental update rule \[ \Delta Q(s,a) = \eta \, (r - Q(s, a)), \] where \( r \) is the most recently observed reward and \( \eta \) is a learning rate. Expanding the running average shows why: \[ Q_{K+1}(s,a) = Q_K(s,a) + \frac{1}{K+1}\left(r_{K+1} - Q_K(s,a)\right), \] so choosing \( \eta = \frac{1}{K+1} \) recovers the exact average, while a fixed \( \eta \) gives an exponentially weighted average that favours recent rewards.
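As a quick sanity check, here is a minimal sketch (the reward values are invented for the example) showing that the incremental update with \( \eta = \frac{1}{K+1} \) reproduces the batch average:

```python
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]  # hypothetical rewards observed for one (s, a) pair

# Incremental estimate: Q <- Q + eta * (r - Q) with eta = 1 / (K + 1),
# where K is the number of rewards already averaged into Q.
q = 0.0
for k, r in enumerate(rewards):
    eta = 1.0 / (k + 1)
    q += eta * (r - q)

# Batch estimate: plain average of all observed rewards.
q_batch = sum(rewards) / len(rewards)

print(q, q_batch)  # both are 1.4, up to floating-point rounding
```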