Greedy Reward Policy
A reward policy which always chooses the action that has the maximum Q-value: \[ Q(s, a_i) \geq Q(s, a_j) \quad \forall j \neq i \]
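As a minimal sketch (the state, actions, and Q-values below are invented for illustration), greedy selection is just an argmax over the Q-values available in the current state:

```python
# A hypothetical Q-table: Q[state][action] -> estimated value.
Q = {
    "s0": {"left": 0.2, "right": 0.9, "stay": 0.4},
}

def greedy_action(state: str) -> str:
    """Pick the action with the maximum Q-value in the given state."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("s0"))  # -> "right", since Q(s0, right) = 0.9 is the largest
```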
But this breaks down when we don't know our Q-values:
- ...the Q-values depend on the probability (and reward) of taking each action, which we only learn by acting,
- ...but if we choose our actions based on this policy, the policy cannot rely on knowing the \( Q \) values beforehand.
We have to explore in order to estimate the rewards (Q-values), but we also have to take the actions that appear to optimise our reward, and those choices influence the estimates we end up with.
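To make that dependence concrete, here is a small sketch (assuming a single-state, bandit-style setting with three invented actions) of estimating Q-values as running averages of sampled rewards; only actions we actually take ever get their estimates updated:

```python
# Q estimates and visit counts per action (all start unknown / zero).
q_estimate = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

def update(action: int, reward: float) -> None:
    """Incremental sample-average update: Q <- Q + (r - Q) / n."""
    counts[action] += 1
    q_estimate[action] += (reward - q_estimate[action]) / counts[action]

update(1, 1.0)
update(1, 0.0)
print(q_estimate)  # -> [0.0, 0.5, 0.0]; actions 0 and 2 remain unestimated
```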
Self Support
This approach suffers from self-support: the learning algorithm can get a positive result from one action and then keep taking that action just because its estimate is positive, instead of exploring other options that may be better.
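A toy simulation makes the failure mode visible. The two-armed bandit below (with invented payout probabilities, not from the notes) shows a purely greedy learner locking onto the first action that returns a positive reward:

```python
import random

random.seed(0)

# Action 0 pays out 30% of the time, action 1 pays out 80% of the time,
# so action 1 is clearly better in expectation.
pay_prob = [0.3, 0.8]
q = [0.0, 0.0]   # Q estimates, both start at zero
n = [0, 0]       # visit counts

for step in range(1000):
    # Purely greedy choice over the current estimates
    # (ties broken towards action 0 to keep the run deterministic).
    a = 0 if q[0] >= q[1] else 1
    reward = 1.0 if random.random() < pay_prob[a] else 0.0
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]

# As soon as action 0 yields a positive reward, its estimate becomes the
# maximum, the greedy policy keeps choosing it, and the better action 1
# is never sampled, so its estimate never gets a chance to improve.
print(q, n)   # roughly [0.3, 0.0] and [1000, 0]
```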