Q-learning

Tags: adaptive-intelligence

A [see page 11, variant] of the SARSA algorithm that, instead of deferring to the policy to choose the next action used in the future-reward term, always uses the greedy choice: the next action with the highest estimated value (Q-value).
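
The difference from SARSA comes down to one term in the update rule. As a sketch in the usual textbook notation (the learning rate $\alpha$, discount factor $\gamma$, and the symbols $s, a, r, s', a'$ are assumed here, not taken from these notes):

$$
\begin{aligned}
\text{SARSA:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right], \quad a' \text{ chosen by the policy} \\
\text{Q-learning:}\quad & Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
\end{aligned}
$$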

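Known as an off-policy algorithm because the update target always assumes the greedy action in the next state, regardless of which action the behaviour policy (e.g. an exploratory one) actually takes. The values learned are therefore those of the greedy policy, not of the policy being followed.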
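
A minimal tabular sketch of the two updates side by side, to make the off-policy point concrete. The function names, the `(state, action)` dictionary layout, and the default `alpha` and `gamma` values are all illustrative assumptions, not something these notes specify.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """One tabular Q-learning step. Q maps (state, action) -> estimated value.

    alpha (learning rate) and gamma (discount) are assumed hyperparameters.
    """
    # Off-policy: bootstrap from the best next action, whatever is taken next.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """One tabular SARSA step, shown only for comparison."""
    # On-policy: bootstrap from the action the policy actually chose.
    td_target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]
```

Note that `q_learning_update` takes no `next_action` argument at all: the action the agent goes on to take has no effect on the update, whereas `sarsa_update` needs it.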