Reinforcement Learning

Tags: adaptive-intelligence

A category of learning algorithms that rewards or punishes actions to reinforce beneficial behaviour. Consider a game: at each point in time the game is in some state, and the system can perform an action to move it to a new state. Reinforcement learning works by learning which actions lead to a good outcome (maximise reward).

The goal of reinforcement learning is:

Updating the parameters so that the total reward is maximised.

Warn: Reinforcement learning can tell whether an action was positive or negative, but not exactly how right or wrong it was.

Take the board game Go as an example. When teaching an AI to play Go we don't want it to learn **exactly** which moves to make; we want it to learn from the results of its actions whether they made a net positive or negative contribution to the end result. It's a game of **reward and punishment**.
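A minimal sketch of this reward-and-punishment idea, assuming a hypothetical table of move preferences keyed by (state, action) labels invented for illustration. Every move played in a game is nudged up after a win and down after a loss. The signal only says whether the game went well, not which move was responsible.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Move = Tuple[str, str]  # (state, action); both are hypothetical string labels


def update_preferences(
    preferences: Dict[Move, float],
    moves: List[Move],
    won: bool,
    step: float = 1.0,
) -> None:
    """Reward every move of a won game and punish every move of a lost one."""
    signal = step if won else -step
    for move in moves:
        preferences[move] += signal


preferences: Dict[Move, float] = defaultdict(float)

# Two made-up games: the first was won, the second was lost.
update_preferences(preferences, [("opening", "corner"), ("midgame", "attack")], won=True)
update_preferences(preferences, [("opening", "centre"), ("midgame", "attack")], won=False)
```

The move that appears in both a won and a lost game ends up with a neutral score, which is the "net contribution" idea in miniature.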

Formulation

A reinforcement learning system is defined with 3 core components:

An agent performing actions
List of available actions for each state
Reward function, associating the result of an action with some quantitative value.
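The three components can be sketched roughly as follows. The toy game, its state names, and the deterministic `NEXT_STATE` table are all assumptions made up for illustration, and the agent here acts at random rather than learning.

```python
import random
from typing import Dict, List, Tuple

State = str
Action = str

# Component 2: the actions available in each state (hypothetical toy game).
ACTIONS: Dict[State, List[Action]] = {
    "start": ["left", "right"],
    "middle": ["left", "right"],
    "goal": [],  # terminal state, no actions left
}

# Toy dynamics, assumed deterministic to keep the sketch short.
NEXT_STATE: Dict[Tuple[State, Action], State] = {
    ("start", "left"): "start",
    ("start", "right"): "middle",
    ("middle", "left"): "start",
    ("middle", "right"): "goal",
}


def reward(state: State, action: Action, next_state: State) -> float:
    """Component 3: associate the result of an action with a number."""
    return 1.0 if next_state == "goal" else 0.0


def random_agent(state: State, rng: random.Random) -> Action:
    """Component 1: an agent that picks any available action (no learning yet)."""
    return rng.choice(ACTIONS[state])


def run_episode(rng: random.Random, max_steps: int = 20) -> float:
    """Play one game and return the total reward collected."""
    state, total = "start", 0.0
    for _ in range(max_steps):
        if not ACTIONS[state]:  # reached a terminal state
            break
        action = random_agent(state, rng)
        next_state = NEXT_STATE[(state, action)]
        total += reward(state, action, next_state)
        state = next_state
    return total
```

Calling `run_episode(random.Random(0))` plays one game; a learning agent would replace `random_agent` with something that uses the rewards it has seen.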

Samples are supplied as: \[ \{({s}^{1}, {a}^{1}, {r}^{11}), ({s}^{1}, {a}^{2}, {r}^{12}), ({s}^{2}, {a}^{1}, {r}^{21}), \ldots \} \] where we define:

\( s \) is the state of the game before a given action.
\( a \) is an action the agent can take.
\( R^a_{s \longrightarrow s'} \) is the reward for performing an action \( a \) in a given state \( s \) and moving to a new state \( s' \); it appears as \( r \) in the samples above.
\( s' \) is the state after performing an action.

Given these constructs and a reward policy we've got a reinforcement learning system.
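As a rough illustration of how such samples could be used, the sketch below averages the observed reward for each (state, action) pair and then picks the best-scoring action per state. The sample values are invented, and the approach is a deliberate simplification: it looks only at immediate reward and ignores the new state \( s' \).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (state s, action a, reward r)

# Hypothetical samples, invented for illustration.
samples: List[Sample] = [
    ("s1", "a1", 0.0),
    ("s1", "a2", 1.0),
    ("s2", "a1", 1.0),
    ("s2", "a2", 0.0),
    ("s1", "a2", 1.0),
    ("s1", "a1", 1.0),
]


def mean_rewards(samples: List[Sample]) -> Dict[Tuple[str, str], float]:
    """Average the observed reward for every (state, action) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for state, action, r in samples:
        totals[(state, action)] += r
        counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def greedy_policy(samples: List[Sample]) -> Dict[str, str]:
    """For each state, choose the action with the highest mean reward."""
    means = mean_rewards(samples)
    best: Dict[str, str] = {}
    for (state, action), mean in means.items():
        if state not in best or mean > means[(state, best[state])]:
            best[state] = action
    return best


policy = greedy_policy(samples)
```

With these samples the policy prefers `a2` in `s1` and `a1` in `s2`, since those pairs have the highest average reward.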

Comparison to Supervised Learning

Compared to supervised learning, reinforcement learning differs as follows:

Feedback is less direct. In supervised learning we know the exact contribution of each weight to the error and can update the parameters accordingly; in reinforcement learning the reward still guides learning, but less precisely.
Maximising rewards instead of minimising error. Note: We can set \( R = 1 - E \)
Reinforcement learning can be considered the extreme of supervised learning where there is only 1 binary output.
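The note on \( R = 1 - E \) can be checked with a small sketch. The parameter names and error values below are made up, and the errors are assumed to lie in \( [0, 1] \) so that the reward does too. The setting with the lowest error is also the one with the highest reward.

```python
from typing import Dict

# Hypothetical errors for three candidate parameter settings, assumed in [0, 1].
errors: Dict[str, float] = {"theta_a": 0.40, "theta_b": 0.15, "theta_c": 0.75}


def to_reward(error: float) -> float:
    """Convert an error E into a reward R = 1 - E."""
    return 1.0 - error


rewards = {name: to_reward(error) for name, error in errors.items()}

# Minimising error and maximising reward select the same setting.
lowest_error = min(errors, key=lambda name: errors[name])
highest_reward = max(rewards, key=lambda name: rewards[name])
```

Because \( R \) is a decreasing function of \( E \), the ranking of candidates is simply reversed, so either objective leads to the same choice.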
