Reward Policy

Tags: adaptive-intelligence

Refers to the [see page 12, algorithm] we use to decide which action to take to maximise our reward in a decision problem. Policies are often defined using an adaptation of Q-values.

Exploration vs. Exploitation

The [see page 11, goal] of a policy isn't simply to maximise reward but also to explore other, possibly sub-optimal, actions in the hope of discovering better actions in the long run.

The environments of these problems are generally stochastic and complex. At the beginning of the process we do not know the Q-values and must estimate them by exploring. Ultimately, however, we want to maximise reward, which can only be done by exploiting our current knowledge.
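As a rough illustration of estimating a Q-value from experience, the sketch below keeps a running average of the rewards observed for a single action. The sample-average rule and the name `update_q` are assumptions made for this example; the note does not say which estimator is used.

```python
from typing import Tuple


def update_q(q: float, n: int, reward: float) -> Tuple[float, int]:
    """Return the updated (estimate, count) after observing one more reward."""
    n += 1
    # Incremental form of the sample average: move the estimate
    # towards the new reward by a step of 1/n.
    q += (reward - q) / n
    return q, n


# Hypothetical rewards observed for one action over four tries.
q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:
    q, n = update_q(q, n, r)
```

The estimate only improves for actions that are actually tried, which is why some exploration is needed before the estimates can be trusted.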

Exploitation is when we greedily rely solely on the actions which our history has shown to be good, even though better options may be available.
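To make the trade-off concrete, here is a minimal sketch of a two-action problem in which pure exploitation never finds the better action. Everything in it is an assumption for illustration: the fixed payouts in `REWARDS`, the helper `run`, and the use of a small random exploration probability `epsilon` as the exploring policy.

```python
import random
from typing import List

# Hypothetical deterministic payouts: action 1 is better, but we do not know that yet.
REWARDS = [0.5, 1.0]


def run(epsilon: float, steps: int = 1000, seed: int = 0) -> List[int]:
    """Return how often each action was chosen over `steps` decisions."""
    rng = random.Random(seed)
    q = [0.0, 0.0]      # Q-value estimates, initially uninformed
    counts = [0, 0]     # number of times each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any action uniformly at random.
            action = rng.randrange(len(q))
        else:
            # Exploit: pick the best-looking action (ties go to the lowest index).
            action = max(range(len(q)), key=lambda a: q[a])
        counts[action] += 1
        # Sample-average update of the chosen action's estimate.
        q[action] += (REWARDS[action] - q[action]) / counts[action]
    return counts


greedy_counts = run(epsilon=0.0)     # pure exploitation
exploring_counts = run(epsilon=0.1)  # occasional exploration
```

With `epsilon=0.0` the first action is tried once, looks good, and is then chosen forever, so the better action is never sampled. With a small amount of exploration the second action is eventually tried and its higher estimate takes over.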

The need for exploration is explained more succinctly in the greedy reward policy.

Links to this note

Epsilon-Greedy Reward Policy
Greedy Reward Policy
Optimistic-Greedy Reward Policy
Reinforcement Learning
Soft-Max Reward Policy