Soft-Max Reward Policy

Tags: adaptive-intelligence

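A soft-max policy assigns each action a probability proportional to the exponential of its Q-value, scaled by a temperature. \[ P(a) = \frac{e^{Q(s, a) / \tau}}{\sum_{b} e^{Q(s, b) / \tau}} \] Here \( \tau > 0 \) is a pseudo-temperature that scales down large \(Q\) values before they are exponentiated. A large \( \tau \) flattens the distribution towards uniform (more exploration), while a small \( \tau \) concentrates the probability on the action with the highest \(Q\) value (closer to greedy).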

The denominator is the normalising term: it sums the numerator over every action \(b\) available in state \(s\), so that each \(P(a)\) lies between 0 and 1 and the probabilities over all actions sum to 1.
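
The formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `softmax_policy` and `sample_action`, the default `tau=1.0`, and the representation of the Q-values as a plain list indexed by action are all assumptions made for the example. Subtracting the largest Q-value before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the resulting probabilities unchanged.

```python
import math
import random


def softmax_policy(q_values, tau=1.0):
    """Return the probability P(a) for each action, given its Q(s, a).

    q_values: list of Q-values for one state, indexed by action (assumed layout).
    tau: pseudo-temperature, must be positive.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shifting by the max Q-value cancels out in the ratio, so the
    # probabilities are unchanged, but exp() can no longer overflow.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)  # the normalising term (denominator)
    return [w / total for w in weights]


def sample_action(q_values, tau=1.0, rng=random):
    """Draw one action index at random according to the soft-max probabilities."""
    probs = softmax_policy(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Calling `softmax_policy([1.0, 2.0, 3.0], tau=1.0)` gives three probabilities that sum to 1 and increase with the Q-value. Raising `tau` pushes them towards one third each, and lowering it pushes almost all of the probability onto the last action.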