Optimistic-Greedy Reward Policy
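A policy that starts with very high (optimistic) Q-values and always chooses the action with the highest Q-value. Q-values are normally initialized to 0 and increased as the agent learns. Here every action initially looks better than it really is, so the agent takes the apparently best path first. The rewards it observes then pull that action's Q-value down, and the greedy choice shifts to actions that have been tried less often, which pushes the agent to explore in search of a better path.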
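The idea can be illustrated on a small multi-armed bandit. The sketch below is not taken from the source: the function name `optimistic_greedy_bandit`, the parameters `initial_q`, `alpha` and `noise`, and the Gaussian reward model are all assumptions chosen for illustration. It uses a constant step size so that the optimistic starting values decay gradually rather than being overwritten by the first reward.

```python
import random


def optimistic_greedy_bandit(true_means, initial_q=5.0, alpha=0.1,
                             steps=1000, noise=1.0, seed=0):
    """Run a purely greedy agent with optimistic initial Q-values.

    All names and defaults here are hypothetical, for illustration only.
    true_means: expected reward of each action (unknown to the agent).
    initial_q:  optimistic starting estimate, assumed to exceed every true mean.
    alpha:      constant step size for the Q-value update.
    noise:      standard deviation of the Gaussian reward noise.
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    q = [initial_q] * n_actions   # every action starts out looking excellent
    counts = [0] * n_actions      # how often each action was chosen

    for _ in range(steps):
        # Greedy choice: no random exploration; ties go to the lowest index.
        action = max(range(n_actions), key=lambda a: q[a])
        reward = rng.gauss(true_means[action], noise)
        counts[action] += 1
        # The reward is usually below the optimistic estimate, so this
        # update lowers q[action] and lets untried actions take the lead.
        q[action] += alpha * (reward - q[action])

    return q, counts


if __name__ == "__main__":
    q_values, pulls = optimistic_greedy_bandit([0.1, 0.5, 0.9])
    print("Q-values:", q_values)
    print("pulls per action:", pulls)
```

Although the agent never picks an action at random, every action gets tried early on, because each pull drags the chosen action's estimate below the still-untouched optimistic values of the others. If `initial_q` is set below the true means, this effect disappears and the agent can lock onto the first action it tries.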