The part that is confusing you is the Bellman approximation, which is used to update the Q-value of a state s given that an action a is taken. Q for this state s and action a equals the expected immediate reward plus the discounted long-term reward of the destination state.
We take the maximum of the Q-values (i.e., the values of the actions) at state s', the next state reached from state s by taking an action a'. The actions available when going from a state s to a state s' form a discrete, mutually exclusive set (e.g., your environment allows you to move up, left, right, or down), so the optimal action is simply the one with the highest action value.
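Written out, the update being described is the standard one-step Q-learning target, where r is the immediate reward received and gamma is the discount factor:

Q(s, a) = r + gamma * max over a' of Q(s', a')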
Take the image above as an example. The agent starts at state s0 and is able to move up, left, right, or down; these are the actions. The actions the agent takes are stochastic rather than deterministic, i.e., when the agent intends to go up there is a 0.33 chance that it instead goes to the left and a 0.33 chance that it goes to the right. I will just assign a value of 1 to gamma here.
This is how you calculate the Q-value for the state s0 and the action up, where the value of reaching each neighbouring state is the immediate reward received by the agent: V1 = 1, V2 = 2, V3 = 3, V4 = 4.

Q(s0,up) = 0.33 * V1 + 0.33 * V2 + 0.33 * V4
= 0.33 * 1 + 0.33 * 2 + 0.33 * 4
= 2.31
Next, if you calculate the Q-values for the other possible actions from s0, you get the following:
Q(s0,left) = 1.98
Q(s0,right) = 2.64
Q(s0,down) = 2.97
Therefore the final value for the state s0 is the maximum of those action values, which is 2.97. That is all you are really trying to do there in the code.
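If it helps, here is a minimal sketch of that calculation. Note that which three neighbouring states each action can land in is my reading of the picture (chosen so the numbers match the ones above), so treat the mapping as illustrative:

```javascript
// Values (immediate rewards) of the four neighbouring states, as in the example.
const V1 = 1, V2 = 2, V3 = 3, V4 = 4;

// With gamma = 1, the expected Q-value of each action is just the
// probability-weighted sum of the values of the states it can end up in.
const Q = {
  up:    0.33 * V1 + 0.33 * V2 + 0.33 * V4, // ≈ 2.31
  left:  0.33 * V1 + 0.33 * V2 + 0.33 * V3, // ≈ 1.98
  right: 0.33 * V1 + 0.33 * V3 + 0.33 * V4, // ≈ 2.64
  down:  0.33 * V2 + 0.33 * V3 + 0.33 * V4, // ≈ 2.97
};

// The value of s0 is the maximum of those action values.
const valueS0 = Math.max(...Object.values(Q)); // ≈ 2.97
console.log(Q, valueS0);
```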
As for what currentQ[action] = newQ; does: it updates the current Q-value for an action, replacing its old value with the newly computed value at the end of an episode.
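In context, that assignment usually sits right after computing the Bellman target. Here is a toy sketch of that step; apart from currentQ, action, and newQ, all names and values are made up for illustration rather than taken from your code:

```javascript
// Toy example of the update (all values here are made up for illustration).
const gamma = 0.99;                     // discount factor
const reward = 1;                       // immediate reward observed after taking `action`
const action = 2;                       // index of the action that was taken
const currentQ = [0.5, 0.1, 0.3, 0.2];  // Q-value estimates for the current state s
const nextQ    = [0.4, 0.9, 0.2, 0.1];  // Q-value estimates for the next state s'

// Bellman target: immediate reward + discounted value of the best next action.
const newQ = reward + gamma * Math.max(...nextQ);

// Only the entry for the action that was actually taken is overwritten;
// the other actions keep their old estimates.
currentQ[action] = newQ;

console.log(currentQ); // roughly [0.5, 0.1, 1.891, 0.2]
```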
One thing you have to understand about why it does this: the agent updates its Q-values after an episode, then training is done once again and the process is repeated until the agent manages to achieve its goal (for the Atari paper that this algorithm was introduced in, that goal was, I think, a mean score of 19, which is equivalent to winning 19 out of 21 games).
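As a rough sketch of that loop (the helper functions here are hypothetical stubs just to show the shape of the process, not the paper's actual code):

```javascript
// Hypothetical stubs, only to illustrate the flow described above.
function playEpisode() { /* run one episode, return the observed (s, a, r, s') transitions */ return []; }
function updateQValues(transitions) { /* apply the Bellman update to each transition */ }
function evaluateAgent() { /* return the agent's current mean score */ return 19; }

const TARGET_SCORE = 19;      // the goal mentioned above
let meanScore = -Infinity;

while (meanScore < TARGET_SCORE) {
  const transitions = playEpisode();  // 1. play an episode and collect experience
  updateQValues(transitions);         // 2. update the Q-values at the end of the episode
  meanScore = evaluateAgent();        // 3. repeat until the goal is reached
}
```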
You can read more about the entire process in the original paper.
But I think you need a better understanding of the Bellman equation first, as it is extremely important for understanding Reinforcement Learning. DeepMind has an excellent YouTube series about this that can be found here.
Even better, there is a free book on Reinforcement Learning by its founding fathers, Richard Sutton and Andrew Barto. I believe they go into detail about this in Chapter 4.
Edit:
I am not too sure what you mean by how it affects training, but I will outline the entire process so you can understand how training works: