
javascript - How does this implementation of the DQN algorithm in TensorFlow.js work?

devs,

I found a bunch of examples of DQN implementations, but because I'm no TensorFlow expert, I'm a little bit confused.

Here is one of them.

As I understand it, on line 73 we slice a batch of stored transitions of the form [{state, action, reward, newState, done}]. From it we get currentStates, which is [[s1, s2, ...]], and then on line 75 we run the model on them to get currentQs, which should be [[act1, act2, ...]] as far as I understand, because the model is used to get an action from the environment's state. The same happens with newCurrentStates and futureQs.

But then on line 88 we see let maxFutureQ = Math.max(futureQs);. What happened here? Is futureQs an array of arrays with action probabilities for each future state? Then maxFutureQ should be an action probability, so why do we add it to the reward? This part confuses me.

Also, I cannot understand why we need to do currentQ[action] = newQ; on line 94.

Please, could someone help me understand what is going on here, and maybe leave comments for those lines?

Thanks in advance.

edit:

discussed code: discussed code



1 Reply


The part that is confusing you is the Bellman approximation, which is used to update the Q-value of a state s given that an action a is taken.

Q(s, a) = E[ r + γ · max_a' Q(s', a') ]

The Q-value for this state s and action a equals the expected immediate reward plus the discounted long-term reward of the destination state.

We take the maximum over the Q-values (the values of the actions) of being at state s', the next state reached from state s, over the possible actions a'. The actions we can take when going from a state s to a state s' form a discrete, mutually exclusive set (e.g., your environment allows you to move up, left, right or down), so the optimal action is simply the one that results in the highest action value.
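That is what happens on line 88: futureQs are not action probabilities but estimated action values (expected returns) for the next state, and taking the maximum picks out the value of the best next action, which is then discounted and added to the immediate reward. A minimal, self-contained sketch of that step in plain JavaScript (the numbers are made up, and note that Math.max takes separate numbers, so an array has to be spread):

const gamma = 0.99;                        // discount factor (illustrative value)
const futureQs = [1.2, 3.4, 0.7];          // Q-values the model predicts for the next state
const reward = 1;                          // immediate reward from the transition

const maxFutureQ = Math.max(...futureQs);  // value of the best next action (3.4)
const newQ = reward + gamma * maxFutureQ;  // Bellman target: 1 + 0.99 * 3.4 = 4.366
console.log(maxFutureQ, newQ);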

[Image: a grid showing the agent at state s0 with four neighbouring states, reachable by the actions up, left, right and down, with values V1, V2, V3, V4.]

Take the image above as an example. The agent starts at state s0 and is able to move up, left, right, or down; those are the actions. The actions the agent takes are stochastic rather than deterministic, i.e., when the agent intends to go up there is a 0.33 probability that it actually goes up and a 0.33 probability each that it ends up in the state to the left or to the right. I will just set gamma to 1 here.

This is how you calculate the Q-value for state s0 and action up, with the value of entering each neighbouring state being the immediate reward received by the agent: V1 = 1, V2 = 2, V3 = 3, V4 = 4.

Q(s0,up) = 0.33 * V1 + 0.33 * V2 + 0.33 * V4
         = 0.33 * 1 + 0.33 * 2 + 0.33 * 4 
         = 2.31

Next, if you calculate Q-values for all the other possible states and their actions you would get the following:

Q(s0,left) = 1.98
Q(s0,right) = 2.64
Q(s0,down) = 2.97

Therefore the final value for state s0 is the maximum of those action values, which is 2.97. That is all the code is really doing there.
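If it helps to see it as code, here is the same toy calculation in plain JavaScript (which V belongs to which neighbour is my assumption, since the original diagram is not reproduced here):

const p = 0.33;                            // probability of each of the three possible outcomes
const V1 = 1, V2 = 2, V3 = 3, V4 = 4;      // values of the up, left, down and right neighbours (assumed layout)

const Q = {
  up:    p * V1 + p * V2 + p * V4,         // 2.31
  left:  p * V2 + p * V1 + p * V3,         // 1.98
  right: p * V4 + p * V1 + p * V3,         // 2.64
  down:  p * V3 + p * V2 + p * V4,         // 2.97
};

const valueOfS0 = Math.max(...Object.values(Q));  // 2.97, the value of the best action (down)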

As for what currentQ[action] = newQ; does on line 94: it updates the current Q-value for the action that was taken, from its old value to the new updated value, at the end of an episode.
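Concretely, the training target for a sampled transition is built by copying the Q-values the network currently predicts for state and overwriting only the entry of the action that was actually taken with the Bellman target; the other entries stay at their predicted values, since the transition gives no new information about them. A hedged sketch with made-up numbers (the names are mine, not necessarily the linked file's):

const gamma = 0.99;
const transition = { action: 1, reward: 1, done: false };  // one sampled transition
const currentQs = [0.5, 0.2, 0.1];         // model's Q-values for state    (3 actions)
const futureQs  = [0.4, 0.9, 0.3];         // model's Q-values for newState

const maxFutureQ = Math.max(...futureQs);
const newQ = transition.done
  ? transition.reward                      // terminal transition: no future reward
  : transition.reward + gamma * maxFutureQ;

const currentQ = currentQs.slice();        // copy, so the other actions keep their predictions
currentQ[transition.action] = newQ;        // only the taken action gets the Bellman target
// currentQ is then used as the label (y) for state when the model is trained.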

One thing you have to understand about why it does this: the agent updates its Q-values after an episode, training is then run again, and the process is repeated until the agent manages to reach its goal (for the Atari paper that introduced this algorithm, that goal was a mean score of, I think, 19, which is equivalent to winning 19 out of 21 games).

You can read more about the entire process from the original paper.

But I think you need more of an understanding of the Bellman equation before that, as it is extremely important for understanding Reinforcement Learning. DeepMind has an excellent YouTube series about this that can be found here.

Even better, there is a free book on Reinforcement Learning by the founding fathers of the field, Richard Sutton and Andrew Barto. I believe they go into detail about this in Chapter 4.

Edit:

I am not too sure what you mean by how it affects training, but I will outline the entire process so you can see how training works here:

[Image: flow diagram outlining the DQN training process.]
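Since the diagram does not reproduce well here, here is roughly the same loop written out in TensorFlow.js style (everything below, the names, the hyperparameters, the buffer handling, is illustrative, not the exact code from the linked example): sample a batch from the replay buffer, build Bellman targets, fit the network, and repeat every episode.

// const tf = require('@tensorflow/tfjs');  // or load TensorFlow.js via a <script> tag in the browser
const gamma = 0.99;
const replayBuffer = [];                    // stores {state, action, reward, newState, done}

async function trainOnBatch(model, batchSize) {
  // 1. Sample a random batch of stored transitions from the replay buffer.
  const batch = [];
  for (let i = 0; i < batchSize; i++) {
    batch.push(replayBuffer[Math.floor(Math.random() * replayBuffer.length)]);
  }

  // 2. Predict Q-values for the current states and for the next states.
  const states    = tf.tensor2d(batch.map(t => t.state));
  const newStates = tf.tensor2d(batch.map(t => t.newState));
  const currentQs = model.predict(states).arraySync();
  const futureQs  = model.predict(newStates).arraySync();

  // 3. Build training targets with the Bellman update, touching only the taken action.
  const targets = batch.map((t, i) => {
    const target = currentQs[i].slice();
    const maxFutureQ = Math.max(...futureQs[i]);
    target[t.action] = t.done ? t.reward : t.reward + gamma * maxFutureQ;
    return target;
  });

  // 4. Fit the network on (state -> target Q-values).
  await model.fit(states, tf.tensor2d(targets), { epochs: 1, verbose: 0 });

  states.dispose();
  newStates.dispose();
}

The outer loop (acting in the environment with an epsilon-greedy policy, pushing transitions into the buffer, decaying epsilon) is omitted here; it just keeps calling something like trainOnBatch after each episode until the agent reaches the goal described above.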

