Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

terminology - What is a policy in reinforcement learning?



The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.

For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:

  • The room is the environment
  • The robot's current position is the state
  • A policy is what the agent does to accomplish the task:

    • dumb robots just wander around randomly until they accidentally end up in the right place (policy #1)
    • others may, for some reason, learn to go along the walls most of the route (policy #2)
    • smart robots plan the route in their "head" and go straight to the goal (policy #3)
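To make the robot example concrete, here is a minimal sketch in which a policy is just a function from the current state to an action. The 5x5 room, the target cell, the four move actions, and every function name are assumptions chosen for illustration; only policy #1 (random wandering) and policy #3 (go straight to the goal) are sketched.

```python
import random

# Hypothetical 5x5 room: states are (x, y) cells and the target is one cell.
WIDTH, HEIGHT = 5, 5
TARGET = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Move one cell, staying in place if the move would leave the room."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), WIDTH - 1)
    y = min(max(state[1] + dy, 0), HEIGHT - 1)
    return (x, y)


def random_policy(state, rng):
    """Policy #1: ignore the state and pick any action at random."""
    return rng.choice(list(ACTIONS))


def direct_policy(state, rng):
    """Policy #3: head straight for the target, along x first, then y."""
    if state[0] < TARGET[0]:
        return "right"
    if state[0] > TARGET[0]:
        return "left"
    return "up" if state[1] < TARGET[1] else "down"


def steps_to_target(policy, start=(0, 0), max_steps=10_000, seed=0):
    """Run one episode and count the steps taken until the target is reached."""
    rng = random.Random(seed)
    state = start
    for n in range(max_steps):
        if state == TARGET:
            return n
        state = step(state, policy(state, rng))
    return max_steps  # gave up without reaching the target
```

Calling steps_to_target(direct_policy) and steps_to_target(random_policy) runs the same robot in the same room under two different policies. The direct policy needs exactly as many steps as the grid distance to the target, while the random one can never need fewer.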

Obviously, some policies are better than others, and there are several ways to assess them, namely the state-value function (how good it is to be in a given state under the policy) and the action-value function (how good it is to take a given action in a given state). The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, "time" is better understood as "state"):

A policy defines the learning agent's way of behaving at a given time.

Formally

More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:

  • S is a finite set of states
  • A is a finite set of actions
  • P is the state transition probability matrix (the probability of ending up in each next state, for each current state and each action)
  • R is a reward function of the current state and action
  • γ is a discount factor, between 0 and 1
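As a rough illustration, the five components of the tuple can be held in a small data structure. The class name, field names, the dictionary layout for P and R, and the two-state toy problem are all assumptions made up for this sketch, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: tuple       # S: finite set of states
    actions: tuple      # A: finite set of actions
    transitions: dict   # P: (state, action) -> {next_state: probability}
    rewards: dict       # R: (state, action) -> immediate reward
    gamma: float        # discount factor, between 0 and 1

    def validate(self):
        """Check that P is a proper distribution for every (state, action)."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.actions:
                total = sum(self.transitions[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P({s!r}, {a!r}) sums to {total}")
        return True


# Hypothetical toy problem: the robot is either "away" from or "at" the target.
toy = MDP(
    states=("away", "at"),
    actions=("stay", "move"),
    transitions={
        ("away", "stay"): {"away": 1.0},
        ("away", "move"): {"away": 0.2, "at": 0.8},
        ("at", "stay"): {"at": 1.0},
        ("at", "move"): {"away": 1.0},
    },
    rewards={
        ("away", "stay"): 0.0,
        ("away", "move"): 0.8,
        ("at", "stay"): 1.0,
        ("at", "move"): 0.0,
    },
    gamma=0.9,
)
```

Storing P as a dictionary keyed by (state, action) keeps the sketch free of dependencies. With larger state spaces an array indexed by state and action would be the more usual choice.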

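Then, a policy π is a probability distribution over actions given states: π(a | s) = P[A_t = a | S_t = s]. That is, it gives the likelihood of every action when the agent is in a particular state (of course, I'm skipping a lot of details here). A deterministic policy is the special case in which one action has probability 1 in each state. This definition corresponds to the second part of your definition.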
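A small sketch of this view: the policy is a table of action probabilities per state, and acting means sampling from the row for the current state. The state names, action names, and probabilities below are invented for illustration.

```python
import random

# Hypothetical policy table: pi[state][action] = probability of that action.
pi = {
    "away": {"stay": 0.1, "move": 0.9},
    "at": {"stay": 1.0, "move": 0.0},  # deterministic in this state
}


def action_probability(pi, state, action):
    """Return pi(a | s), the probability of `action` in `state`."""
    return pi[state][action]


def sample_action(pi, state, rng):
    """Draw one action from the distribution pi(. | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
draws = [sample_action(pi, "away", rng) for _ in range(1000)]
move_share = draws.count("move") / len(draws)  # empirical share of "move"
```

Over many draws, move_share should land near the 0.9 stored in the table, while in the "at" state the sampled action is always "stay".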

I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.

