
🧸苏小糖yb💚 · May 7, 2024

What knowledge point is being tested here?

NO.PZ2023091601000107

The question is as follows:

Suppose that there are four states and three actions, and that the current Q(S, A) values are as indicated in Table 14.2.


a. Suppose that on the next trial, Action 3 is taken in State 4 and the total subsequent reward is 1.0. If α = 0.05, what will the value of Q(4,3) be after it is updated using the Monte Carlo method?

b. Suppose that the next decision that has to be made on the trial we are considering turns out to be when we are in State 3. Suppose further that a reward of 0.2 is earned between the two decisions. Using the temporal difference method, we would note that the value of being in State 3 is currently estimated to be 0.9. If α = 0.05, what will the value of Q(4,3) be after it is updated using temporal difference learning?

Explanation:

a. If α = 0.05, the Monte Carlo method would lead to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(1.0 − 0.8) = 0.81
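In general, the Monte Carlo update is Qnew(S, A) = Qold(S, A) + α[G − Qold(S, A)], where G is the total subsequent reward observed on the trial. A minimal Python sketch of this arithmetic (the function and variable names are illustrative, not from the reading):

    # Monte Carlo update: move Q(S, A) toward the observed total reward G.
    def mc_update(q_old, total_reward, alpha):
        return q_old + alpha * (total_reward - q_old)

    # Part a: Q(4,3) = 0.8, G = 1.0, alpha = 0.05
    print(mc_update(0.8, 1.0, 0.05))  # 0.81 (up to floating-point rounding)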

b. Suppose that when we take action A in state S, we move to state S′. We can use the current value of V(S′) to update Q(S, A) as follows:

Qnew(S, A) = Qold(S, A) + α[R + γV(S′) − Qold(S, A)]

where R is the reward earned at the next step and γ is the discount factor.

Thus, in this example (taking γ = 1), the temporal difference method would lead to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(0.2 + 0.9 − 0.8) = 0.815
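The same arithmetic as a minimal Python sketch, assuming γ = 1 as the worked answer implicitly does (names are illustrative):

    # Temporal difference update: bootstrap from the estimated value V(S')
    # of the next state instead of waiting for the trial's total reward.
    def td_update(q_old, reward, v_next, alpha, gamma=1.0):
        return q_old + alpha * (reward + gamma * v_next - q_old)

    # Part b: Q(4,3) = 0.8, R = 0.2, V(3) = 0.9, alpha = 0.05, gamma = 1
    print(td_update(0.8, 0.2, 0.9, 0.05))  # 0.815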

What is the formula? Thanks. And what knowledge point is this?

1 answer

李坏_品职助教 · May 7, 2024

Hi 从没放弃的小努力,


The knowledge point is temporal difference learning, covered in the "Reinforcement Learning" part of the basic-course handout, starting from p. 422; a worked example follows it.

You can also review the corresponding part of the video lecture (1.5× speed recommended), starting at 40:00.
----------------------------------------------
It may be hard right now, but the feeling of having worked hard is truly great. Keep it up!

Related questions


NO.PZ2023091601000107 — Does the 0.9 here need to be discounted?

2023-10-17 16:15 · 1 answer