
Cooljas · May 7, 2024

Why is 0.9 used here?

NO.PZ2023091601000107

The question is as follows:

Suppose that there are four states and three actions, and that the current Q(S, A) values are as indicated in Table 14.2.

a. Suppose that on the next trial, Action 3 is taken in State 4 and the total subsequent reward is 1.0. If α = 0.05, what will Q(4,3) be updated to using the Monte Carlo method?

b. Suppose that the next decision that has to be made on the trial we are considering turns out to be when we are in State 3. Suppose further that a reward of 0.2 is earned between the two decisions. Using the temporal difference method, we would note that the value of being in State 3 is currently estimated to be 0.9. If α = 0.05, what will Q(4,3) be updated to using temporal difference learning?

Options:

Explanation:

a. If α = 0.05, the Monte Carlo method would lead to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(1.0 − 0.8) = 0.81
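The arithmetic in part (a) can be checked with a minimal sketch. The function name `monte_carlo_update` and its parameter names are hypothetical, chosen only for illustration; the numbers are the ones given in the question.

```python
def monte_carlo_update(q_old, total_reward, alpha):
    """Move the old Q estimate a fraction alpha toward the observed total reward."""
    return q_old + alpha * (total_reward - q_old)


# Values from the question: Q(4,3) = 0.8, total subsequent reward = 1.0, alpha = 0.05
q_mc = monte_carlo_update(q_old=0.8, total_reward=1.0, alpha=0.05)
print(round(q_mc, 4))  # 0.81
```

The update only uses the total reward observed after the action, so no estimate of any later state's value enters the calculation.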

b. Suppose that when we take action A in state S we move to state S'. We can use the current value for V(S') to update as follows:

Qnew(S,A) = Qold(S,A) + α[R + γV(S') − Qold(S,A)]

where R is the reward at the next step and γ is the discount factor.

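Thus, in this example, the temporal difference method would lead to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(0.2 + 0.9 − 0.8) = 0.815. Here 0.9 is V(S'), the current estimated value of being in State 3, and the calculation applies no discounting to it (that is, it takes γ = 1).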
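The temporal difference update in part (b) can be sketched the same way. The function name `td_update` and its parameter names are hypothetical. The default `gamma=1.0` is an assumption that matches the worked arithmetic, where 0.9 enters undiscounted; the question itself does not state a discount factor.

```python
def td_update(q_old, reward, v_next, alpha, gamma=1.0):
    """One temporal difference step: Qnew = Qold + alpha * (R + gamma * V(S') - Qold).

    gamma defaults to 1.0 (no discounting), an assumption matching the worked answer.
    """
    return q_old + alpha * (reward + gamma * v_next - q_old)


# Values from the question: Q(4,3) = 0.8, R = 0.2, V(State 3) = 0.9, alpha = 0.05
q_td = td_update(q_old=0.8, reward=0.2, v_next=0.9, alpha=0.05)
print(round(q_td, 4))  # 0.815
```

Passing a `gamma` below 1 would discount the 0.9 before it is added to the reward, which is the only place where discounting would change the result.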


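1 answer

品职答疑小助手雍 · May 9, 2024

Hello,

The topic tested here is temporal difference learning, which is covered in the "Reinforcement Learning" part of the foundation course handout. It starts on p. 422 and is followed by a worked example.

You can study it together with the video lecture for this part, starting at 40:00 when played at 1.5x speed.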


Related questions

NO.PZ2023091601000107: What is the formula, and which topic does this test? Thanks.

2024-05-07 20:35 · 1 answer

NO.PZ2023091601000107: Does the 0.9 here need to be discounted?

2023-10-17 16:15 · 1 answer