
tianjuewudi's blog: TRPO in Reinforcement Learning

Posted: 2021-10-19 09:53

Video link: https://www.youtube.com/watch?v=fcSYiyvPjm4&list=PLp0tvPwd1T7AD822A9tJ-jfQnMtSKh_Rz&index=3&ab_channel=ShusenWang

The TRPO algorithm repeats two steps:

1. Approximation: construct a function $L(\theta \mid \theta_{old})$ that approximates the objective $J(\theta)$ within the trust region.
2. Maximization: within the trust region, find a new set of parameters that maximizes $L(\theta \mid \theta_{old})$. (A minimal code sketch of this two-step loop follows below.)
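
The two steps can be written as a loop: build the surrogate $L(\theta \mid \theta_{old})$ around the current policy, then maximize it subject to a KL trust-region constraint. Below is a minimal runnable sketch on a toy single-state problem; the tabular softmax policy, the fixed and known Q values, the finite-difference gradient, and the backtracking line search are all illustrative assumptions, not the exact procedure from the video.

```python
import numpy as np

Q = np.array([1.0, 2.0, 0.5])    # Q_pi(s, a) for the single state, assumed known
delta = 0.01                     # trust-region size (maximum allowed KL divergence)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def surrogate_L(theta, theta_old):
    """L(theta | theta_old) = E_{A ~ pi_old}[ pi(A;theta)/pi(A;theta_old) * Q(A) ]."""
    pi, pi_old = softmax(theta), softmax(theta_old)
    return np.sum(pi_old * (pi / pi_old) * Q)   # exact expectation over the 3 actions

def kl(theta_old, theta):
    """KL( pi_old || pi_theta ), the trust-region constraint."""
    pi, pi_old = softmax(theta), softmax(theta_old)
    return np.sum(pi_old * np.log(pi_old / pi))

def trpo_step(theta_old, lr=1.0, n_backtrack=10):
    """Step 2: maximize L within the trust region (gradient ascent + backtracking)."""
    eps = 1e-5
    base = surrogate_L(theta_old, theta_old)
    grad = np.array([(surrogate_L(theta_old + eps * np.eye(3)[i], theta_old) - base) / eps
                     for i in range(3)])        # finite-difference gradient of L
    step = lr * grad
    for _ in range(n_backtrack):                # shrink the step until KL <= delta
        theta_new = theta_old + step
        if kl(theta_old, theta_new) <= delta:
            return theta_new
        step *= 0.5
    return theta_old                            # no acceptable step found

theta = np.zeros(3)
for _ in range(50):             # Step 1 is rebuilding surrogate_L around the current theta
    theta = trpo_step(theta)
print("learned policy:", softmax(theta))   # mass should concentrate on action 1 (Q = 2.0)
```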

Approximation:

$$V_{\pi}(s) = \sum_a \pi(a|s;\theta)\, Q_{\pi}(s,a) = \sum_a \pi(a|s;\theta_{old})\,\frac{\pi(a|s;\theta)}{\pi(a|s;\theta_{old})}\, Q_{\pi}(s,a) = \mathbb{E}_{A\sim\pi(\cdot|s;\theta_{old})}\left[\frac{\pi(A|s;\theta)}{\pi(A|s;\theta_{old})}\, Q_{\pi}(s,A)\right]$$

$$J(\theta) = \mathbb{E}_S\left[V_{\pi}(S)\right] = \mathbb{E}_{S,A}\left[\frac{\pi(A|S;\theta)}{\pi(A|S;\theta_{old})}\, Q_{\pi}(S,A)\right]$$
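
The expectation above can be estimated by Monte Carlo: draw actions with the old policy $\theta_{old}$ and average the importance ratio times $Q$. Below is a minimal sketch on the same toy single-state setting; the softmax policy, the fixed Q table, and the concrete parameter values are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(theta):                        # softmax policy over 3 actions (single state)
    z = np.exp(theta - theta.max())
    return z / z.sum()

Q = np.array([1.0, 2.0, 0.5])         # Q_pi(s, a), assumed known here
theta_old = np.zeros(3)
theta_new = np.array([0.1, 0.3, -0.2])

# Sample A ~ pi(.|s; theta_old), then average the importance-weighted Q values:
# L(theta | theta_old) is approximated by the mean of pi(A;theta)/pi(A;theta_old) * Q(A).
actions = rng.choice(3, size=100_000, p=pi(theta_old))
ratios = pi(theta_new)[actions] / pi(theta_old)[actions]
L_mc = np.mean(ratios * Q[actions])

L_exact = np.sum(pi(theta_new) * Q)   # the same expectation computed exactly
print(L_mc, L_exact)                  # the two numbers should agree closely
```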
