Reinforcement Learning

Notes on RL gathered from various sources (CMPUT 365, papers, blogs, textbooks, etc.)

This page is still being edited. Some details and references are missing. Please forgive me!!

Fundamental concepts of RL

This section collects the notation (in its different variants) along with the important formulas. Derivations and further details are discussed in later sections.

Bandit

Dynamic Programming

Monte Carlo

Temporal Difference

Learning/Planning (Dyna-Q)

Function approximation

Policy optimization

\(\begin{aligned}\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb E_{\tau\sim\pi_\theta}[R(\tau)] \\ &= \nabla_\theta \int_\tau P(\tau|\pi_\theta)R(\tau)\\ &= \int_\tau \nabla_\theta P(\tau|\pi_\theta)R(\tau)\\ &= \int_\tau P(\tau|\pi_\theta)\nabla_\theta \log P(\tau|\pi_\theta)R(\tau) & \text{Log-derivative trick}\\ &= \int_\tau P(\tau|\pi_\theta)\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)R(\tau) & \text{Only }\pi\text{ depends on } \theta\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)R(\tau) \Big]\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) \Big] & \text{Reward-to-go}\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big(\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\Big) \Big] & \text{Baseline in PG}\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Phi_t \Big] & \text{General weight }\Phi_t \end{aligned}\)
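As a sanity check on the final expression, here is a minimal NumPy sketch of the single-trajectory estimator, assuming the per-step gradients of \(\log\pi_\theta(a_t|s_t)\) are already available; the names `grad_log_pi`, `rewards`, and `baseline` are illustrative, not from any particular library.

```python
# Minimal sketch of the final policy-gradient expression above, for one sampled
# trajectory. grad_log_pi[t] stands in for grad_theta log pi_theta(a_t|s_t).
import numpy as np

def reward_to_go(rewards):
    """R-hat_t = sum_{t' >= t} r_{t'} (undiscounted, as in the derivation)."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def pg_estimate(grad_log_pi, rewards, baseline=None):
    """Single-trajectory estimate of grad_theta J:
    sum_t grad log pi(a_t|s_t) * Phi_t, with Phi_t = R-hat_t - b(s_t)."""
    phi = reward_to_go(rewards)
    if baseline is not None:
        phi = phi - baseline
    # grad_log_pi: [T, dim_theta]; phi: [T] -> weighted sum over timesteps
    return (grad_log_pi * phi[:, None]).sum(axis=0)

# Toy usage with random numbers standing in for real rollout data
T, d = 5, 3
rng = np.random.default_rng(0)
g_hat = pg_estimate(rng.normal(size=(T, d)), rng.normal(size=T))
```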

Deep RL

VPG

\[\theta_{k+1} = \theta_k + \alpha_k\frac{1}{|D|}\sum_{\tau\in D}\sum_{t=0}^T\nabla_{\theta} \log \pi_\theta(a_t|s_t)\big|_{\theta_k}\hat A_t\] \[\phi_{k+1} = \arg\min_\phi \frac{1}{|D|T}\sum_{\tau\in D}\sum_{t=0}^T (V_\phi(s_t) - \hat R_t)^2\]
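A rough NumPy sketch of one iteration matching the two updates above, under the assumption that advantages \(\hat A_t\) and returns \(\hat R_t\) are already computed for a batch D of trajectories; the closed-form least-squares value fit is only a stand-in for the usual neural-network regression.

```python
# Sketch of one VPG iteration on precomputed rollout data (illustrative names).
import numpy as np

def vpg_policy_step(theta, alpha, batch_grad_log_pi, batch_adv):
    """theta_{k+1} = theta_k + alpha * mean over trajectories of
    sum_t grad log pi(a_t|s_t) * A-hat_t."""
    grads = [(g * a[:, None]).sum(axis=0)           # sum over timesteps
             for g, a in zip(batch_grad_log_pi, batch_adv)]
    return theta + alpha * np.mean(grads, axis=0)   # average over |D|

def fit_value_lstsq(states, returns):
    """phi_{k+1} = argmin_phi mean (V_phi(s_t) - R-hat_t)^2, here with a
    linear V_phi(s) = phi^T s solved in closed form by least squares."""
    phi, *_ = np.linalg.lstsq(states, returns, rcond=None)
    return phi
```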

TRPO

\[\begin{aligned} \theta_{k+1} &= \arg\max_\theta L(\theta_k, \theta) & \text{s.t. }\bar D_{KL}(\theta\|\theta_k) \le \delta\\ &= \arg\max_\theta \mathbb E_{s,a\sim\pi_{\theta_k}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)\Big] & \text{s.t. }\mathbb E_{s\sim\pi_{\theta_k}}[D_{KL}(\pi_\theta(\cdot|s)\|\pi_{\theta_k}(\cdot|s))] \le \delta\\ &\approx \arg\max_\theta g^T(\theta-\theta_k) & \text{s.t. } \frac{1}{2}(\theta-\theta_k)^TH(\theta-\theta_k) \le \delta \\ &= \theta_k + \sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g & \text{Lagrangian duality}\\ &\to \theta_k + \alpha^j\sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g & \text{Backtracking line search}\\ &\approx \theta_k + \alpha^j\sqrt{\frac{2\delta}{g^T\hat{x}}}\hat{x} & \hat x\approx H^{-1}g\text{, since computing }H^{-1}\text{ is expensive} \end{aligned}\]
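The last line avoids forming \(H^{-1}\) by solving \(Hx = g\) with conjugate gradient, using only Hessian-vector products. A small sketch, assuming `hvp(v)` returns \(Hv\) (in practice, a Hessian-vector product of the sample-average KL):

```python
# Conjugate gradient to get x-hat ~ H^{-1} g without forming H explicitly.
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at 0)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                  # x ~ H^{-1} g

# Toy check with an explicit SPD matrix standing in for H
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
x_hat = conjugate_gradient(lambda v: H @ v, g)
```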

PPO

\[\begin{aligned} L(s,a,\theta_k,\theta) & = \min \Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\ \text{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon\Big)A^{\pi_{\theta_k}}(s,a)\Big) \\ &= \min \Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\ g(\epsilon, A^{\pi_{\theta_k}}(s,a))\Big) \end{aligned}\]
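A short NumPy sketch of the clipped objective, assuming the probability ratios \(\pi_\theta(a|s)/\pi_{\theta_k}(a|s)\) and advantages are precomputed arrays:

```python
# PPO-Clip objective for a batch of (s, a) pairs; maximize this in practice.
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

# Toy usage
ratio = np.array([0.8, 1.0, 1.3])
adv = np.array([1.0, -0.5, 2.0])
L = ppo_clip_objective(ratio, adv)
```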

DDPG

\[\begin{aligned}L(\phi,D) &= \mathbb E_{s,a,r,s',d\sim D}[(Q_\phi(s,a) - (r + \gamma(1-d)Q_{\phi_{targ}}(s',\mu_{\theta_{targ}}(s'))))^2] \\ &\approx \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} (Q_\phi(s,a) - (r + \gamma(1-d)Q_{\phi_{targ}}(s',\mu_{\theta_{targ}}(s'))))^2 \end{aligned}\] \[\mathbb E_{s\sim D}[Q_\phi(s,\mu_\theta(s))] \approx \frac{1}{|B|} \sum_{s\in B}Q_\phi(s,\mu_\theta(s))\]
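A minimal sketch of the critic target, the mini-batch critic loss, and the actor objective above; `q`, `q_targ`, `mu`, and `mu_targ` are placeholders for the learned function approximators, not a real library API.

```python
# DDPG losses on a mini-batch, written against placeholder callables.
import numpy as np

def ddpg_critic_target(r, s2, d, q_targ, mu_targ, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    return r + gamma * (1.0 - d) * q_targ(s2, mu_targ(s2))

def ddpg_critic_loss(q, batch, q_targ, mu_targ, gamma=0.99):
    s, a, r, s2, d = batch
    y = ddpg_critic_target(r, s2, d, q_targ, mu_targ, gamma)
    return np.mean((q(s, a) - y) ** 2)

def ddpg_actor_objective(q, mu, s):
    """Mean Q(s, mu(s)) over the batch; the actor ascends this."""
    return np.mean(q(s, mu(s)))
```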

Twin Delayed DDPG

\[\begin{aligned} Q_{\phi_{targ}}(s',a'(s')) &= Q_{\phi_{targ}}(s', \text{clip}(\mu_{\theta_{targ}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High})) & \epsilon\sim N(0,\sigma) \end{aligned}\]
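A sketch of this target-policy smoothing step, with illustrative noise scale, clip bound, and action limits:

```python
# TD3 target action: target-policy output plus clipped Gaussian noise,
# clipped again to the valid action range (all bounds are illustrative).
import numpy as np

def td3_target_action(mu_targ_s2, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0,
                      rng=np.random.default_rng(0)):
    eps = np.clip(rng.normal(0.0, sigma, size=mu_targ_s2.shape), -c, c)
    return np.clip(mu_targ_s2 + eps, a_low, a_high)

# The TD3 target then uses the smaller of the two target critics:
# y = r + gamma * (1 - d) * min(Q1_targ(s', a'), Q2_targ(s', a'))
```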

SAC

\[\begin{aligned} Q^\pi(s,a) &= \mathbb E_{s'\sim P, a'\sim \pi}[R(s,a,s') + \gamma(Q^\pi(s',a') + \alpha H(\pi(\cdot|s')))] & \text{Entropy regularization}\\ &= \mathbb E_{s'\sim P, a'\sim \pi}[R(s,a,s') + \gamma(Q^\pi(s',a') - \alpha \log\pi(a'|s'))] \\ &\approx r + \gamma (Q^\pi(s',\tilde a') - \alpha \log\pi(\tilde a'|s')) & \tilde a'\text{ from the current policy, not the replay buffer} \end{aligned}\] \[\begin{aligned} V^\pi(s) &= \mathbb E_{a\sim\pi}[Q^\pi(s,a) -\alpha \log\pi(a|s)] \\ &\approx \mathbb E_{s\sim D,\delta\sim\mathcal N}[Q^\pi(s,\tilde a_\theta(s,\delta)) -\alpha \log\pi_\theta(\tilde a_\theta(s,\delta)|s)] & \tilde a_\theta(s,\delta) = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \delta)\\ \end{aligned}\]
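A sketch of the reparameterized, tanh-squashed action \(\tilde a_\theta(s,\delta)\) together with the change-of-variables correction needed to evaluate \(\log\pi_\theta(\tilde a_\theta(s,\delta)|s)\); `mu` and `log_sigma` stand in for a policy network's outputs.

```python
# Squashed-Gaussian sampling for SAC's reparameterization trick.
import numpy as np

def squashed_gaussian_action(mu, log_sigma, rng=np.random.default_rng(0)):
    sigma = np.exp(log_sigma)
    delta = rng.normal(size=mu.shape)              # delta ~ N(0, I)
    u = mu + sigma * delta                         # pre-squash sample
    a = np.tanh(u)                                 # squashed action
    # log N(u; mu, sigma), summed over action dimensions
    log_gauss = -0.5 * (((u - mu) / sigma) ** 2
                        + 2 * np.log(sigma) + np.log(2 * np.pi)).sum()
    # tanh change of variables: subtract sum_i log(1 - tanh(u_i)^2)
    log_pi = log_gauss - np.log(1.0 - a ** 2 + 1e-6).sum()
    return a, log_pi

a, log_pi = squashed_gaussian_action(np.zeros(2), np.zeros(2))
```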

References