Reinforcement Learning

Notes on RL gathered from various sources (CMPUT 365, papers, blogs, textbooks, etc.)

This page is still being edited. Some details and references are missing. Please forgive me!!

Fundamental concepts of RL

This section collects the notation (in its different variants) along with the important formulas. Derivations and further details are discussed in later sections.

Bandit

Dynamic Programming

Monte Carlo

Temporal Difference

Learning/Planning (Dyna-Q)

Function approximation

Policy optimization

\(\begin{aligned}\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb E_{\tau\sim\pi_\theta}[R(\tau)] \\ &= \nabla_\theta \int_\tau P(\tau|\pi_\theta)R(\tau)\\ &= \int_\tau \nabla_\theta P(\tau|\pi_\theta)R(\tau)\\ &= \int_\tau P(\tau|\pi_\theta)\nabla_\theta \log P(\tau|\pi_\theta)R(\tau) & \text{Log-derivative trick}\\ &= \int_\tau P(\tau|\pi_\theta)\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)R(\tau) & \text{Only }\pi\text{ depends on } \theta\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)R(\tau) \Big]\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) \Big] & \text{Reward-to-go}\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big(\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\Big) \Big] & \text{Baseline in PG}\\ &= \mathbb E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Phi_t \Big] & \text{General weight }\Phi_t \end{aligned}\)
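As a sanity check on the final expression, here is a minimal NumPy sketch of the single-trajectory estimator, assuming the per-step gradients of \(\log\pi_\theta(a_t|s_t)\) are already available; the names `grad_log_pi`, `rewards`, and `baseline` are illustrative, not from any particular library.

```python
# Minimal sketch of the final policy-gradient expression above, for one sampled
# trajectory. grad_log_pi[t] stands in for grad_theta log pi_theta(a_t|s_t).
import numpy as np

def reward_to_go(rewards):
    """R-hat_t = sum_{t' >= t} r_{t'} (undiscounted, as in the derivation)."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def pg_estimate(grad_log_pi, rewards, baseline=None):
    """Single-trajectory estimate of grad_theta J:
    sum_t grad log pi(a_t|s_t) * Phi_t, with Phi_t = R-hat_t - b(s_t)."""
    phi = reward_to_go(rewards)
    if baseline is not None:
        phi = phi - baseline
    # grad_log_pi: [T, dim_theta]; phi: [T] -> weighted sum over timesteps
    return (grad_log_pi * phi[:, None]).sum(axis=0)

# Toy usage with random numbers standing in for real rollout data
T, d = 5, 3
rng = np.random.default_rng(0)
g_hat = pg_estimate(rng.normal(size=(T, d)), rng.normal(size=T))
```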

Deep RL

VPG

\[\theta_{k+1} = \theta_k + \alpha_k\frac{1}{|D|}\sum_{\tau\in D}\sum_{t=0}^T\nabla_{\theta} \log \pi_\theta(a_t|s_t)\big|_{\theta_k}\hat A_t\] \[\phi_{k+1} = \arg\min_\phi \frac{1}{|D|T}\sum_{\tau\in D}\sum_{t=0}^T (V_\phi(s_t) - \hat R_t)^2\]
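A rough NumPy sketch of one iteration matching the two updates above, under the assumption that advantages \(\hat A_t\) and returns \(\hat R_t\) are already computed for a batch D of trajectories; the closed-form least-squares value fit is only a stand-in for the usual neural-network regression.

```python
# Sketch of one VPG iteration on precomputed rollout data (illustrative names).
import numpy as np

def vpg_policy_step(theta, alpha, batch_grad_log_pi, batch_adv):
    """theta_{k+1} = theta_k + alpha * mean over trajectories of
    sum_t grad log pi(a_t|s_t) * A-hat_t."""
    grads = [(g * a[:, None]).sum(axis=0)           # sum over timesteps
             for g, a in zip(batch_grad_log_pi, batch_adv)]
    return theta + alpha * np.mean(grads, axis=0)   # average over |D|

def fit_value_lstsq(states, returns):
    """phi_{k+1} = argmin_phi mean (V_phi(s_t) - R-hat_t)^2, here with a
    linear V_phi(s) = phi^T s solved in closed form by least squares."""
    phi, *_ = np.linalg.lstsq(states, returns, rcond=None)
    return phi
```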

TRPO

\[\begin{aligned} \theta_{k+1} &= \arg\max_\theta L(\theta_k, \theta) & \text{s.t. }\bar D_{KL}(\theta\|\theta_k) \le \delta\\ &= \arg\max_\theta \mathbb E_{s,a\sim\pi_{\theta_k}}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)\Big] & \text{s.t. }\mathbb E_{s\sim\pi_{\theta_k}}[D_{KL}(\pi_\theta(\cdot|s)\|\pi_{\theta_k}(\cdot|s))] \le \delta\\ &\approx \arg\max_\theta g^T(\theta-\theta_k) & \text{s.t. } \frac{1}{2}(\theta-\theta_k)^TH(\theta-\theta_k) \le \delta \\ &= \theta_k + \sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g & \text{Lagrangian duality}\\ &\to \theta_k + \alpha^j\sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g & \text{Backtracking line search}\\ &\approx \theta_k + \alpha^j\sqrt{\frac{2\delta}{g^T\hat{x}}}\hat{x} & \hat x\approx H^{-1}g\text{, since computing }H^{-1}\text{ is expensive} \end{aligned}\]
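The last line avoids forming \(H^{-1}\) by solving \(Hx = g\) with conjugate gradient, using only Hessian-vector products. A small sketch, assuming `hvp(v)` returns \(Hv\) (in practice, a Hessian-vector product of the sample-average KL):

```python
# Conjugate gradient to get x-hat ~ H^{-1} g without forming H explicitly.
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at 0)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                  # x ~ H^{-1} g

# Toy check with an explicit SPD matrix standing in for H
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
x_hat = conjugate_gradient(lambda v: H @ v, g)
```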

PPO

\[\begin{aligned} L(s,a,\theta_k,\theta) & = \min \Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\ \text{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon\Big)A^{\pi_{\theta_k}}(s,a)\Big) \\ &= \min \Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\ g(\epsilon, A^{\pi_{\theta_k}}(s,a))\Big) \end{aligned}\]
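A short NumPy sketch of the clipped objective, assuming the probability ratios \(\pi_\theta(a|s)/\pi_{\theta_k}(a|s)\) and advantages are precomputed arrays:

```python
# PPO-Clip objective for a batch of (s, a) pairs; maximize this in practice.
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

# Toy usage
ratio = np.array([0.8, 1.0, 1.3])
adv = np.array([1.0, -0.5, 2.0])
L = ppo_clip_objective(ratio, adv)
```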

DDPG

\[\begin{aligned}L(\phi,D) &= \mathbb E_{s,a,r,s',d\sim D}[(Q_\phi(s,a) - (r + \gamma(1-d)Q_{\phi_{targ}}(s',\mu_{\theta_{targ}}(s'))))^2] \\ &\approx \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} (Q_\phi(s,a) - (r + \gamma(1-d)Q_{\phi_{targ}}(s',\mu_{\theta_{targ}}(s'))))^2 \end{aligned}\] \[\mathbb E_{s\sim D}[Q_\phi(s,\mu_\theta(s))] \approx \frac{1}{|B|} \sum_{s\in B}Q_\phi(s,\mu_\theta(s))\]
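A minimal sketch of the critic target, the mini-batch critic loss, and the actor objective above; `q`, `q_targ`, `mu`, and `mu_targ` are placeholders for the learned function approximators, not a real library API.

```python
# DDPG losses on a mini-batch, written against placeholder callables.
import numpy as np

def ddpg_critic_target(r, s2, d, q_targ, mu_targ, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    return r + gamma * (1.0 - d) * q_targ(s2, mu_targ(s2))

def ddpg_critic_loss(q, batch, q_targ, mu_targ, gamma=0.99):
    s, a, r, s2, d = batch
    y = ddpg_critic_target(r, s2, d, q_targ, mu_targ, gamma)
    return np.mean((q(s, a) - y) ** 2)

def ddpg_actor_objective(q, mu, s):
    """Mean Q(s, mu(s)) over the batch; the actor ascends this."""
    return np.mean(q(s, mu(s)))
```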

Twin Delayed DDPG

\[\begin{aligned} Q_{\phi_{targ}}(s',a'(s')) &= Q_{\phi_{targ}}(s', \text{clip}(\mu_{\theta_{targ}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High})) & \epsilon\sim N(0,\sigma) \end{aligned}\]
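A sketch of this target-policy smoothing step, with illustrative noise scale, clip bound, and action limits:

```python
# TD3 target action: target-policy output plus clipped Gaussian noise,
# clipped again to the valid action range (all bounds are illustrative).
import numpy as np

def td3_target_action(mu_targ_s2, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0,
                      rng=np.random.default_rng(0)):
    eps = np.clip(rng.normal(0.0, sigma, size=mu_targ_s2.shape), -c, c)
    return np.clip(mu_targ_s2 + eps, a_low, a_high)

# The TD3 target then uses the smaller of the two target critics:
# y = r + gamma * (1 - d) * min(Q1_targ(s', a'), Q2_targ(s', a'))
```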

SAC

\[\begin{aligned} Q^\pi(s,a) &= \mathbb E_{s'\sim P, a'\sim \pi}[R(s,a,s') + \gamma(Q^\pi(s',a') + \alpha H(\pi(\cdot|s')))] & \text{Entropy regularization}\\ &= \mathbb E_{s'\sim P, a'\sim \pi}[R(s,a,s') + \gamma(Q^\pi(s',a') - \alpha \log\pi(a'|s'))] \\ &\approx r + \gamma (Q^\pi(s',\tilde a') - \alpha \log\pi(\tilde a'|s')) & \tilde a'\text{ from the current policy, not the replay buffer} \end{aligned}\] \[\begin{aligned} V^\pi(s) &= \mathbb E_{a\sim\pi}[Q^\pi(s,a) -\alpha \log\pi(a|s)] \\ &\approx \mathbb E_{s\sim D,\delta\sim\mathcal N}[Q^\pi(s,\tilde a_\theta(s,\delta)) -\alpha \log\pi_\theta(\tilde a_\theta(s,\delta)|s)] & \tilde a_\theta(s,\delta) = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \delta)\\ \end{aligned}\]
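A sketch of the reparameterized, tanh-squashed action \(\tilde a_\theta(s,\delta)\) together with the change-of-variables correction needed to evaluate \(\log\pi_\theta(\tilde a_\theta(s,\delta)|s)\); `mu` and `log_sigma` stand in for a policy network's outputs.

```python
# Squashed-Gaussian sampling for SAC's reparameterization trick.
import numpy as np

def squashed_gaussian_action(mu, log_sigma, rng=np.random.default_rng(0)):
    sigma = np.exp(log_sigma)
    delta = rng.normal(size=mu.shape)              # delta ~ N(0, I)
    u = mu + sigma * delta                         # pre-squash sample
    a = np.tanh(u)                                 # squashed action
    # log N(u; mu, sigma), summed over action dimensions
    log_gauss = -0.5 * (((u - mu) / sigma) ** 2
                        + 2 * np.log(sigma) + np.log(2 * np.pi)).sum()
    # tanh change of variables: subtract sum_i log(1 - tanh(u_i)^2)
    log_pi = log_gauss - np.log(1.0 - a ** 2 + 1e-6).sum()
    return a, log_pi

a, log_pi = squashed_gaussian_action(np.zeros(2), np.zeros(2))
```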

References