Remarks on RL from various sources (CMPUT 365, papers, blogs, textbooks, etc.)
This page is still being edited. Some details and references are missing. Please forgive me!!
Fundamental concepts of RL
This section includes notations (different versions) as well as important formulas. Derivations and specific details will be discussed in later sections.
Remark: any symbol with a subscript $t$ or written as a capital letter denotes a random variable
Reinforcement Learning vs Optimal Control vs Operational Research vs Economics:
Optimal Control: Control theory
Operational Research: Stochastic shortest path
Economics: Sequential decision making under uncertainty
Agent: receives $o_t$ and $r_t$, emits $a_t$
Environment: what the agent lives in and interacts with; receives $a_t$, emits $o_{t+1}$ and $r_{t+1}$
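A minimal sketch of this interaction loop, assuming the Gymnasium API (`reset`/`step`) and using a random action as a stand-in for the agent:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")              # any environment exposing the Gym API
obs, info = env.reset(seed=0)              # first observation o_0

for t in range(1000):
    action = env.action_space.sample()     # stand-in for the agent's a_t
    obs, reward, terminated, truncated, info = env.step(action)  # env emits o_{t+1}, r_{t+1}
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```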
To enforce the Markov and stationarity properties, we can expand the state with extra components (e.g., recent history) until the properties are satisfied. However, this solution also increases the computational complexity.
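As an illustration (my own toy sketch, not from the sources), stacking the last $k$ observations gives an augmented state that carries the history needed for the Markov property:

```python
from collections import deque
import numpy as np

class StackedState:
    """Augment the state with the last k observations (assumed to be numpy arrays)."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):                     # pad with the first observation
            self.frames.append(obs)
        return np.concatenate(list(self.frames))

    def step(self, obs):
        self.frames.append(obs)                     # oldest frame is dropped automatically
        return np.concatenate(list(self.frames))
```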
Remark: we do not need to care about which transition tuples we observe or how the actions were selected, because the Bellman equation must hold for all possible transitions
Occupancy measure: stationary distribution over the $\mathcal S \times \mathcal A$ (or $\mathcal S$) space induced by running policy $\pi$ in the environment.
Lemma: two policies $\pi$ and $\pi'$ are equivalent iff $V^\pi(s) = V^{\pi'}(s), \forall s \in \mathcal S$, or equivalently $\rho^\pi(s,a) = \rho^{\pi'}(s,a), \forall s\in\mathcal S, a\in\mathcal A$
Lemma: For all $\pi$, there always exists a stationary policy $\pi'$ with the same occupancy measure.
Corollary: it suffices to consider stationary policies.
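A rough Monte Carlo estimate of the (discounted) occupancy measure from rollouts, assuming tabular (hashable) states and actions and a Gym-style environment; the interfaces and normalization are my own illustration:

```python
from collections import defaultdict

def estimate_occupancy(env, policy, gamma=0.99, n_episodes=1000, horizon=200):
    """Empirical discounted visitation frequency of (s, a) pairs under policy pi."""
    counts = defaultdict(float)
    for _ in range(n_episodes):
        s, _ = env.reset()
        for t in range(horizon):
            a = policy(s)
            counts[(s, a)] += gamma ** t            # discounted weighting (normalized below)
            s, r, terminated, truncated, _ = env.step(a)
            if terminated or truncated:
                break
    total = sum(counts.values())
    return {sa: c / total for sa, c in counts.items()}   # distribution over S x A
```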
Definition (advantage): how much better an action is than the others, $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
Remark: it is easier to learn that the consequence of one action is better than another's than to learn the actual return of each action
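A tabular sketch of this quantity with numpy (the array shapes are my assumption):

```python
import numpy as np

def advantage(Q, pi):
    """Q: |S| x |A| action values, pi: |S| x |A| policy probabilities."""
    V = (pi * Q).sum(axis=1, keepdims=True)   # V^pi(s) = sum_a pi(a|s) Q^pi(s,a)
    return Q - V                              # A^pi(s,a) = Q^pi(s,a) - V^pi(s)
```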
Models
Model-based: learn $P(\cdot|s_t, a_t)$ and $r$; sample efficient, enables planning; however, sometimes no ground-truth model exists, model bias leads to suboptimal performance, and model learning is still challenging
Pure planning: use model predictive control to select actions, no explicit policy; e.g., MBMF
Expert iteration: learn an explicit representation of $\pi_\theta(a|s)$, use planning (e.g., MCTS) to generate actions; e.g., AlphaZero
Data augmentation: use fictitious and/or real experience to learn $\pi(\cdot|s)$ or $Q^\pi$; e.g., MBVE, World Models
Embedding planning loops: embed planning into the policy as a subroutine; e.g., I2A
Model-free: easier to implement and tune, but not sample efficient
Policy optimization: learn an approximator $\pi_\theta(a|s)$ for $\pi(a|s)$ and directly or indirectly optimize $J(\pi_\theta)$; on-policy (uses data from the most recent version of the policy); usually also learns $V_\phi(s)$ to approximate $V^\pi(s)$; e.g., A2C, A3C, PPO
Q-Learning: learn an approximator $Q_\theta(s,a)$ for $Q^*$; off-policy (can reuse data collected at any point during training); e.g., DQN, C51
Trade-off: direct vs indirect optimization of the objective; stability vs sample efficiency.
Interpolation: under some circumstances the two approaches are equivalent; algorithms that live in between: DDPG, SAC
Prediction vs Control
Prediction:
Input: MDP and $\pi$, or MRP
Output: $V^{\pi}$ or $Q^{\pi}$
Control:
Input: MDP
Output: $V^{*}$ or $Q^{*}$
Bandit
Dynamic Programming
Optimal substructure + Overlapping subproblems
Algorithms: policy iteration, value iteration
Synchronous vs Asynchronous (in-place, prioritised sweeping, real-time)
Properties: full-width backups, effective for medium-sized problems (millions of states)
Policy improvement: consider two policies $\pi(a|s)$ and $\pi'(a|s)$. If $\forall s \in \mathcal S$ we have $Q^\pi(s,\pi') \ge V^\pi(s)$, where $Q^\pi(s,\pi') = \mathbb E_{a\sim\pi'(\cdot|s)}[Q^\pi(s,a)]$, then $V^{\pi'}(s) \ge V^\pi(s), \forall s\in\mathcal S$, i.e., $\pi'$ is at least as good as $\pi$.
Theorem: policy iteration generates a sequence of policies with non-decreasing value functions. When the MDP is finite, convergence occurs in a finite number of iterations.
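A tabular policy iteration sketch for a known finite MDP; the transition tensor `P[s, a, s']` and reward matrix `R[s, a]` are my own conventions for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: |S| x |A| x |S| transition probabilities, R: |S| x |A| expected rewards."""
    nS, nA, _ = P.shape
    pi = np.zeros(nS, dtype=int)                       # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(nS), pi]                    # |S| x |S|
        R_pi = R[np.arange(nS), pi]                    # |S|
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^pi
        Q = R + gamma * P @ V                          # |S| x |A|
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                 # no change -> optimal (finite MDP)
            return pi, V
        pi = new_pi
```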
Remark: $\mathcal T^\pi$ has the same contraction property as $\mathcal T$
Principle of Optimality: a policy $\pi$ achieves the optimal value from state $s$, $V^\pi(s) = V^*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from $s'$, i.e., $V^\pi(s') = V^*(s')$
Value iteration: inspired by principle of optimality
For all $U, V\in\mathbb R^{|\mathcal S|}$ and discount factor $\gamma \in[0,1)$, $\mathcal T$ is a $\gamma$-contraction in the max norm, \(\|\mathcal T U-\mathcal T V \|_{\infty} \le \gamma \| U - V \|_{\infty}\)
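A value iteration sketch using the same `P`, `R` conventions as above; each sweep applies the Bellman optimality operator $\mathcal T$, and the contraction property makes the sup-norm stopping criterion sound:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: |S| x |A| x |S| transition probabilities, R: |S| x |A| expected rewards."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    while True:
        TV = (R + gamma * P @ V).max(axis=1)       # one application of T
        if np.max(np.abs(TV - V)) < tol:           # ||TV - V||_inf small -> near fixed point
            break
        V = TV
    pi = (R + gamma * P @ V).argmax(axis=1)        # greedy policy w.r.t. the final V
    return pi, V
```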
Another version of the policy gradient theorem for discrete state-action spaces: $\nabla_\theta J(\pi_\theta) = \sum_s \rho^\pi(s)\sum_a\nabla_\theta \pi_\theta(a|s) Q^\pi(s,a)$
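A minimal REINFORCE-style sketch of a loss whose gradient estimates this expression, with sampled returns standing in for $Q^\pi$ (PyTorch, my own illustration):

```python
import torch

def policy_gradient_loss(logp, returns):
    """logp: log pi_theta(a_t|s_t) for sampled actions; returns: estimates of Q^pi(s_t, a_t).
    Minimizing this loss ascends sum_s rho^pi(s) sum_a grad pi_theta(a|s) Q^pi(s,a)."""
    return -(logp * returns).mean()

# usage sketch: loss = policy_gradient_loss(logp, returns); loss.backward(); optimizer.step()
```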
PPO: similar idea to TRPO, but with a relaxed (clipped) objective; the probability ratio between the new and old policy is clipped to $[1-\epsilon, 1+\epsilon]$ \(\theta_{k+1} = \arg\max_\theta \mathbb E_{s,a \sim\pi_{\theta_k}}[L(s,a,\theta_k,\theta)]\)
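A sketch of the clipped surrogate $L(s,a,\theta_k,\theta)$ inside this $\arg\max$, written as a loss to minimize (variable names are mine):

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """Negative clipped surrogate objective."""
    ratio = torch.exp(logp - logp_old)                       # pi_theta(a|s) / pi_theta_k(a|s)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv     # ratio restricted to [1-eps, 1+eps]
    return -torch.min(ratio * adv, clipped).mean()
```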
Replay Buffer: should be large enough for stability, but cannot contain everything; trade-off between keeping only recent data (overfitting) and too much old data (slow learning)
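A minimal fixed-size replay buffer sketch with uniform sampling; the capacity is where the recent-vs-old trade-off shows up:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are overwritten first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))                   # (states, actions, rewards, next_states, dones)
```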
Target Networks: if the target is parameterized by the same $\phi, \theta$ as the main networks $\rightarrow$ unstable training; instead parameterize it as $\phi_{targ}, \theta_{targ}$, either a copy of the main networks made every fixed number of steps or updated by Polyak averaging: $\phi_{targ} \leftarrow \rho\phi_{targ} + (1-\rho)\phi$, $\theta_{targ} \leftarrow \rho\theta_{targ} + (1-\rho)\theta$
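The Polyak update, sketched with PyTorch parameters (assuming `net` and `net_targ` share the same architecture):

```python
import torch

@torch.no_grad()
def polyak_update(net, net_targ, rho=0.995):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta, applied parameter-wise."""
    for p, p_targ in zip(net.parameters(), net_targ.parameters()):
        p_targ.mul_(rho)
        p_targ.add_((1 - rho) * p)
```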
Entropy-regularized RL: entropy measures how random a random variable is, $H(P) = \mathbb E_{x\sim P}[-\log P(x)]$; the agent receives extra reward for high policy entropy (encourages exploration)
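A sketch of the entropy of a categorical policy and how it enters the objective as a bonus (the temperature `alpha` is the usual trade-off coefficient):

```python
import torch

def categorical_entropy(probs, eps=1e-8):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), computed per state in the batch."""
    return -(probs * torch.log(probs + eps)).sum(dim=-1)

# usage sketch: loss = policy_loss - alpha * categorical_entropy(probs).mean()
```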
SAC, similar to TD3: Q-functions learned by minimizing the MSBE, target Q-networks (Polyak averaging), clipped double-Q
Different from TD3: entropy regularization; a stochastic policy replaces target policy smoothing; the next action comes from the current policy instead of a target policy (while next states still come from the replay buffer)
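A sketch of the resulting SAC-style Bellman target under these choices, assuming two target Q-networks and a stochastic policy `pi(s)` returning a sampled action and its log-probability (names are mine):

```python
import torch

@torch.no_grad()
def sac_target(r, s_next, done, pi, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    """y = r + gamma * (1 - done) * (min_i Q_targ_i(s', a') - alpha * log pi(a'|s')),
    with a' sampled from the current policy at s'; s' itself comes from the replay buffer."""
    a_next, logp_next = pi(s_next)                       # current policy, not a target policy
    q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1 - done) * (q_next - alpha * logp_next)
```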