data science · reinforcement learning

Policy Gradient Learning (ongoing)

Policy Gradient Algorithms, REINFORCE, Actor-Critic, A2C, A3C, PPO.
2/4/2026
5 minutes

Policy Gradient Learning is a family of reinforcement learning algorithms that directly learns a policy mapping states to action probabilities. The learned policy approximates the optimal policy, outputting how probable it is that the optimal policy would choose each action. We then act either greedily or by sampling from it.

Another family of reinforcement learning algorithms is Value Based methods such as Q-Learning. These algorithms try to learn how good it is to be in a given state, or how good it is to take a certain action in a given state. Given this, they choose the action with the highest expected return.

REINFORCE

REINFORCE is the simplest Policy Gradient based algorithm.

We have an objective function, the Expected Cumulative Discounted Return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t = 0}^{T} \gamma^t r_t \right]$$

with $\tau$ being a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ sampled from the policy $\pi_{\theta}$.

We want to maximize our objective function $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [R(\tau)]$ with $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ being the total discounted return of trajectory $\tau$.

We can write this as:

$$J(\theta) = \int p(\tau | \theta)\, R(\tau)\, d\tau$$

Now we need to find each individual term. The probability of the trajectory $\tau$ under policy $\pi_{\theta}$ is:

$$p(\tau | \theta) = p(s_0) \prod_{t=0}^{T} \pi_{\theta}(a_t | s_t)\, p(s_{t+1} | s_t, a_t)$$

That is, the probability of the initial state $s_0$, times, at each step, the probability that the policy picks action $a_t$ and the probability that the environment transitions to $s_{t+1}$ after that action.

We now try to take the gradient of $J(\theta)$:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int p(\tau | \theta)\, R(\tau)\, d\tau && \text{definition of } J(\theta) \text{ as an expectation}\\
&= \int \nabla_\theta p(\tau | \theta)\, R(\tau)\, d\tau && \text{move the gradient inside the integral (Leibniz rule)}\\
&= \int p(\tau | \theta)\, \nabla_\theta \log p(\tau | \theta)\, R(\tau)\, d\tau && \text{log-derivative trick: } \nabla_\theta p = p\, \nabla_\theta \log p\\
&= \int p(\tau | \theta) \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \right] R(\tau)\, d\tau && \text{expand } \nabla_\theta \log p(\tau | \theta) \text{, drop terms independent of } \theta
\end{aligned}$$

$J(\theta)$ will be used as our loss function (negated so we can minimize it), so we need to be able to take its gradient. But the first form above is intractable: it integrates over all possible trajectories, and it requires knowing $p(\tau | \theta)$, which in turn requires the environment dynamics $p(s_{t+1} | s_t, a_t)$ that we usually don't have. Even if we knew all of that, computing the integral would be far too expensive and training would be too slow.

The solution to this is using the log derivative trick.

$$\nabla_\theta J(\theta) = \int \nabla_\theta p(\tau | \theta)\, R(\tau)\, d\tau \tag{*}$$

And we have that:

$$\nabla_\theta p(\tau | \theta) = p(\tau | \theta)\, \nabla_\theta \log p(\tau | \theta)$$

Substituting back into (*) gives us:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \int p(\tau | \theta)\, \nabla_\theta \log p(\tau | \theta)\, R(\tau)\, d\tau\\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p(\tau | \theta)\, R(\tau) \right]
\end{aligned}$$

Back to $p(\tau | \theta)$: using the definition we gave earlier and taking the log, we get:

$$\log p(\tau | \theta) = \log p(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t | s_t) + \log p(s_{t+1} | s_t, a_t) \right]$$

Now taking the gradient gives:

$$\nabla_\theta \log p(\tau | \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t)$$

The initial state and environment terms both vanish as they don't depend on $\theta$:

  • $\nabla_\theta \log p(s_0) = 0$
  • $\nabla_\theta \log p(s_{t+1} | s_t, a_t) = 0$

Now we substitute back into (*) again to get:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \right) R(\tau) \right]\\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t)\, R(\tau) \right]
\end{aligned}$$

This new form of (*) is now tractable: it no longer involves the environment dynamics, and the expectation can be estimated by sampling trajectories.

Now we take $L(\theta) = -J(\theta)$ as a loss function and minimize it using gradient descent. In practice we approximate the expectation with a single sampled trajectory (or a small batch):

$$L(\theta) = -\sum_{t=0}^{T} \log \pi_\theta (a_t|s_t) \cdot G_t$$

with $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ being the return from timestep $t$ onward.

The default REINFORCE algorithm uses $N = 1$ trajectory. This results in unbiased estimates of the gradient, but with high variance. More sophisticated variants reduce the variance by using $N > 1$ trajectories to estimate the gradient (to replace the expectation).
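As a concrete sketch, the single-trajectory loss above can be computed like this (plain numpy; the trajectory's log-probabilities are assumed to come from whatever policy network you use, and `reinforce_loss` is just an illustrative helper name):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{k>=t} gamma^(k-t) r_k by iterating backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """L(theta) = -sum_t log pi(a_t|s_t) * G_t for one sampled trajectory."""
    returns = discounted_returns(np.asarray(rewards, dtype=float), gamma)
    return -np.sum(np.asarray(log_probs) * returns)

# Example: rewards [1, 2, 3] with gamma = 0.5 give G = [2.75, 3.5, 3.0].
loss = reinforce_loss([-1.0, -1.0, -1.0], [1.0, 2.0, 3.0], gamma=0.5)
```

The backward recursion avoids recomputing the discounted sum from scratch at every timestep.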

High Variance

The problem with REINFORCE is its very high variance: the returns $G_t$ can vary wildly. At one step we might get lucky and collect a +100 reward, and at the next get -50, even from the same state and the same action. We get different outcomes because of environment randomness.

This variance makes learning slow and unstable.

To mitigate this, we judge an action as good or bad relative to the average return we typically get from that same state.

  • $G_t = +10$ in isolation: we can't tell anything, because we don't know what return is typical from this state
  • $G_t = +10$ when the average is +100: this is bad
  • $G_t = +10$ when the average is -20: this is very good

So we want to know whether this action is better or worse than the typical action in this state.

To achieve this we define a new policy gradient / loss function:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \underbrace{(G_t - V(s_t))}_{A_t} \right]$$

With $V(s_t)$, the value function, being an estimate of the expected return from state $s_t$. This value will be learned by another neural network.

This doesn't affect our gradient and still keeps it unbiased because:

$$\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot V(s_t) \right] = 0$$

The new term we added will just vanish.

The goal of the value network is to predict the total expected cumulative discounted reward starting from the current state. At the end of each episode we have a ground-truth target for every visited state: the observed return $G_t$.

One way to learn $V(s)$ is in a separate phase: collect episodes with some fixed policy and fit $V(s)$ to the observed returns in a supervised learning way. But this isn't a good approach, because the learned $V(s)$ becomes outdated as the policy improves.

The better option is to jointly train $V(s)$. We already know the loss for the policy network. The loss for the value network is $L_v = (G_t - V_\phi(s_t))^2$.

So at the end of each episode, we predict the values $V(s_t)$ using the current value network and update it with the loss (`mse_loss(values, returns)`). Then, using the predicted values and the returns $G_t$, we compute the advantages $A_t = G_t - V(s_t)$, which feed into the policy network's loss.
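This end-of-episode bookkeeping can be sketched in a few lines (numpy stand-ins for the actual networks; `advantages_and_value_loss` is just an illustrative helper name):

```python
import numpy as np

def advantages_and_value_loss(returns, values):
    """Given the observed returns G_t and the critic's predictions V(s_t),
    compute the advantages A_t = G_t - V(s_t) for the policy loss and the
    mean-squared error used to train the value network."""
    returns = np.asarray(returns, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = returns - values                  # A_t = G_t - V(s_t)
    value_loss = np.mean((returns - values) ** 2)  # mse_loss(values, returns)
    return advantages, value_loss

# A state the critic undervalues gets a positive advantage.
adv, v_loss = advantages_and_value_loss([2.0, 3.0], [1.0, 1.0])
```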

This reduces the variance. Without the baseline, the returns might span, say, $G_t \in [-100, +100]$, so the variance of $G_t$ is large. With the advantage "trick", the same situations might give $A_t = G_t - V(s_t) \in [-10, +10]$, a much smaller spread (the ranges here are just illustrative).

The value function centers the signal around zero: good actions get positive advantage, bad actions get negative advantage, regardless of whether the state itself is good or bad.


Episodic Case

A task is episodic when it has a defined end (a terminal state).

The objective function, the total expected discounted return, can also be written as:

$$\begin{aligned}
J(\theta) &= v_{\pi_\theta}(s_0)\\
&= \mathbb{E}_{\tau \sim \pi_\theta} [G_0 \mid S_0 = s_0]
\end{aligned}$$

As said earlier, in practice we aren't using $\nabla_\theta J(\theta)$ directly. We are instead using an unbiased estimate $\hat{\nabla}_\theta J(\theta)$ (the estimate's expectation is equal to the actual gradient).

We use an estimate because that's all we have access to. We can't get the true gradient, as that would require gathering every possible trajectory / episode. We only use batches, just like in supervised learning.


Actor Critic

An important distinction to make is between $G_t$, which is the return, the actual ground truth: the sum of discounted rewards you collected from timestep $t$ until the end of the episode in one specific trajectory, $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T$.

$G_t$ is a random variable. It changes every time you run an episode, even if you start from the exact same $s_t$ and take the exact same first action, because the environment might be stochastic (the same action from the same state doesn't necessarily lead to the same outcome).

And $V(s_t)$, which is the value function, the prediction: $V_\pi(s_t) = \mathbb{E}[G_t \mid s_t]$. It is the expected (average) return if you start from state $s_t$ and follow policy $\pi$. Once the policy is fixed it is a deterministic number: the theoretical average over all possible futures. It doesn't change between episodes because it's an expectation, not a specific outcome.

We already defined the critic above. To extend this: there are other ways to update the critic besides waiting for the entire episode to end. We can update the critic after seeing just one step.

To do that we can use bootstrapping (TD learning) just like we do with DQN:

$$V(s_t) = \mathbb{E}[r_t + \gamma V(s_{t+1})]$$

In plain English this says: "the value of being here at state $s_t$ equals the immediate reward $r_t$ I get right now, plus the discounted value of wherever I end up next".

So instead of waiting for the entire episode / future (which gives us $G_t$), we can estimate the future using our current guess of $V(s_{t+1})$.

Bootstrapping means using your own current prediction to update yourself.

Without bootstrapping, updating $V(s_t)$ requires unrolling an entire episode:

  1. Start at $s_t$
  2. Play out the entire episode to the end
  3. Calculate $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$
  4. Update: $V(s_t) \leftarrow V(s_t) + \alpha (G_t - V(s_t))$

With Bootstrapping (TD Learning):

  1. Start at $s_t$
  2. Take one action $a_t$, get a reward $r_t$ and observe the next state $s_{t+1}$
  3. Look up your current estimate $V(s_{t+1})$ (even though it's probably wrong)
  4. Calculate the target: $T_t = r_t + \gamma V(s_{t+1})$
  5. Update immediately: $V(s_t) \leftarrow V(s_t) + \alpha (T_t - V(s_t))$

The target is an approximation of $G_t$ that we are trying to make our critic network learn. We compute it by taking the immediate reward and adding the discounted average return from the next state.

$$\begin{aligned}
G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T\\
&= r_t + \gamma (r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T)\\
&= r_t + \gamma G_{t+1}
\end{aligned}$$

$$\begin{aligned}
\text{Error} &= \text{Target} - V(s_t)\\
&= \underbrace{(r_t + \gamma V(s_{t+1}))}_{\text{target (estimated ground truth)}} - \underbrace{V(s_t)}_{\text{prediction}}
\end{aligned}$$

This represents the error / surprise: how wrong our prediction was. Even though the target mostly uses the model's own output, it contains real signal from the environment, the reward $r_t$, so with time the estimates converge.
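The tabular version of this update is small enough to show in full. The chain environment, rewards and learning rate below are made up for illustration: states 0 and 1, reward 0 then 1, $\gamma = 1$, so the true values are $V(0) = V(1) = 1$.

```python
# Tabular TD(0) on a tiny deterministic chain: s0 -> s1 -> terminal.
gamma, alpha = 1.0, 0.5
V = {0: 0.0, 1: 0.0, "terminal": 0.0}
transitions = [(0, 0.0, 1), (1, 1.0, "terminal")]  # (state, reward, next state)

for _ in range(50):  # 50 episodes through the chain
    for s, r, s_next in transitions:
        target = r + gamma * V[s_next]   # bootstrap on our own current estimate
        V[s] += alpha * (target - V[s])  # move towards the TD target

# V[0] and V[1] both approach the true value 1.0, even though the update
# for state 0 never directly sees the +1 reward: it flows back through V[1].
```

Note how the reward signal propagates backwards through the bootstrapped estimates over repeated episodes.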

The two ways of updating the critic network are called:

  • Monte Carlo: when we only update at the end of the entire episode (at each step we use the critic to predict the returns and then we just compare the predicted returns with the actual ground truth returns we got at the end of the episode)
  • TD (Temporal Difference): when we update at each step

A2C (Advantage Actor-Critic)

A2C is what you end up with when you implement Actor-Critic following the math. The setup is as follows:

  • One agent exploring the environment
  • One actor network, one critic network
  • Collect a batch of experience (one step, $n$ steps, or a full episode)
  • Compute gradients, update the networks, and repeat

The Advantage part is the component added to stabilize learning and reduce the variance of $G_t$:

$$A(s_t, a_t) = \underbrace{r_t + \gamma V(s_{t+1})}_{\text{TD target}} - \underbrace{V(s_t)}_{\text{current prediction}}$$

This is called Advantage because:

  • If $A_t > 0$, the action was better than average (advantageous): we increase its probability.
  • If $A_t < 0$, the action was worse than average (disadvantageous): we decrease its probability.
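A tiny numeric sketch of the TD advantage (the reward and value estimates are invented for illustration):

```python
def td_advantage(r, v_next, v_curr, gamma=0.9):
    """A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r + gamma * v_next - v_curr

# Illustrative numbers: reward 1.0, V(s_{t+1}) = 2.0, V(s_t) = 2.5.
a = td_advantage(1.0, 2.0, 2.5)  # 1.0 + 0.9 * 2.0 - 2.5 = 0.3
# a > 0: the action did better than the critic expected,
# so its probability should be increased.
```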

A3C (Asynchronous Advantage Actor-Critic)

It was introduced by DeepMind in 2016. It solves the same problem as A2C but uses parallelism and asynchrony. The issue A3C solves is that A2C runs in a single process: only one agent explores the environment. If the environment is slow (such as in Atari games), the GPU sits idle while the CPU runs the game, so learning is bottlenecked by the environment's speed.

A3C solves this by having multiple agents exploring in parallel on different CPU threads, each with their own copy of the environment, and having them asynchronously push updates to shared global networks.

The setup is as follows:

  • One shared Actor and one shared Critic
  • N workers, parallel agents, each with:
    • Their own copy of the environment
    • Local copies of the Actor and Critic
    • Their own exploration trajectory

There is no synchronization. Multiple workers might read the global weights, compute gradients and write back at the same time. This works because the noise it generates is actually helpful: it prevents getting stuck in local minima.


PPO (Proximal Policy Optimization)

PPO is just Actor-Critic that doesn't break things. It is the same core idea but with a safety guardrail that prevents the policy from changing too dramatically in one update.

In vanilla Actor Critic we minimize:

$$\text{Loss} = -\log \pi_\theta(a_t|s_t) \cdot A_t$$

The danger is that if the gradient says to increase the probability of a specific action (e.g. going left) a lot, we might go from 50% chance left, 50% chance right to 99% chance left.

This is problematic because:

  • The action might actually be bad and we just got lucky once. Now we're locked into doing it forever because the probability is 99%.
  • Exploration will die as we might never try other actions again even though they might be better.
  • The advantage estimates become wrong. Our critic's $V(s)$ was learned under the old policy. If the policy changes drastically, all our value estimates become garbage.

We prefer to have smooth controlled updates. To achieve this, PPO asks a different question than vanilla Actor Critic.

  • Vanilla Actor-Critic asks: how can I change my policy to make this good action more likely?
  • PPO asks: how can I change my policy to make this good action more likely, but not too much more likely?

The "how much we've changed" is measured using a probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$
  • $r_t(\theta) = 1$ means the new and old policy are the same
  • $r_t(\theta) = 2$ means the new policy is twice as likely to take this action as the old one
  • $r_t(\theta) = 0.5$ means the new policy is half as likely to take this action (we're suppressing it)
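This ratio feeds PPO's clipped surrogate objective, which caps how much an update can profit from moving the ratio away from 1. A minimal sketch, using the common default $\epsilon = 0.2$ from the PPO paper:

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Per-step PPO surrogate: -min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# If the new policy is twice as likely to take an action with advantage +1,
# the clip caps the ratio at 1.2, so the extra increase earns no extra reward.
loss = ppo_clipped_loss([np.log(0.8)], [np.log(0.4)], np.array([1.0]))
```

Because of the `min`, moving the policy further than the clip range in the favourable direction gives zero gradient, which is exactly the guardrail described above.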

Additional Notes and Concepts

The Policy Gradient Loss is defined as $L(\theta) = -\mathbb{E}[\log \pi_\theta(a_t|s_t)\, \hat A_t]$ with $\hat A_t$ being the advantage: an estimate of the relative value of the selected action in the current state. $\hat A_t$ equals the discounted sum of rewards minus a baseline, obtained from a critic network or some other estimator. The baseline is usually a value-function neural network updated at each step, trained in a supervised learning way.

What minimizing this loss $L(\theta) = -\mathbb{E}[\log \pi_\theta (a_t|s_t)\, \hat A_t]$ does is: if the advantage is positive (the action was better than the average return), we increase the probability of selecting that action again when we encounter the state in the future, and we decrease it when the advantage is negative.

When using policy gradient methods (REINFORCE, A2C, PPO, etc.), which are on-policy, we cannot freely reuse the same batch: the derivation assumes the data was generated by the current version of the policy. (PPO partially relaxes this, since the probability ratio and clipping let it safely reuse a batch for a few epochs.)

Vanilla REINFORCE with large step size is very unstable. Taking a big step will dramatically change the policy and this makes the advantage estimates that got computed with the old policy unreliable and not accurate (they don't reflect the current policy).

Monte Carlo methods also have very high variance.

PPO builds on the "Trust Region Policy Optimization" (TRPO) paper.


TODOs and Future Topics

  • Add drawbacks, limitations, pros, cons

  • Add some analysis of the algo, when to use and when to not

  • Interpretations of how the loss will behave

  • What comes next, PPO, GRPO, etc.

  • Proof that $N = 1$ results in unbiased estimates of the gradient

  • Give the example from StatQuest, where you resample your own existing data to build sub-datasets and get better estimates. Bootstrapping is also used in tree-based machine learning (bagging), I believe.

  • Theoretical proof that it'll converge (TD learning convergence)

  • Soft Actor Critic

  • PPO (handles continuous action spaces apparently)

  • Rainbow DQN

  • Try to learn about some model based reinforcement learning

  • Try to learn about Reinforcement Learning from Demonstration (RLfD) / Imitation Learning + RL

  • Model based RL. Deep MPC, etc.

  • Add coding part for the REINFORCE algorithm

  • TRPO

© Raideno.