On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go

Nima H. Siboni

arXiv:2604.04686 · cs.AI · April 7, 2026


TL;DR

This paper clarifies the mathematical basis for replacing full trajectory returns with reward-to-go in policy gradient methods, emphasizing a rigorous derivation over heuristic explanations.

Contribution

It provides a clear, explicit derivation of reward-to-go from prefix trajectory distributions, clarifying the causality step in policy gradient derivations.

Findings

01 Reward-to-go arises directly once the objective is decomposed over prefix trajectories.

02 The usual causality argument is recovered as a corollary of the derivation rather than invoked as a heuristic.

03 The derivation leaves the estimator itself unchanged; the contribution is conceptual.

Abstract

In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than…
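The claim that full return and reward-to-go give the same gradient in expectation can be checked numerically on a toy problem. The sketch below is not the paper's derivation; it is an illustration under assumptions that do not come from the source: a tabular softmax policy, a finite horizon, no discounting, and a small randomly generated MDP. All names (`softmax_policy`, `score`, `exact_gradients`, `theta`, `mu`, `P`, `R`) are hypothetical. It enumerates every trajectory, so both expectations are computed exactly rather than sampled.

```python
import itertools
import numpy as np

def softmax_policy(theta):
    # theta[s, a] are logits; each row becomes a distribution over actions.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def score(theta, s, a):
    # Gradient of log pi(a | s) with respect to theta for a tabular softmax:
    # only row s is nonzero, and it equals onehot(a) - pi(. | s).
    pi = softmax_policy(theta)
    g = np.zeros_like(theta)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def exact_gradients(theta, mu, P, R, horizon):
    # Sum over all trajectories ((s_0, a_0), ..., (s_{T-1}, a_{T-1})),
    # weighting each by its exact probability under the policy and dynamics.
    n_states, n_actions = theta.shape
    pi = softmax_policy(theta)
    pairs = list(itertools.product(range(n_states), range(n_actions)))
    g_full = np.zeros_like(theta)
    g_rtg = np.zeros_like(theta)
    for traj in itertools.product(pairs, repeat=horizon):
        prob = mu[traj[0][0]]
        for t, (s, a) in enumerate(traj):
            prob *= pi[s, a]
            if t + 1 < horizon:
                prob *= P[s, a, traj[t + 1][0]]
        rewards = np.array([R[s, a] for s, a in traj])
        for t, (s, a) in enumerate(traj):
            sc = score(theta, s, a)
            g_full += prob * sc * rewards.sum()     # weight: full return
            g_rtg += prob * sc * rewards[t:].sum()  # weight: reward-to-go
    return g_full, g_rtg

# Assumed toy problem: 2 states, 2 actions, horizon 3, random parameters.
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 3
theta = rng.normal(size=(n_states, n_actions))
mu = rng.dirichlet(np.ones(n_states))                             # initial state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s_next]
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a)

g_full, g_rtg = exact_gradients(theta, mu, P, R, horizon)
print("max abs difference:", np.max(np.abs(g_full - g_rtg)))
```

The two gradients should agree up to floating-point error, because each past-reward term pairs a reward fixed by the prefix with a score whose conditional expectation given that prefix is zero. With sampled trajectories instead of full enumeration, the two estimators would still share this expectation but would differ sample by sample.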
