On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
Nima H. Siboni

TL;DR
This paper clarifies the mathematical basis for replacing full trajectory returns with reward-to-go in policy gradient methods, emphasizing a rigorous derivation over heuristic explanations.
Contribution
It provides a clear, explicit derivation of reward-to-go from prefix trajectory distributions, clarifying the causality step in policy gradient derivations.
Findings
Reward-to-go naturally emerges from the decomposition over prefix trajectories.
The usual causality argument follows as a corollary of the derivation rather than serving as a heuristic.
The estimator remains unchanged by the derivation.
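As a rough sketch of how the prefix-trajectory decomposition can yield reward-to-go, the steps might be written as follows. The notation is an assumption made for illustration and is not taken from the paper: a finite horizon T, undiscounted rewards r_t = r(s_t, a_t), a prefix \tau_{0:t} = (s_0, a_0, \dots, s_t, a_t), and dynamics and initial-state distribution that do not depend on \theta.

```latex
% Assumed setup (illustrative only): r_t depends only on the prefix \tau_{0:t},
% and only the policy \pi_\theta carries the parameters \theta.
\begin{align}
J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\left[ r_t \right], \\
\nabla_\theta J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}
     \left[ r_t \, \nabla_\theta \log p_\theta(\tau_{0:t}) \right]
     && \text{(score-function identity)} \\
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau \sim p_\theta}
     \left[ r_t \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right]
     && \text{(only policy terms depend on } \theta\text{)} \\
  &= \mathbb{E}_{\tau \sim p_\theta}
     \left[ \sum_{k=0}^{T} \nabla_\theta \log \pi_\theta(a_k \mid s_k)
            \sum_{t=k}^{T} r_t \right]
     && \text{(swap the order of summation)}
\end{align}
```

Under these assumptions, the score at step k only ever multiplies rewards with t \ge k, so no past-reward terms appear that would later need to be removed.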
Abstract
In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than…
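The claim that replacing the full return with reward-to-go leaves the expected gradient unchanged can be checked numerically on a toy problem. The sketch below is illustrative only and is not from the paper: the two-state environment, the deterministic transition rule (next state equals the chosen action), the tabular softmax policy, and the function names `reward_to_go` and `expected_gradients` are all assumptions. It enumerates every trajectory exactly, so no sampling noise is involved.

```python
import itertools
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - x.max())
    return z / z.sum()


def reward_to_go(rewards):
    """Return the suffix sums: entry k is the sum of rewards[k:]."""
    return np.cumsum(rewards[::-1])[::-1]


def expected_gradients(theta, reward_table, horizon):
    """Exact expectation of both estimators over all trajectories.

    Hypothetical toy setup: start in state 0, next state equals the
    chosen action, tabular softmax policy with logits theta[state].
    """
    n_states, n_actions = theta.shape
    grad_full = np.zeros_like(theta)
    grad_rtg = np.zeros_like(theta)
    for actions in itertools.product(range(n_actions), repeat=horizon):
        state, prob = 0, 1.0
        scores, rewards = [], []
        for a in actions:
            pi = softmax(theta[state])
            prob *= pi[a]
            # Gradient of log pi(a | state) w.r.t. the logits: onehot(a) - pi.
            score = np.zeros_like(theta)
            score[state] = -pi
            score[state, a] += 1.0
            scores.append(score)
            rewards.append(reward_table[state, a])
            state = a  # assumed deterministic toy transition
        rewards = np.array(rewards)
        full_return = rewards.sum()
        rtg = reward_to_go(rewards)
        for k in range(horizon):
            grad_full += prob * scores[k] * full_return
            grad_rtg += prob * scores[k] * rtg[k]
    return grad_full, grad_rtg


rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2))         # policy logits, one row per state
reward_table = rng.normal(size=(2, 2))  # reward for each (state, action)
g_full, g_rtg = expected_gradients(theta, reward_table, horizon=3)
print("max abs difference:", np.abs(g_full - g_rtg).max())
```

The two expectations should agree up to floating-point error, because each past-reward term is weighted by a score whose conditional expectation, given the history up to that step, is zero. On sampled trajectories the two estimators generally differ in variance even though their means coincide.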
