On Training in Imagination
Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel

TL;DR
This paper analyzes how errors in learned dynamics and reward models affect policies trained on imagined rollouts in model-based reinforcement learning. It derives an optimal allocation of samples between the two models and examines the impact of reward noise on policy optimization.
Contribution
The paper extends the analysis of Asadi et al. (2018) to MDPs with learned reward models, derives the optimal ratio of dynamics samples to reward samples, and studies the effects of reward noise on policy gradient estimates.
Findings
The optimal ratio of dynamics samples to reward samples minimizes a bound on return error under power-law scaling assumptions.
Lower Lipschitz constants of the learned dynamics, reward, and policy tighten this bound.
Reward noise affects gradient variance, influencing sample tradeoffs.
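The sample-allocation finding can be illustrated with a minimal sketch. It assumes, purely for illustration, that the return-error bound takes the form c_d * n_d^(-a) + c_r * n_r^(-b) under a fixed budget n_d + n_r. This functional form, the constants, and every function name below are hypothetical and are not taken from the paper.

```python
def bound(n_d, n_r, c_d, c_r, a, b):
    """Hypothetical power-law bound on return error (an assumed form, not the paper's)."""
    return c_d * n_d ** (-a) + c_r * n_r ** (-b)


def best_split(total, c_d, c_r, a, b, steps=10000):
    """Grid-search the fraction of the budget given to dynamics samples."""
    best_val, best_pair = None, None
    for i in range(1, steps):
        frac = i / steps
        n_d, n_r = frac * total, (1.0 - frac) * total
        val = bound(n_d, n_r, c_d, c_r, a, b)
        if best_val is None or val < best_val:
            best_val, best_pair = val, (n_d, n_r)
    return best_pair


def closed_form_ratio(c_d, c_r, a):
    """Optimal n_d / n_r for the assumed bound when both exponents equal a.

    Setting the derivatives with respect to n_d and n_r equal gives
    (n_d / n_r) ** (a + 1) = c_d / c_r.
    """
    return (c_d / c_r) ** (1.0 / (a + 1.0))


if __name__ == "__main__":
    n_d, n_r = best_split(3000.0, c_d=4.0, c_r=1.0, a=1.0, b=1.0)
    print("grid-search ratio:", n_d / n_r)
    print("closed-form ratio:", closed_form_ratio(4.0, 1.0, 1.0))
```

Under this assumed form, the model whose error term has the larger constant receives more samples, and the grid search and the closed-form ratio should agree up to the grid resolution.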
Abstract
State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and scored by a learned reward model, without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation, that is, the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026).…
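The notion of an imagined rollout described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: the one-dimensional dynamics, the quadratic reward, the linear policy, and the constant model error of 0.01 are all hypothetical.

```python
def imagined_return(s0, policy, dynamics_model, reward_model, horizon, gamma=0.99):
    """Discounted return of a rollout generated only by the supplied models.

    The true environment is never queried; states come from dynamics_model
    and rewards come from reward_model.
    """
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward_model(s, a)
        s = dynamics_model(s, a)
        discount *= gamma
    return total


# Hypothetical toy problem (assumed for illustration only).
def policy(s):
    return -0.5 * s


def true_dynamics(s, a):
    return s + a


def learned_dynamics(s, a):
    return s + a + 0.01  # assumed constant model error


def reward(s, a):
    return -s * s


if __name__ == "__main__":
    true_ret = imagined_return(1.0, policy, true_dynamics, reward, 3, gamma=1.0)
    model_ret = imagined_return(1.0, policy, learned_dynamics, reward, 3, gamma=1.0)
    print("return under true dynamics:   ", true_ret)
    print("return under learned dynamics:", model_ret)
    print("absolute return error:        ", abs(true_ret - model_ret))
```

The gap between the two returns is the kind of quantity the return-error bound controls; in this toy setup the dynamics error feeds into later rewards, so it compounds over the horizon.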
