Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn

TL;DR
This paper introduces long-horizon Q-learning (LQL), a method that stabilizes long-horizon value learning by penalizing violations of n-step inequalities, and reports gains over both 1-step and n-step TD learning.
Contribution
LQL provides a practical stabilization mechanism for Q-learning: it uses n-step inequalities to improve long-horizon value estimation without requiring extra networks or forward passes.
Findings
LQL outperforms 1-step and n-step TD learning on various benchmarks.
LQL requires no additional networks or forward passes compared to standard Q-learning.
LQL consistently improves performance when combined with state-of-the-art methods.
Abstract
Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a…
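The inequality described in the abstract can be illustrated with a small sketch. For an observed trajectory segment, the discounted return of the n observed rewards plus the discounted greedy value at the state reached after n steps should not exceed Q(s_t, a_t). The code below computes that lower bound and a squared hinge penalty on violations. It is a minimal illustration, not the paper's loss: the function names, the squared-hinge form, and the array layout are all assumptions, and in stochastic environments the inequality holds only in expectation rather than per sample.

```python
import numpy as np

def n_step_lower_bounds(rewards, max_q_next, gamma, n):
    """Hypothetical helper: n-step lower bounds along one trajectory segment.

    rewards[t]    : observed reward r_t
    max_q_next[t] : max_a Q(s_{t+1}, a), as a 1-step TD target would use
    Returns, for each t with t + n <= T, the value
        sum_{k<n} gamma^k * r_{t+k} + gamma^n * max_a Q(s_{t+n}, a).
    """
    rewards = np.asarray(rewards, dtype=float)
    max_q_next = np.asarray(max_q_next, dtype=float)
    T = len(rewards)
    discounts = gamma ** np.arange(n)
    bounds = np.empty(T - n + 1)
    for t in range(T - n + 1):
        observed_return = np.dot(discounts, rewards[t:t + n])
        # The greedy value at s_{t+n} is stored at index t + n - 1.
        bounds[t] = observed_return + gamma ** n * max_q_next[t + n - 1]
    return bounds

def violation_penalty(q_taken, bounds):
    """Assumed squared hinge: nonzero only where Q(s_t, a_t) is below the bound."""
    q_taken = np.asarray(q_taken, dtype=float)[:len(bounds)]
    return float(np.mean(np.maximum(0.0, bounds - q_taken) ** 2))

# Toy segment: three steps with reward 1 and a greedy next-state value of 0.
rewards = [1.0, 1.0, 1.0]
max_q_next = [0.0, 0.0, 0.0]
q_taken = [1.0, 2.0, 0.0]  # Q(s_t, a_t) for the actions actually taken

bounds = n_step_lower_bounds(rewards, max_q_next, gamma=0.5, n=2)
penalty = violation_penalty(q_taken, bounds)
```

In this toy segment only the first step violates its bound (Q = 1.0 against a bound of 1.5), so only that step contributes to the penalty. The inputs are quantities that a 1-step update over the same segment would already evaluate, which is one plausible reading of the claim that no additional forward passes are needed.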
