Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar

TL;DR
This paper introduces RC, an iterative decoding method that lets LLMs continually improve their reasoning over long horizons, including horizons well beyond those seen during training.
Contribution
The paper presents RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference, letting LLMs extrapolate and keep improving over reasoning horizons more than an order of magnitude longer than those seen during training.
Findings
Models trained with RC outperform baseline models on reasoning tasks.
RC enables models to extrapolate reasoning beyond training horizons.
Training with RC improves how effectively models use scaffolds.
Abstract
Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model…
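The abstract describes RC only at a high level: decoding proceeds in iterations, and the model's summarization ability is used to carry progress from one iteration to the next. The following is a minimal sketch of that general pattern, assuming each iteration generates a response conditioned on a running summary and then compresses it into a new summary. The names `rc_decode`, `generate`, and `summarize`, and the toy stand-ins for the model calls, are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: a real system would call an LLM in both places.
GenerateFn = Callable[[str, str], str]        # (problem, summary) -> response
SummarizeFn = Callable[[str, str, str], str]  # (problem, summary, response) -> new summary


def rc_decode(
    problem: str,
    generate: GenerateFn,
    summarize: SummarizeFn,
    num_iterations: int,
) -> Tuple[str, List[str]]:
    """Alternate short generation passes with summarization (illustrative only)."""
    summary = ""  # assumption: the first iteration starts with an empty summary
    responses: List[str] = []
    for _ in range(num_iterations):
        # Each pass sees the problem plus the compressed record of earlier passes.
        response = generate(problem, summary)
        responses.append(response)
        # Compress the new response so the next pass can build on it.
        summary = summarize(problem, summary, response)
    return summary, responses


def toy_generate(problem: str, summary: str) -> str:
    """Toy stand-in for the model: bump the previous estimate by one."""
    previous = int(summary) if summary else 0
    return f"estimate {previous + 1}"


def toy_summarize(problem: str, summary: str, response: str) -> str:
    """Toy stand-in for summarization: keep only the latest estimate."""
    return response.split()[-1]


if __name__ == "__main__":
    final_summary, all_responses = rc_decode("toy problem", toy_generate, toy_summarize, 3)
    print(final_summary, all_responses)
```

Because each pass in this sketch is conditioned only on the problem and a short summary, the per-pass context stays bounded while the number of iterations can grow, which is one way to read the "long horizons via short-horizon RL" framing in the title.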
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
