Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, and Jianfeng Gao

TL;DR
This paper introduces Progressive Thought Encoding, a memory-efficient fine-tuning method that enhances reasoning accuracy of large models under fixed-size caches, significantly reducing training memory requirements.
Contribution
We propose a novel Progressive Thought Encoding technique that enables large reasoning models to reason effectively with fixed-size caches, improving training efficiency and scalability.
Findings
Achieves up to 23.4% accuracy improvement on AIME benchmarks.
Outperforms LoRA-based fine-tuning by 19.3%.
Reduces memory usage during training while maintaining inference performance.
Abstract
Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used…
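As a rough illustration of the fixed-size cache idea described in the abstract, the sketch below keeps a sliding window of recent token vectors and folds each evicted vector into a single fixed-size summary. It is not the authors' implementation: the class `FixedSizeCache`, its parameters `window` and `dim`, and the running-mean update are all assumptions made for illustration, and the running mean stands in for the learned encoder the paper describes.

```python
from collections import deque
from typing import Deque, List


class FixedSizeCache:
    """Sliding-window cache that folds evicted entries into one summary vector.

    Hypothetical illustration only: the running mean used here is a stand-in
    for a learned encoder and is NOT the method from the paper.
    """

    def __init__(self, window: int, dim: int) -> None:
        if window < 1 or dim < 1:
            raise ValueError("window and dim must be positive")
        self.window = window
        self.entries: Deque[List[float]] = deque()  # most recent token vectors
        self.summary: List[float] = [0.0] * dim     # fixed-size encoding of evicted tokens
        self.evicted = 0                            # how many tokens the summary covers

    def append(self, vec: List[float]) -> None:
        """Add a token vector; if the window overflows, encode the oldest one."""
        if len(vec) != len(self.summary):
            raise ValueError("vector has the wrong dimension")
        self.entries.append(list(vec))
        if len(self.entries) > self.window:
            old = self.entries.popleft()
            self.evicted += 1
            # Incremental mean: summary += (old - summary) / n
            self.summary = [
                s + (o - s) / self.evicted for s, o in zip(self.summary, old)
            ]

    def stored_vectors(self) -> int:
        """Number of vectors held: the window entries plus the one summary."""
        return len(self.entries) + 1


# Usage: feed 10 token vectors through a window of 4.
cache = FixedSizeCache(window=4, dim=2)
for t in range(10):
    cache.append([float(t), 1.0])

print(cache.stored_vectors(), cache.evicted, cache.summary)
```

However long the sequence grows, the structure holds at most `window` recent vectors plus one summary, which is the constant-memory property the abstract refers to. A sliding window alone would discard the evicted vectors entirely; here they still contribute through the summary.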
Peer Reviews
Decision·ICLR 2026 Poster
A good practical method for reducing memory footprint and compute during the RL stage for reasoning models.
The comparison with prior work seems incomplete. The proposed method uses global tokens; shouldn't these tokens also be used in LoRA and LoRA_c for a fair comparison? I would recommend removing Sections 2.1, 2.2, and 2.3 from the "related work". This is common knowledge that is only weakly related to the proposed method.
1. The proposed method outperforms both LoRA and LoRA_c in terms of accuracy across various models and evaluation datasets, while requiring only a marginal increase in computational resources compared to LoRA_c.
2. The method's reliability is demonstrated through comprehensive evaluation across multiple benchmark datasets.
1. Among the three models evaluated in this paper, Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct are not LRMs. Their outputs are shorter than those of DeepSeek-R1-Distill-Llama-8B, which makes the results on them less persuasive.
2. The maximum sequence length is only 3072, and there are no evaluation experiments at longer reasoning lengths.
- The idea of encoding the "to be evicted" tokens into latent vectors, which are then embedded into the LoRA adapter, is very novel and clever. This makes the LoRA adapter act like a learned dynamic memory without growing the KV cache linearly.
- Well-formulated and simple idea with clear experiments.
- Good ablations and analysis.
- The results and experimental setup are not convincing enough.
- It is unclear whether the method applies to fully fine-tuned models, as opposed to the LoRA-only setting in this paper. It is also unclear whether it will carry over to larger models, though I understand that is outside the scope of this study.
- A max sequence length of 3072 is not enough for reasoning RL runs and doesn't give me confidence in the results, especially given that the authors tout this as beneficial for long reasoning RL. Only at longer context lengths
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
