ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng, Haiwei Wang, Wenyuan Jiang, Miaohui Wang

TL;DR
ExpThink introduces an experience-guided reinforcement learning framework that dynamically balances reasoning accuracy and token efficiency, significantly reducing response length while improving performance on mathematical reasoning tasks.
Contribution
It proposes a novel RL approach that combines experience-guided reward shaping with a difficulty-adaptive advantage to enhance CoT compression without manual tuning.
Findings
Reduces average response length by up to 77%.
Achieves up to 3x higher accuracy-efficiency ratio.
Outperforms existing RL-based compression methods.
Abstract
Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second,…
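A minimal sketch of the three-tier, experience-guided reward described in the abstract may help make the mechanism concrete. It is not the authors' implementation: the class name `ExperienceRewardSketch`, the `discount` value (0.5), the `tolerance` factor that defines "concise" relative to the shortest correct length seen so far, the choice to give full credit when a problem has no recorded correct solution yet, and the per-group update order are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ExperienceRewardSketch:
    """Hypothetical sketch of a three-tier, experience-guided reward.

    Assumptions (not taken from the paper):
      * `discount` is the credit given to correct-but-verbose responses.
      * A correct response is "concise" if its length is at most
        `tolerance` times the shortest correct length recorded so far.
      * With no record yet for a problem, every correct response is concise.
    """
    discount: float = 0.5
    tolerance: float = 1.0
    # problem_id -> shortest correct response length (in tokens) seen so far
    best_len: Dict[str, int] = field(default_factory=dict)

    def threshold(self, problem_id: str) -> Optional[float]:
        """Current conciseness threshold, or None if no correct answer yet."""
        best = self.best_len.get(problem_id)
        return None if best is None else self.tolerance * best

    def score_group(self, problem_id: str,
                    rollouts: List[Tuple[bool, int]]) -> List[float]:
        """Score a group of (is_correct, num_tokens) rollouts for one problem
        against the current threshold, then tighten the stored record with
        the group's shortest correct length."""
        limit = self.threshold(problem_id)
        rewards = []
        for correct, length in rollouts:
            if not correct:
                rewards.append(0.0)            # tier 3: incorrect
            elif limit is None or length <= limit:
                rewards.append(1.0)            # tier 1: correct and concise
            else:
                rewards.append(self.discount)  # tier 2: correct but verbose
        # Experience update: the threshold can only shrink over time.
        correct_lengths = [n for ok, n in rollouts if ok]
        if correct_lengths:
            new_best = min(correct_lengths)
            old_best = self.best_len.get(problem_id)
            if old_best is None or new_best < old_best:
                self.best_len[problem_id] = new_best
        return rewards


shaper = ExperienceRewardSketch()
# First visit: no record yet, so both correct rollouts get full credit.
print(shaper.score_group("p1", [(True, 800), (False, 300), (True, 500)]))
# The record is now 500 tokens: 450 is concise, 600 is discounted.
print(shaper.score_group("p1", [(True, 450), (True, 600)]))
```

Running the two calls prints `[1.0, 0.0, 1.0]` and then `[1.0, 0.5]`, and the stored record for `p1` tightens from 500 to 450 tokens. Because the record only ever decreases, the bar for full credit rises as the policy finds shorter correct solutions, which is the self-evolving curriculum effect the abstract describes.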
