Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak

TL;DR
This paper introduces algorithms that make small language models reason more token-efficiently by controlling where the thinking phase stops and how long reasoning traces grow, substantially reducing computation without sacrificing accuracy.
Contribution
It proposes two methods for small models, temperature scaling (TS) and TLDR, a length-regularized reinforcement learning method, that control the stopping point and length of reasoning traces, improving efficiency and flexibility.
Findings
TLDR reduces token usage by about 50%.
Temperature scaling effectively controls reasoning length.
The proposed methods outperform baseline approaches on benchmarks.
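To illustrate how a temperature knob can move the stopping point, the toy below assumes TS divides next-token logits by a temperature T before the softmax, and that the thinking phase ends when an end-of-thinking token is sampled with a fixed per-step probability. This summary does not say exactly how the paper applies TS (for example, to all logits or only to the end-of-thinking token), so `stop_probability`, `expected_trace_length`, and the logits are hypothetical.

```python
import math


def stop_probability(logits, stop_index, temperature):
    """Probability of sampling the token at `stop_index` after temperature scaling.

    Dividing logits by T > 1 flattens the distribution, so a low-logit stop
    token becomes more likely; T < 1 sharpens it and makes stopping rarer.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return weights[stop_index] / sum(weights)


def expected_trace_length(p_stop):
    """Expected number of steps until the stop token is first sampled,
    assuming a constant per-step stop probability (geometric distribution)."""
    if not 0 < p_stop <= 1:
        raise ValueError("p_stop must be in (0, 1]")
    return 1.0 / p_stop


# Hypothetical logits at every thinking step: [keep thinking, end-of-thinking marker].
toy_logits = [2.0, 0.0]
for t in (0.5, 1.0, 2.0):
    p = stop_probability(toy_logits, stop_index=1, temperature=t)
    print(f"T={t}: p_stop={p:.4f}, expected steps={expected_trace_length(p):.1f}")
```

With these toy logits, raising T from 0.5 to 1.0 to 2.0 raises the stop probability from about 0.0180 to 0.1192 to 0.2689, shrinking the expected trace from about 55.6 to 8.4 to 3.7 steps. The direction holds here only because the stop token has the lower logit; in a real model the effect depends on the actual logits at each step.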
Abstract
Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models distilled with supervised fine-tuning (SFT). In this work, we propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation. We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs. Verbosity also varies significantly between incorrect and correct responses. To address these issues, we propose two solutions: (1) Temperature scaling (TS) to control the stopping point for the thinking phase and thereby trace length, and (2) TLDR: a length-regularized reinforcement learning method based on…
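The abstract is cut off before TLDR's objective is specified, so the function below is only a hypothetical example of a length-regularized RL reward: a correctness term minus a penalty that grows with the share of a token budget the trace consumes. `length_regularized_reward`, `max_tokens`, and `length_weight` are illustrative names and values, not taken from the paper.

```python
def length_regularized_reward(is_correct, num_tokens, max_tokens, length_weight=0.2):
    """Hypothetical length-regularized reward for RL fine-tuning.

    Returns 1.0 for a correct answer and 0.0 otherwise, minus a penalty
    proportional to the fraction of the token budget the trace used.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if not 0 <= length_weight < 1:
        raise ValueError("length_weight must be in [0, 1)")
    correctness = 1.0 if is_correct else 0.0
    # Clip at the budget so the penalty never exceeds length_weight.
    used_fraction = min(num_tokens, max_tokens) / max_tokens
    return correctness - length_weight * used_fraction


# Two correct traces and one incorrect trace under a hypothetical 4096-token budget.
for correct, tokens in [(True, 800), (True, 3200), (False, 400)]:
    r = length_regularized_reward(correct, tokens, max_tokens=4096)
    print(f"correct={correct}, tokens={tokens}: reward={r:.3f}")
```

Normalizing by the budget bounds the penalty to [0, `length_weight`], so with `length_weight` below 1 any correct answer still outscores any incorrect one; the weight then only sets how strongly shorter correct traces are preferred over longer ones.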
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Shrink and Fine-Tune · Spatio-temporal stability analysis
