TL;DR
Tandem is a collaborative framework that combines large and small language models to perform reasoning tasks more efficiently, reducing computational costs by about 40% while maintaining high performance.
Contribution
The paper introduces a novel LLM-SLM collaboration approach with a cost-aware termination mechanism for efficient reasoning, with code available online.
Findings
Reduces computational costs by approximately 40% compared to standalone LLM reasoning.
Achieves superior or competitive performance on mathematical reasoning and code generation benchmarks.
The sufficiency classifier transfers effectively across domains without retraining.
Abstract
Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces…
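The division of labor described in the abstract can be pictured with a minimal sketch. The abstract is truncated before it explains the termination mechanism, so everything below is an assumption made for illustration: the function names (`tandem_sketch`, `llm_next_insight`, `is_sufficient`, `slm_solve`), the one-insight-at-a-time loop, and the `max_insights` cap are hypothetical and are not taken from the paper or its released code.

```python
from typing import Callable, List, Tuple


def tandem_sketch(
    question: str,
    llm_next_insight: Callable[[str, List[str]], str],
    is_sufficient: Callable[[str, List[str]], bool],
    slm_solve: Callable[[str, List[str]], str],
    max_insights: int = 4,
) -> Tuple[str, List[str]]:
    """Hypothetical LLM-SLM collaboration loop (not the authors' code).

    The large model contributes short insights one at a time. After each
    insight, a sufficiency check decides whether the small model has enough
    guidance, so the expensive model stops generating as early as possible.
    """
    insights: List[str] = []
    for _ in range(max_insights):
        # Expensive step: ask the large model for one more compact insight.
        insights.append(llm_next_insight(question, insights))
        # Assumed cost-aware stop: quit once the guidance is judged sufficient.
        if is_sufficient(question, insights):
            break
    # Cheap step: the small model carries out the full reasoning and answers.
    return slm_solve(question, insights), insights


# Toy stand-ins so the sketch runs without any model.
def toy_llm(question: str, insights: List[str]) -> str:
    return f"hint {len(insights) + 1}"


def toy_sufficient(question: str, insights: List[str]) -> bool:
    return len(insights) >= 2


def toy_slm(question: str, insights: List[str]) -> str:
    return f"answer to {question!r} using {len(insights)} hints"


answer, used = tandem_sketch("2 + 2 = ?", toy_llm, toy_sufficient, toy_slm)
print(answer)
print(used)
```

In this toy run the loop stops after two insights because the stand-in check says two are enough. In a real system the three callables would wrap model calls and a trained classifier, and the stopping rule would follow whatever the paper actually specifies.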
