TL;DR
GlimpRouter is a training-free framework that improves reasoning efficiency by routing steps to large or small models based on initial token entropy, reducing latency while maintaining accuracy.
Contribution
It introduces a novel, entropy-based, step-wise collaboration method that predicts reasoning step difficulty from the first token, enabling efficient inference without additional training.
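The routing idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `token_entropy` and `route_step`, the threshold value of 1.0 nats, and the example probability vectors are all hypothetical, and this summary does not say which model supplies the first-token distribution.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    # Zero-probability entries contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs: Sequence[float], threshold: float = 1.0) -> str:
    """Pick a model for the next reasoning step from its first-token entropy.

    The threshold is a hypothetical value chosen for illustration only.
    """
    # High entropy is treated as a sign of a hard step: send it to the large model.
    if token_entropy(first_token_probs) > threshold:
        return "large"
    # Low entropy is treated as a sign of a routine step: keep it on the small model.
    return "small"


# Made-up distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # sharply peaked, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform, entropy = ln(4)

print(route_step(confident), route_step(uncertain))
```

Because the decision uses only a single token's distribution, it adds little overhead per step, which is consistent with the training-free, low-latency framing above.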
Findings
Achieves 10.7% accuracy improvement on AIME25.
Reduces inference latency by 25.9% compared to large models.
Effectively predicts step difficulty using initial token entropy.
Abstract
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty.…
