CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao

TL;DR
CITER is a framework that improves large language model inference efficiency by routing tokens between small and large models based on their importance, reducing costs while maintaining quality.
Contribution
We introduce CITER, a novel token-level routing method that improves inference efficiency through dynamic collaboration between small and large language models.
Findings
Significant reduction in inference costs across five benchmarks.
Maintains high-quality generation comparable to full large models.
Effective token routing learned through policy optimization.
Abstract
Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its…
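The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `collaborative_decode`, `toy_reward`, the fixed `threshold`, the `cost_weight` value, and the toy SLM, LLM, and router below are all assumptions made for illustration. The sketch only shows the control flow of sending each token to one of two models based on a router score, plus one plausible reward shape that trades prediction quality against LLM cost.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a "model" maps the current token prefix to the next token,
# and the router maps the prefix to a score in [0, 1] (higher = more critical).
NextToken = Callable[[List[str]], str]
Router = Callable[[List[str]], float]


def collaborative_decode(
    prompt: List[str],
    slm: NextToken,
    llm: NextToken,
    router: Router,
    threshold: float = 0.5,      # assumed decision rule: score >= threshold -> LLM
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], int]:
    """Generate tokens one at a time, routing each step to the SLM or the LLM."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new_tokens):
        use_llm = router(tokens) >= threshold
        next_token = llm(tokens) if use_llm else slm(tokens)
        llm_calls += int(use_llm)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[len(prompt):], llm_calls


def toy_reward(correct: bool, used_llm: bool, cost_weight: float = 0.1) -> float:
    """Assumed reward shape: a quality term minus a penalty for calling the LLM."""
    return (1.0 if correct else 0.0) - (cost_weight if used_llm else 0.0)


# Toy setup (entirely made up): the SLM gets every token right except one
# "critical" position, and the router happens to flag exactly that position.
TARGET = ["2", "+", "2", "=", "4", "<eos>"]
PROMPT = ["Q:"]


def toy_llm(tokens: List[str]) -> str:
    return TARGET[len(tokens) - len(PROMPT)]


def toy_slm(tokens: List[str]) -> str:
    position = len(tokens) - len(PROMPT)
    return "5" if position == 4 else TARGET[position]  # wrong on the critical token


def toy_router(tokens: List[str]) -> float:
    return 0.9 if len(tokens) - len(PROMPT) == 4 else 0.1


generated, llm_calls = collaborative_decode(PROMPT, toy_slm, toy_llm, toy_router)
print(generated, llm_calls)
```

In this toy run only one of the six generated tokens is sent to the large model. In the paper the router is learned through policy optimization rather than hard-coded, so `toy_router` stands in for a trained scoring model.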
Peer Reviews
Decision · Submitted to ICLR 2025
The proposed method is sound, and positive results are demonstrated across multiple benchmarks.
1) In my opinion, I don't see significant advantages over (or differences from) existing collaborative decoding methods. For example, this paper cites Co-LLM; what is the core difference from it, and what is the core difference between this work and the work by Yejin's team at UW? 2) Only one policy is used (Qwen). 3) The paper is not clear to read; sections like 2.1.2 do not need so many equations. I'm open to upgrading my score if the paper is significantly improved.
1. The token-level routing framework for collaborative inference is quite novel. The idea of using small language models to collaboratively generate tokens, reducing the amount of generation performed by large language models, is very interesting for accelerating model inference. 2. The experimental design for evaluating the CITER framework's inference acceleration is comprehensive, with thorough evaluations conducted across multiple benchmark datasets.
1. The paper only conducts experiments with the Qwen series of models. If the model were switched to the Llama3 series, would the CITER architecture still be able to achieve rapid inference with large models? 2. The generality of the rapid convergence of the iterative training process is not supported by detailed evidence, which undermines the validity of the iterative training approach.
- The paper is well written and easy to follow. - The RL-based router training method is novel, and a shortcut to the reward function is proposed to make training easier. - Experimental results show that the proposed method can achieve better performance under the same number of LLM calls.
- While the framework introduces a shortcut for estimating the reward function, the initial training of the token-level router still requires significant computational resources due to the need for reinforcement learning, which can be a barrier for practical implementation. - The effectiveness of CITER heavily relies on the accuracy of token importance predictions. If the router fails to accurately assess which tokens are critical, it could lead to suboptimal routing decisions, potentially compr
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Graph Neural Networks
