CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

Wenhao Zheng; Yixiao Chen; Weitong Zhang; Souvik Kundu; Yun Li; Zhengzhong Liu; Eric P. Xing; Hongyi Wang; Huaxiu Yao

arXiv:2502.01976 · cs.CL · September 11, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

CITER is a framework that improves large language model inference efficiency by routing tokens between small and large models based on their importance, reducing costs while maintaining quality.

Contribution

We introduce CITER, a novel token-level routing method that optimizes inference efficiency by dynamically collaborating between small and large language models.
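To make the idea of token-level routing concrete, here is a minimal sketch of a collaborative decoding loop. It is an illustration under stated assumptions, not the authors' implementation: the function `collaborative_decode`, the `threshold` parameter, and the toy stand-in models and router are all hypothetical names introduced for this example.

```python
from typing import Callable, List, Tuple

Token = str
# Hypothetical interfaces (assumptions, not from the paper):
# a "model" maps the current token prefix to the next token, and
# a "router" maps the prefix to a score in [0, 1] (higher = more critical).
NextToken = Callable[[List[Token]], Token]
Router = Callable[[List[Token]], float]


def collaborative_decode(
    prompt: List[Token],
    slm: NextToken,
    llm: NextToken,
    router: Router,
    threshold: float = 0.5,
    max_new_tokens: int = 8,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Generate tokens one at a time, choosing a model per token."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new_tokens):
        # Route this position: critical tokens go to the large model.
        if router(tokens) >= threshold:
            next_token = llm(tokens)
            llm_calls += 1
        else:
            next_token = slm(tokens)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens, llm_calls


# Toy stand-ins so the sketch runs without any real model.
def toy_slm(prefix: List[Token]) -> Token:
    return "the"


def toy_llm(prefix: List[Token]) -> Token:
    return "Paris"


def toy_router(prefix: List[Token]) -> float:
    # Pretend the token following "is" is the critical one.
    return 0.9 if prefix and prefix[-1] == "is" else 0.1


output, calls = collaborative_decode(
    ["The", "capital", "of", "France", "is"],
    toy_slm, toy_llm, toy_router, max_new_tokens=3,
)
```

In this toy run only the first generated token is sent to the large model; the remaining positions are handled by the small model, which is where the cost saving would come from in a real system.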

Findings

01. Significant reduction in inference costs across five benchmarks.
02. Maintains high-quality generation comparable to full large models.
03. Effective token routing learned through policy optimization.

Abstract

Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization problem, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its…
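The abstract describes a reward that depends on both prediction quality and inference cost. The sketch below shows one simple way such a trade-off could look. The specific form (a 0/1 quality term minus a weighted 0/1 cost term), the `cost_weight` parameter, and both function names are assumptions made for illustration; the paper's actual reward is not given on this page, and this toy version ignores the future impact of a routing decision.

```python
def routing_reward(correct: bool, used_llm: bool, cost_weight: float = 0.2) -> float:
    """Assumed per-token reward: quality minus a penalty for calling the LLM."""
    quality = 1.0 if correct else 0.0
    cost = 1.0 if used_llm else 0.0
    return quality - cost_weight * cost


def preferred_action(slm_correct: bool, llm_correct: bool, cost_weight: float = 0.2) -> str:
    """Pick the routing action with the higher assumed reward for one token."""
    reward_slm = routing_reward(slm_correct, used_llm=False, cost_weight=cost_weight)
    reward_llm = routing_reward(llm_correct, used_llm=True, cost_weight=cost_weight)
    # Ties and losses default to the cheaper small model.
    return "llm" if reward_llm > reward_slm else "slm"


# The three cases a router trained on such a reward would need to separate.
both_right = preferred_action(slm_correct=True, llm_correct=True)
only_llm_right = preferred_action(slm_correct=False, llm_correct=True)
both_wrong = preferred_action(slm_correct=False, llm_correct=False)
```

Under this assumed reward, the large model is preferred only when it is correct and the small model is not; in the other two cases the cheaper model wins, which matches the intuition of sending only critical tokens to the LLM.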

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The proposed method is sound, and positive results are demonstrated across multiple benchmarks.

Weaknesses

1) In my opinion, I don't see significant advantages over (or differences from) existing collaborative decoding methods. For example, this paper cites Co-LLM: what is the core difference from that work, and from the work by Yejin's team at UW?
2) Only one policy is used (Qwen).
3) The paper is not clear to read; sections like 2.1.2 do not need so many equations.
I'm open to upgrading my score if the paper is significantly improved.

Reviewer 02 · Rating 5 · Confidence 2

Strengths

1. The token-level routing framework for collaborative inference is quite novel. The idea of using small language models to collaboratively generate tokens, reducing how much generation the large language model must perform, is very interesting for accelerating inference.
2. The experimental design for evaluating the CITER framework's inference acceleration is comprehensive, with thorough evaluations conducted across multiple benchmark datasets.

Weaknesses

1. The paper only conducts experiments with the Qwen series of models. If the models were switched to the Llama 3 series, would the CITER architecture still achieve fast inference with large models?
2. The claim that the iterative training process converges rapidly in general is not supported by detailed evidence, which undermines the validity of the iterative training approach.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper is well written and easy to follow.
- The RL-based router training method is novel, and a shortcut to the reward function is proposed to make training easier.
- Experimental results show that the proposed method achieves better performance for the same number of LLM calls.

Weaknesses

- While the framework introduces a shortcut for estimating the reward function, the initial training of the token-level router still requires significant computational resources due to the need for reinforcement learning, which can be a barrier for practical implementation.
- The effectiveness of CITER heavily relies on the accuracy of token importance predictions. If the router fails to accurately assess which tokens are critical, it could lead to suboptimal routing decisions, potentially compr

Code & Models

Repositories

aiming-lab/CITER
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Graph Neural Networks