Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Sixun Dong; Juhua Hu; Steven Li; Wei Wen; Qi Qian

arXiv:2604.04929 · cs.CV · April 7, 2026

TL;DR

This paper proposes a multi-agent inference framework that enhances vision-language model efficiency by transferring reasoning tokens from smaller models to large models, reducing latency while maintaining performance.

Contribution

It introduces a multi-agent inference approach in which reasoning tokens produced by a smaller model are transferred to a large model, improving the large model's efficiency without sacrificing accuracy.
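The token-transfer idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Generate` callable interface, the prompt wording, and the two token budgets are hypothetical and are not taken from the paper, whose actual procedure is not detailed on this page. The sketch only shows the general shape: the small model writes the long reasoning, and the large model reads it as part of its prompt and emits a short answer.

```python
from typing import Callable

# Hypothetical interface: a "model" maps (prompt, max_new_tokens) to generated text.
Generate = Callable[[str, int], str]


def answer_with_transferred_reasoning(
    question: str,
    small_model: Generate,
    large_model: Generate,
    reasoning_budget: int = 256,  # assumed budget for the small model's reasoning
    answer_budget: int = 16,      # assumed budget for the large model's short answer
) -> str:
    """Small model writes the reasoning; large model reads it and answers briefly."""
    reasoning = small_model(
        f"Question: {question}\nThink step by step.", reasoning_budget
    )
    # The transferred reasoning becomes part of the large model's input prompt,
    # so the large model only has to decode a short answer.
    prompt = f"Question: {question}\nReasoning: {reasoning}\nAnswer briefly:"
    return large_model(prompt, answer_budget)
```

In a sketch like this, the large model spends its sequential decoding steps only on the short answer, while the transferred reasoning arrives as input text.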

Findings

01. Large models with fewer output tokens can outperform smaller models with longer outputs.

02. Feeding a large model the reasoning tokens generated by a smaller model brings results close to large-model performance.

03. The proposed method reduces inference latency while maintaining or improving accuracy.

Abstract

Most vision-language models (VLMs) use a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. The number of output tokens can therefore be the bottleneck of end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms the observation: a large model can achieve performance better than or comparable to that of a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference…
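The latency argument in the abstract can be made concrete with a simple cost model. This is a sketch under stated assumptions, not the paper's measurement setup: it assumes latency is a one-off prefill cost plus a fixed cost per decoded token, and the `ModelProfile` class and every number below are hypothetical values chosen only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical latency profile; all values used below are illustrative."""
    name: str
    prefill_ms: float           # one-off cost to encode the image and prompt
    decode_ms_per_token: float  # cost of each sequential decoding step


def end_to_end_latency_ms(profile: ModelProfile, output_tokens: int) -> float:
    """Pay prefill once, then the per-token decode cost for every output token."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return profile.prefill_ms + profile.decode_ms_per_token * output_tokens


# Assumed profiles: the large model is 4x slower per token than the small one.
small = ModelProfile("small", prefill_ms=50.0, decode_ms_per_token=10.0)
large = ModelProfile("large", prefill_ms=200.0, decode_ms_per_token=40.0)

small_long = end_to_end_latency_ms(small, output_tokens=500)  # long reasoning trace
large_short = end_to_end_latency_ms(large, output_tokens=20)  # short direct answer

print(f"small model, 500 tokens: {small_long:.0f} ms")
print(f"large model,  20 tokens: {large_short:.0f} ms")
```

Under these assumed numbers the large model finishes first despite its higher per-token cost, because the decode term grows linearly with the number of output tokens and dominates for long responses.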
