Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Jun Zhang; Yicheng Ji; Feiyang Ren; Yihang Li; Bowen Zeng; Zonghao Chen; Ke Chen; Lidan Shou; Gang Chen; Huan Li

arXiv:2604.05546·cs.CL·April 15, 2026

TL;DR

This paper systematically analyzes the efficiency bottlenecks in large vision-language models during inference, proposing a structured taxonomy of techniques and outlining future research directions.

Contribution

It analyzes LVLM inference end to end, across encoding, prefilling, and decoding, and organizes existing efficiency techniques into a taxonomy built around that lifecycle rather than treating each optimization in isolation.

Findings

01. Identifies visual token dominance as a key efficiency barrier.

02. Analyzes how upstream decisions affect downstream bottlenecks.

03. Outlines future research directions with empirical insights.

Abstract

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing…
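
The scaling behaviour the abstract names (quadratic attention in prefilling, a growing key-value cache in decoding) can be illustrated with a back-of-the-envelope sketch. This is not taken from the paper: the function names and every model dimension below (32 layers, 32 heads, head size 128, 576 visual tokens, 64 text tokens, 16-bit cache entries) are hypothetical values chosen only for illustration, and the FLOP count ignores causal masking and everything outside the attention score and value products.

```python
def visual_token_fraction(num_visual: int, num_text: int) -> float:
    """Share of the prompt occupied by visual tokens."""
    return num_visual / (num_visual + num_text)


def attention_flops(num_tokens: int, num_layers: int,
                    num_heads: int, head_dim: int) -> int:
    """Rough prefill cost of the attention score and value products.

    Per head, QK^T and the weighted sum over V each cost about
    2 * n^2 * d FLOPs (one multiply-add counted as 2 FLOPs), so the
    total grows quadratically with the number of tokens n.
    """
    per_head = 2 * (2 * num_tokens * num_tokens * head_dim)
    return num_layers * num_heads * per_head


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the key-value cache that decoding must read per step.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, not measurements from the paper.
    layers, heads, dim = 32, 32, 128
    n_visual, n_text = 576, 64
    n_total = n_visual + n_text

    print("visual share of prompt:", visual_token_fraction(n_visual, n_text))
    print("attention FLOPs, text only:", attention_flops(n_text, layers, heads, dim))
    print("attention FLOPs, with image:", attention_flops(n_total, layers, heads, dim))
    print("KV cache bytes, with image:", kv_cache_bytes(n_total, layers, heads, dim))
```

Under these assumed numbers, visual tokens make up most of the prompt, doubling the token count quadruples the attention cost in prefilling, and the cache read at every decoding step grows linearly with the token count.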

Code & Models

Repositories

SuDIS-ZJU/Efficient-LVLMs-Inference (GitHub)