TL;DR
This paper systematically analyzes the efficiency bottlenecks in large vision-language models during inference, proposing a structured taxonomy of techniques and outlining future research directions.
Contribution
It provides the first end-to-end analysis of inference bottlenecks in LVLMs, integrating various optimization strategies and proposing a comprehensive taxonomy.
Findings
Identifies visual token dominance as a key efficiency barrier.
Analyzes how upstream decisions affect downstream bottlenecks.
Outlines future research directions with empirical insights.
Abstract
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the "visual memory wall" in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing…
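To make the three cost regimes named in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the paper's own model. It assumes a ViT-style encoder that emits one token per patch with no pooling or merging, a standard dense-attention decoder, and an fp16 KV cache. The function names, the patch size of 14, and the model dimensions in the usage lines are illustrative assumptions, not values from the paper.

```python
def visual_tokens(height: int, width: int, patch: int = 14) -> int:
    """Patch tokens produced for one image (assumes no pooling or token merging)."""
    return (height // patch) * (width // patch)


def attention_flops_per_layer(n_tokens: int, d_model: int) -> int:
    """Approximate prefill FLOPs for QK^T plus the attention-weighted sum over V.

    Each of the two matrix products costs about 2 * n^2 * d FLOPs,
    so the total grows quadratically with the number of tokens.
    """
    return 4 * n_tokens ** 2 * d_model


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that must be read at every decoding step.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical configuration: 32 layers, 32 KV heads of width 128 (d_model = 4096).
low = visual_tokens(336, 336)    # tokens at the lower resolution
high = visual_tokens(672, 672)   # doubling each side quadruples the token count

flops_ratio = attention_flops_per_layer(high, 4096) / attention_flops_per_layer(low, 4096)
cache_ratio = kv_cache_bytes(high, 32, 32, 128) / kv_cache_bytes(low, 32, 32, 128)
print(low, high, flops_ratio, cache_ratio)
```

Under these assumptions, doubling the image side length quadruples the visual token count, which multiplies the attention cost in prefilling by sixteen while the KV cache read during decoding grows linearly with the token count. This is one way to see how an upstream encoding decision carries through to both downstream stages.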
