Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter

TL;DR
This paper investigates the trade-off between the number of visual tokens and LLM size in vision-language models under a fixed inference budget. For visual reasoning tasks, it finds that very few visual tokens paired with the largest LLM that fits the budget gives the best performance.
Contribution
It establishes scaling laws for visual tokens and model size trade-offs, and introduces prompt-based token compression methods tailored for high-compression regimes.
Findings
For visual reasoning tasks, inference-optimal performance comes from using very few visual tokens with the largest LLM that fits the inference budget.
High compression ratios in visual tokens significantly improve inference efficiency.
Prompt-based token compression is effective for high-compression settings.
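To make "prompt-based token compression" concrete, here is a minimal sketch of query-conditioned pooling: a large set of visual tokens is reduced to `k` tokens by cross-attention, with the queries shifted by a text embedding so that the prompt influences what is kept. This is a generic illustration, not the paper's compression module. It has no learned projections, and the function name `compress_visual_tokens`, the mean-pooled queries, and the toy dimensions are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_visual_tokens(visual, text_query, k):
    """Reduce n visual tokens (each of size d) to k tokens, conditioned on a text embedding.

    Hypothetical helper for illustration only; a trained compressor would use
    learned query/key/value projections instead of raw embeddings.
    """
    n, d = len(visual), len(text_query)
    if n % k != 0:
        raise ValueError("number of visual tokens must be divisible by k")
    group = n // k
    compressed = []
    for g in range(k):
        chunk = visual[g * group:(g + 1) * group]
        # Coarse query: mean of one contiguous group of tokens, plus the text embedding.
        query = [sum(tok[j] for tok in chunk) / group + text_query[j] for j in range(d)]
        # Scaled dot-product attention of this query over ALL visual tokens.
        scores = [sum(q * v for q, v in zip(query, tok)) / math.sqrt(d) for tok in visual]
        weights = softmax(scores)
        # Output token: attention-weighted average of the visual tokens.
        compressed.append([sum(w * tok[j] for w, tok in zip(weights, visual)) for j in range(d)])
    return compressed


random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 toy tokens, d = 8
text_query = [random.gauss(0, 1) for _ in range(8)]                   # toy prompt embedding
out = compress_visual_tokens(visual, text_query, k=4)
print(len(out), len(out[0]))  # 4 8
```

Because each output token is a softmax-weighted average of the inputs, changing `text_query` changes which visual tokens dominate the compressed representation, which is the intuition behind conditioning compression on the prompt.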
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their real-world deployment is often constrained by high latency during inference due to the substantial compute required by the LLM to process the large number of input tokens, predominantly arising from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input tokens needed to represent the image, the latter of which has been the focus of many recent efforts around token compression. However, it is unclear what the optimal trade-off is given a fixed inference budget. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture…
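The budget trade-off described in the abstract can be illustrated with simple arithmetic. The sketch below assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per input token. That approximation, the 50-token text prompt, and the 7B-parameter, 576-token reference point are illustrative assumptions, not values taken from this page.

```python
def inference_flops(n_params, n_visual_tokens, n_text_tokens):
    """Approximate forward-pass cost: ~2 FLOPs per parameter per input token (assumed)."""
    return 2 * n_params * (n_visual_tokens + n_text_tokens)


def max_params_under_budget(budget_flops, n_visual_tokens, n_text_tokens):
    """Largest parameter count whose approximate cost stays within the budget."""
    return budget_flops / (2 * (n_visual_tokens + n_text_tokens))


TEXT_TOKENS = 50  # hypothetical prompt length

# Hypothetical reference point: a 7B-parameter LLM reading 576 visual tokens.
budget = inference_flops(7_000_000_000, 576, TEXT_TOKENS)

# Holding the budget fixed, fewer visual tokens leave room for a larger LLM.
for v in (576, 144, 36, 16, 4, 1):
    n = max_params_under_budget(budget, v, TEXT_TOKENS)
    print(f"{v:4d} visual tokens -> up to {n / 1e9:6.1f}B parameters")
```

This arithmetic only shows which (token count, model size) pairs cost the same. Which pair gives the lowest downstream error is the question the paper's scaling laws are meant to answer.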
Peer Reviews
Decision: ICLR 2025 Poster
1. The idea of establishing the scaling law among visual token numbers and model sizes is interesting. They explore optimizing inference costs in VLMs by using a single visual token with the largest possible LLM within a given budget. 2. The paper offers a thorough analysis of the trade-offs between LLM size and the number of visual tokens, covering various scenarios and use cases. This comprehensive approach provides a deeper understanding of VLM optimization. 3. The paper is well-organized and
1. The main concern is the generalization of the scaling law. The paper focuses on visual reasoning tasks and may not fully explore other types of tasks where a single visual token might not be sufficient. For instance, tasks that require detailed image analysis might not benefit as much from such extreme token compression. 2. While the scaling laws developed in the paper are insightful, they are based on a specific set of experiments and models. The findings are heavily dependent on the specif
This paper provides valuable insights showing that larger LLMs enhance visual reasoning performance more than reducing visual tokens. Also, the introduced compression algorithm QueCC demonstrates better performance on benchmarks with high compression, proving its effectiveness.
While the paper demonstrates that larger model sizes can be more effective than increasing token counts for visual reasoning tasks, this approach appears less effective for OCR-related tasks, as acknowledged by the authors. Given that many real-world applications require fine-grained visual understanding, the proposed compression method may not fully address these demands, as evidenced by its performance on the TextVQA benchmark in Table 1. Although the authors provide valuable insights, their c
1. The authors do not limit VLM compression to the vision side, but jointly explore the relationship between LLM size and the number of visual tokens. The paper is well motivated. 2. The practical guideline provides a novel insight into VLM efficiency. The authors carefully point out the limitations of the scaling law, which is not applicable to OCR tasks. 3. The authors conduct comprehensive experiments on various metrics and show an improvement. Furthermore, the discussion is detailed.
1. The employment of cross-attention in QueCC to compress information is common [1]. 2. LLaMA-VID [2] and VoCo-LLaMA [3] have already done token compression in extreme regimes, which is impressive. The author should compare their performance with QueCC. It seems that QueCC is inferior to VoCo-LLaMA on benchmarks such as GQA and MME. 3. There is a lack of ablation experiments, especially the analysis of depth-wise 2D convolution and the injection of text-embedding. 4. The authors’ scaling law see
Taxonomy
Topics: Retinal Imaging and Analysis · Optical Coherence Tomography Applications · Data Stream Mining Techniques
Methods: Focus · Balanced Selection
