Empirical Recipes for Efficient and Compact Vision-Language Models
Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu

TL;DR
This paper analyzes the efficiency bottlenecks of compact vision-language models, develops optimization techniques to significantly reduce inference latency, and introduces ArgusVLM, a new efficient model with strong performance.
Contribution
It provides an empirical end-to-end analysis of inference bottlenecks in compact VLMs and proposes optimization recipes that improve speed while maintaining accuracy, along with a new model family, ArgusVLM.
Findings
Time to first token (TTFT) is reduced by 53% on InternVL3-2B and by 93% on SmolVLM-256M with the optimization recipes.
The optimization recipes apply broadly across VLM architectures and common serving frameworks.
ArgusVLM achieves strong performance with a compact design.
Abstract
Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the…
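To make the TTFT figures concrete, here is a minimal Python sketch of how time to first token can be measured on any token stream, and how a percentage reduction translates into a speedup factor. It is not the authors' profiling code: `time_to_first_token`, `speedup_from_reduction`, and `fake_stream` are hypothetical names introduced for illustration, and `fake_stream` merely simulates prefill with a sleep instead of running a real VLM.

```python
import time
from typing import Callable, Iterable, Iterator, List


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the seconds elapsed until the stream yields its first token."""
    start = clock()
    for _ in stream:
        # Stop at the first token: everything before it counts toward TTFT.
        return clock() - start
    raise ValueError("stream produced no tokens")


def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.93 for 93%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def fake_stream(prefill_seconds: float, tokens: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: sleep to mimic prefill, then emit tokens."""
    time.sleep(prefill_seconds)
    for token in tokens:
        yield token


if __name__ == "__main__":
    ttft = time_to_first_token(fake_stream(0.05, ["a", "cat"]))
    print(f"simulated TTFT: {ttft:.3f} s")
    for name, reduction in [("InternVL3-2B", 0.53), ("SmolVLM-256M", 0.93)]:
        print(f"{name}: {reduction:.0%} lower TTFT = {speedup_from_reduction(reduction):.1f}x faster")
```

Read this way, a 53% TTFT reduction corresponds to roughly a 2.1x speedup to first token, and a 93% reduction to roughly 14.3x. With a real model, `stream` would be whatever token iterator the serving framework exposes.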
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
