An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

TL;DR
This paper introduces FastV, a plug-and-play method that significantly reduces the computational cost of large vision-language models by learning adaptive attention patterns and pruning visual tokens, enabling efficient deployment without performance loss.
Contribution
FastV is a novel, versatile approach that optimizes attention computation in LVLMs, achieving substantial FLOP reduction and enabling efficient inference on edge devices.
Findings
FastV reduces FLOPs by up to 45% in LLaVA-1.5-13B.
FastV maintains performance across a wide range of image and video understanding tasks while cutting computational cost.
FastV's trade-off is Pareto-efficient: a larger model can be compressed to a lower FLOP budget than a smaller model while still performing better.
Abstract
In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly…
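The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that visual tokens are ranked by the average attention they receive at a chosen layer (averaged over heads and query positions) and that the lowest-ranked fraction is dropped before the following layers run. The function names, the nested-list attention layout `attn[head][query][key]`, and the default `prune_ratio` of 0.5 (suggested by the "1/2 tokens" in the title) are all assumptions made for illustration.

```python
def rank_visual_tokens(attn, visual_start, visual_end):
    """Average attention received by each visual token.

    attn is a nested list indexed as attn[head][query][key] (assumed layout).
    Visual tokens occupy positions visual_start .. visual_end - 1.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = {}
    for key in range(visual_start, visual_end):
        total = 0.0
        for head in range(num_heads):
            for query in range(seq_len):
                total += attn[head][query][key]
        scores[key] = total / (num_heads * seq_len)
    return scores


def kept_positions(attn, visual_start, visual_end, prune_ratio=0.5):
    """Positions that survive pruning, in their original order.

    Non-visual tokens are always kept; only visual tokens are ranked and pruned.
    """
    seq_len = len(attn[0])
    scores = rank_visual_tokens(attn, visual_start, visual_end)
    num_visual = visual_end - visual_start
    num_keep = num_visual - int(num_visual * prune_ratio)
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    keep_visual = set(ranked[:num_keep])
    return [
        i for i in range(seq_len)
        if not (visual_start <= i < visual_end) or i in keep_visual
    ]


# Toy example: 1 head, 6 tokens, visual tokens at positions 1..4.
# Every query row uses the same weights, so column scores are easy to read off.
row = [0.1, 0.4, 0.05, 0.3, 0.05, 0.1]
example_attn = [[list(row) for _ in range(6)]]
kept = kept_positions(example_attn, visual_start=1, visual_end=5, prune_ratio=0.5)
print(kept)  # [0, 1, 3, 5]
```

In the toy example, the two visual tokens with the highest received attention (positions 1 and 3) are kept alongside the non-visual tokens, so later layers would process four tokens instead of six. In a real model the kept positions would be used to index the hidden states passed to the subsequent layers.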
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Pruning
