An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen; Haozhe Zhao; Tianyu Liu; Shuai Bai; Junyang Lin; Chang Zhou; Baobao Chang

arXiv:2403.06764·cs.CV·September 4, 2024·1 cites

TL;DR

This paper introduces FastV, a plug-and-play method that cuts the inference cost of large vision-language models (LVLMs) by learning adaptive attention patterns in early layers and pruning visual tokens in later ones. For example, it reduces FLOPs by 45% for LLaVA-1.5-13B without sacrificing performance on a wide range of image and video understanding tasks.

Contribution

FastV is a novel, versatile approach that optimizes attention computation in LVLMs, achieving substantial FLOP reduction and enabling efficient inference on edge devices.

Findings

01

FastV reduces FLOPs by up to 45% in LLaVA-1.5-13B.

02

FastV maintains performance across a wide range of image and video understanding tasks while cutting computational cost.

03

FastV's efficiency-performance trade-off is Pareto-efficient: a larger model can be compressed to a compute budget below that of a smaller model while still outperforming it.
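
As a rough sanity check on the FLOP figure in finding 01, the arithmetic can be sketched as follows. Everything in this sketch is an assumption rather than something stated on this page: the per-layer cost model (4nd² + 2n²d + 2ndm for n tokens, hidden size d, FFN size m), the LLaMA-13B-style dimensions (40 layers, d = 5120, m = 13824), the token counts (576 image tokens plus a hypothetical 64 text tokens), and the reading of the title as "drop half of the image tokens after layer 2".

```python
def layer_flops(n, d, m):
    """Rough FLOPs of one transformer layer over n tokens (assumed cost model)."""
    # 4*n*d^2: Q/K/V/output projections; 2*n^2*d: attention scores and mixing;
    # 2*n*d*m: feed-forward block. This is an approximation, not an exact count.
    return 4 * n * d * d + 2 * n * n * d + 2 * n * d * m


def flop_reduction(n_image, n_text, k, r, num_layers, d, m):
    """Fraction of FLOPs saved when a ratio r of image tokens is dropped after layer k."""
    n_full = n_image + n_text
    n_pruned = n_text + round(n_image * (1 - r))
    baseline = num_layers * layer_flops(n_full, d, m)
    # The first k layers see every token; the remaining layers see the pruned set.
    pruned = (k * layer_flops(n_full, d, m)
              + (num_layers - k) * layer_flops(n_pruned, d, m))
    return 1 - pruned / baseline


# Assumed LLaMA-13B-style settings; 64 text tokens is an illustrative guess.
saving = flop_reduction(n_image=576, n_text=64, k=2, r=0.5,
                        num_layers=40, d=5120, m=13824)
print(f"estimated FLOP reduction: {saving:.1%}")
```

Under these assumed numbers the estimate comes out in the low 40s of percent, in the same ballpark as the reported 45%. The exact value depends on the prompt length and on which operations the cost model counts.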

Abstract

In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly…
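
The pruning step described in the abstract can be illustrated with a minimal sketch. It assumes that each token carries a single importance score (for instance, the attention it receives at the chosen layer, averaged over heads and query positions) and that the lowest-scoring image tokens are dropped. That ranking criterion, the function name `select_tokens`, and all parameters are hypothetical illustrations, not the authors' implementation.

```python
def select_tokens(scores, image_start, image_end, prune_ratio):
    """Return the sorted indices of tokens kept after pruning image tokens.

    scores[i] is a hypothetical importance score for token i.
    Image tokens occupy positions [image_start, image_end); all other
    tokens (system prompt, user text) are always kept.
    """
    image_idx = list(range(image_start, image_end))
    n_keep = len(image_idx) - int(len(image_idx) * prune_ratio)
    # Rank image tokens by score, highest first, and keep the top n_keep.
    ranked = sorted(image_idx, key=lambda i: scores[i], reverse=True)
    kept_image = set(ranked[:n_keep])
    # Preserve the original token order in the output.
    return [i for i in range(len(scores))
            if not (image_start <= i < image_end) or i in kept_image]


# Toy sequence: token 0 is text, tokens 1-4 are image tokens, token 5 is text.
scores = [0.9, 0.1, 0.4, 0.05, 0.3, 0.8]
keep = select_tokens(scores, image_start=1, image_end=5, prune_ratio=0.5)
print(keep)  # -> [0, 2, 4, 5]
```

In a real model the kept indices would be used to slice the hidden states (and position information) passed to the following layers, so that every later layer processes a shorter sequence.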

Code & Models

Repositories

pkunlp-icler/fastv
jax · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI

Methods: Pruning