FoPru: Focal Pruning for Efficient Large Vision-Language Models

Lei Jiang; Weizhe Huang; Tongxuan Liu; Yuting Zeng; Jing Li; Lechao Cheng; Xiaohua Xu

arXiv:2411.14164·cs.CV·November 22, 2024

Open Access

TL;DR

FoPru is a training-free token pruning method for large vision-language models that significantly improves inference efficiency by removing redundant visual tokens without sacrificing accuracy.

Contribution

We introduce FoPru, a novel attention-based token pruning technique that enhances LVLM inference efficiency without additional training.

Findings

01 Reduces visual tokens substantially while maintaining accuracy.

02 Improves inference speed across various LVLMs and datasets.

03 Offers two pruning strategies: rank and row.

Abstract

Large Vision-Language Models (LVLMs) represent a significant advancement toward achieving superior multimodal capabilities by enabling powerful Large Language Models (LLMs) to understand visual input. Typically, LVLMs utilize visual encoders, such as CLIP, to transform images into visual tokens, which are then aligned with textual tokens through projection layers before being input into the LLM for inference. Although existing LVLMs have achieved significant success, their inference efficiency is still limited by the substantial number of visual tokens and the potential redundancy among them. To mitigate this issue, we propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Specifically, we introduce two alternative pruning strategies: 1) the rank strategy, which leverages all token…
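The core idea in the abstract, scoring visual tokens by attention from the vision encoder and dropping the less significant ones before they reach the LLM, can be illustrated with a minimal sketch. This is not the paper's rank or row strategy, whose details are cut off above; it is a generic top-k selection under stated assumptions. The function names, the `keep_ratio` parameter, and the choice to average attention over heads and query positions are all hypothetical.

```python
import numpy as np


def attention_significance(attn):
    """Per-token score from an attention map of shape (heads, N, N).

    ASSUMPTION: significance is taken as the mean attention each token
    receives, averaged over heads and query positions. The paper may
    define significance differently.
    """
    attn = np.asarray(attn, dtype=float)
    return attn.mean(axis=(0, 1))


def prune_visual_tokens(tokens, significance, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens, preserving original order.

    tokens:       (N, D) array of visual token embeddings.
    significance: (N,) array of per-token scores.
    keep_ratio:   hypothetical hyperparameter, fraction of tokens kept.
    """
    tokens = np.asarray(tokens)
    significance = np.asarray(significance, dtype=float)
    n = tokens.shape[0]
    if significance.shape != (n,):
        raise ValueError("significance must have one score per token")

    # Clamp so that at least one and at most all tokens are kept.
    n_keep = min(n, max(1, int(round(keep_ratio * n))))

    # Indices of the n_keep largest scores (stable sort for determinism).
    top = np.argsort(-significance, kind="stable")[:n_keep]

    # Restore spatial order so the kept tokens stay in image order.
    keep_idx = np.sort(top)
    return tokens[keep_idx], keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))      # 16 toy visual tokens
    attn = rng.random(size=(4, 16, 16))    # 4 toy attention heads
    kept, idx = prune_visual_tokens(tokens, attention_significance(attn))
    print(kept.shape, idx)
```

Sorting the selected indices back into ascending order keeps the surviving tokens in their original image order, which matters if positional information is carried by token position in the sequence.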

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Pruning · Contrastive Language-Image Pre-training