AdaFV: Rethinking of Visual-Language alignment for VLM acceleration

Jiayi Han; Liang Du; Yiwen Wu; Xiangguo Zhou; Hongwei Du; Weibo Zheng

arXiv:2501.09532·cs.CV·February 4, 2025

Open Access

TL;DR

This paper introduces AdaFV, a novel method for accelerating vision-language models by dynamically selecting visual tokens based on saliency and text-image similarity, improving efficiency without extra training.

Contribution

The paper proposes a self-adaptive cross-modality attention mechanism that filters visual tokens, improving VLM efficiency without additional training cost.
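To make the idea of selecting visual tokens by saliency and text-image similarity concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the function name `select_visual_tokens`, the min-max normalization, and the fixed mixing weight `alpha` are all assumptions chosen for illustration, whereas the paper describes its mechanism as self-adaptive. The sketch scores each visual token by a weighted sum of a given saliency value and its cosine similarity to a text embedding, then keeps the top-scoring tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_visual_tokens(tokens, saliency, text_embedding, keep, alpha=0.5):
    """Return the indices of the `keep` highest-scoring visual tokens, in original order.

    Hypothetical scoring rule (an assumption, not the paper's method):
    score = alpha * normalized saliency + (1 - alpha) * normalized text similarity.
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    similarity = [cosine(t, text_embedding) for t in tokens]
    sal_norm = min_max(saliency)
    sim_norm = min_max(similarity)
    combined = [alpha * s + (1.0 - alpha) * r for s, r in zip(sal_norm, sim_norm)]

    # Rank by combined score, then restore positional order for the kept tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])
```

Setting `alpha=1.0` reduces this to purely visual, text-agnostic selection, while `alpha=0.0` keeps only the tokens most similar to the text; intermediate values trade the two cues off against each other.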

Findings

01. Achieves state-of-the-art training-free VLM acceleration.

02. Effectively reduces visual tokens while maintaining accuracy.

03. Performs well at high token reduction rates.

Abstract

The success of VLMs often relies on a dynamic high-resolution scheme that adaptively augments the input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve efficiency without introducing extra training costs, many works propose reducing the visual tokens by filtering out uninformative ones or aggregating their information. Some approaches reduce visual tokens according to the self-attention of the VLM, which is biased and leads to inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are not salient in the image. In this…

Taxonomy

Topics: Neural Networks and Reservoir Computing · Advanced Fiber Optic Sensors · Retinal Imaging and Analysis

Methods: Softmax · Attention Is All You Need · Focus