Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

Yuxin Wen; Qingqing Cao; Qichen Fu; Sachin Mehta; Mahyar Najibi

arXiv:2410.14072 · cs.CV · October 21, 2024

TL;DR

This paper introduces Victor, a method that summarizes visual tokens into a small set of register tokens, significantly reducing computational costs in vision-language models with minimal accuracy loss.

Contribution

Victor provides an efficient way to reduce visual tokens in VLMs by summarizing them into learnable registers, improving speed with minimal performance impact.

Findings

01 Reduces training time by 43%

02 Increases inference throughput by 3.3x

03 Maintains accuracy with less than 4% drop

Abstract

Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In widely used fully autoregressive transformer-based models such as LLaVA, projected visual tokens are prepended to textual tokens. Visual tokens often far outnumber prompt tokens, which increases computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of VLMs. After these few layers, all visual tokens are discarded, significantly…
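The token flow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: tokens are plain strings standing in for embeddings, each "layer" is an identity function standing in for a transformer block, and the function name `victor_style_forward`, the parameter `num_summary_layers`, and all sizes (576 visual tokens, 16 registers, 32 text tokens, 8 layers, 2 summary layers) are assumptions chosen only for illustration. It shows only how the sequence length changes once the visual tokens are dropped.

```python
from typing import Callable, List, Sequence, Tuple

Token = str                                # stand-in for an embedding vector
Layer = Callable[[List[Token]], List[Token]]


def victor_style_forward(
    visual: Sequence[Token],
    registers: Sequence[Token],
    text: Sequence[Token],
    layers: Sequence[Layer],
    num_summary_layers: int,
) -> Tuple[List[Token], List[int]]:
    """Toy sketch of the register idea (hypothetical helper).

    Assumes every layer preserves sequence length and token order.
    Returns the final sequence and the input length seen by each layer.
    """
    # Registers are placed after the visual tokens, text comes last.
    seq = list(visual) + list(registers) + list(text)
    lengths: List[int] = []
    for i, layer in enumerate(layers, start=1):
        lengths.append(len(seq))
        seq = layer(seq)
        if i == num_summary_layers:
            # After the summary layers, discard all visual tokens and
            # keep only the registers and the text tokens.
            seq = seq[len(visual):]
    return seq, lengths


# Illustrative sizes only; none of these numbers come from the paper.
visual = [f"v{i}" for i in range(576)]
registers = [f"r{i}" for i in range(16)]
text = [f"t{i}" for i in range(32)]
identity: Layer = lambda s: list(s)        # placeholder for a real block
layers = [identity] * 8

out, lengths = victor_style_forward(visual, registers, text, layers, 2)
baseline_token_layers = (len(visual) + len(registers) + len(text)) * len(layers)
reduced_token_layers = sum(lengths)
print(lengths)
print(reduced_token_layers, "vs", baseline_token_layers)
```

In this toy setting only the first two layers process the full sequence; the remaining layers see the registers and text alone. The token-layer count is a rough proxy for work and does not reproduce the training-time or throughput figures reported for the paper.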

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training