FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Yuke Zhu; Chi Xie; Shuang Liang; Bo Zheng; Sheng Guo

arXiv:2411.14228·cs.CV·November 22, 2024

TL;DR

FocusLLaVA introduces a coarse-to-fine visual token compression method that reduces computational costs while enhancing performance in multi-modal large language models by intelligently selecting and compressing visual information.

Contribution

The paper proposes a coarse-to-fine visual token compression approach that combines a vision-guided sampler with a text-guided sampler, improving both efficiency and performance.

Findings

01. Significant reduction in visual token input size.

02. Improved model performance on various datasets.

03. Enhanced efficiency without performance trade-offs.

Abstract

Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens input into LLMs, resulting in significant computational costs. Current work develops visual token compression methods to achieve efficiency improvements, often at the expense of performance. We argue that removing visual redundancy can simultaneously improve both efficiency and performance. We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions. With these two modules, the proposed FocusLLaVA achieves improvements in both…
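The two ideas in the abstract can be illustrated with a small sketch. The first function shows why the token count grows quadratically with image side length when a square image is cut into fixed-size patches. The second shows one generic way to keep the visual tokens most related to a text query, using top-k cosine similarity. The abstract does not say how the samplers are implemented, so the patch size of 14, the function names, and the cosine top-k rule are all assumptions made for illustration, not the FocusLLaVA method.

```python
import math


def num_visual_tokens(image_side: int, patch_side: int = 14) -> int:
    """Count tokens when a square image is split into square patches.

    patch_side=14 is an assumed, commonly used patch size.
    """
    per_side = image_side // patch_side
    return per_side * per_side


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_text_relevant_tokens(visual_tokens, text_embedding, keep):
    """Return indices of the `keep` tokens most similar to the text embedding.

    Hypothetical stand-in for a text-guided sampler: plain top-k by cosine
    similarity. Indices are returned in their original spatial order.
    """
    scores = [cosine(tok, text_embedding) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Doubling the side length quadruples the token count.
print(num_visual_tokens(336), num_visual_tokens(672))  # 576 2304

# Toy 2-D embeddings; real visual tokens have far more dimensions.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]
print(select_text_relevant_tokens(tokens, query, keep=2))  # [0, 2]
```

Returning the kept indices in their original order preserves the spatial sequence of the surviving tokens, which is one plausible design choice when the tokens are later fed to a language model.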

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image and Video Stabilization