HiMix: Reducing Computational Complexity in Large Vision-Language Models

Xuange Zhang; Dengjie Li; Bo Liu; Zenghao Bao; Yao Zhou; Baisong Yang; Zhongying Liu; Yujie Zhong; Zheng Zhao; Tongtong Yuan

arXiv:2501.10318 · cs.CV · January 20, 2025

TL;DR

HiMix introduces a hierarchical vision-language interaction mechanism in which only the language sequence undergoes full forward propagation and the vision sequence interacts with it at specific stages, reducing the language decoder's computational cost by 10x while keeping performance comparable to full models.

Contribution

The paper proposes HiMix (Hierarchical Vision injection for Mixture Attention), a hierarchical vision-language interaction method that reduces computational complexity in LVLMs by having vision features interact with language features only at specific stages of the language decoder.

Findings

01. Achieves 10x reduction in language decoder computational cost

02. Maintains comparable performance to full models

03. Provides a new perspective on efficient vision-language modeling

Abstract

Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, their excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. We then propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages…
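
The abstract is truncated here, so the following is only a toy illustration of the idea it states: language tokens are updated at every decoder layer, while vision tokens are never propagated and are only attended to at selected layers. Everything specific in the sketch is an assumption made for illustration and is not taken from the paper: the names `attend` and `decode`, the choice of `inject_layers`, and the omission of learned projections, causal masking, and feed-forward blocks.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head dot-product attention with no learned projections (assumption)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * row[j] for w, row in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def decode(lang, vision, num_layers=4, inject_layers=(0, 2)):
    """Update only the language tokens; vision tokens stay fixed.

    inject_layers is a hypothetical choice of the "specific stages" at which
    language queries also attend to the vision tokens.
    """
    hidden = [row[:] for row in lang]  # copy so the caller's input is untouched
    for layer in range(num_layers):
        # Vision tokens join the keys/values only at the injection layers.
        context = vision + hidden if layer in inject_layers else hidden
        update = attend(hidden, context)
        # Residual connection on the language stream only.
        hidden = [[h + u for h, u in zip(hr, ur)] for hr, ur in zip(hidden, update)]
    return hidden


lang = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]               # 3 language tokens
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 vision tokens
out = decode(lang, vision)
print(len(out), len(out[0]))  # prints "3 2"
```

In this sketch the per-layer work scales with the number of language tokens, because vision tokens never act as queries and never pass through a layer themselves. That is the intuition behind skipping full forward propagation for the vision sequence; it does not reproduce the cost figures reported for HiMix.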

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need