Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Zeliang Zhang; Phu Pham; Wentian Zhao; Kun Wan; Yu-Jhe Li; Jianing Zhou; Daniel Miranda; Ajinkya Kale; Chenliang Xu

arXiv:2410.06169·cs.CV·December 3, 2024

TL;DR

This paper identifies redundancy in visual token processing within MLLMs and proposes strategies to significantly reduce computational costs while maintaining performance, enabling more scalable multimodal models.

Contribution

The study introduces novel efficiency strategies for MLLMs that cut computational demands by 88%, validated across multiple models and benchmarks.

Findings

01. 88% reduction in computational demands
02. Visual redundancy exists in multiple MLLMs
03. Maintains performance after pruning and layer dropping

Abstract

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention,…
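To make the quadratic-cost argument concrete, the sketch below compares the number of token pairs scored by full self-attention over a grid of visual tokens with the number scored when each token attends only to its spatial neighbors. It is an illustration under stated assumptions, not the paper's implementation: the abstract is truncated here and does not define "neighbor-aware" attention, so the square-window rule, the `radius` parameter, the function names `neighbor_mask` and `attended_pairs`, and the 24 x 24 grid size are all hypothetical choices made for this example.

```python
def neighbor_mask(height, width, radius):
    """Build a boolean attention mask for a height x width grid of visual tokens.

    Tokens are indexed in row-major order. mask[i][j] is True when token j lies
    within `radius` grid steps (Chebyshev distance) of token i.
    ASSUMPTION: a square window is one plausible reading of "neighbor-aware";
    the source text does not specify the actual rule.
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, width)
        for j in range(n):
            rj, cj = divmod(j, width)
            if max(abs(ri - rj), abs(ci - cj)) <= radius:
                mask[i][j] = True
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Illustrative grid size only (hypothetical, not taken from the source).
grid = 24
num_tokens = grid * grid
full_pairs = num_tokens * num_tokens  # full attention grows as N^2
local_pairs = attended_pairs(neighbor_mask(grid, grid, radius=1))
print(f"full: {full_pairs}, neighbor-only: {local_pairs}, "
      f"ratio: {local_pairs / full_pairs:.4f}")
```

With a fixed window, the number of scored pairs grows roughly linearly in the number of visual tokens rather than quadratically, which is the kind of saving the abstract's efficiency argument relies on. The figure printed by this sketch is not comparable to the 88% reduction reported above, which covers the authors' full suite of strategies.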

Code & Models

Repositories

ZhangAIPI/YOPO_MLLM_Pruning (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · Pruning · Focus