DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

Linli Yao; Lei Li; Shuhuai Ren; Lean Wang; Yuanxin Liu; Xu Sun; Lu Hou

arXiv:2405.20985 · cs.CV · June 3, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces DeCo, a method that decouples visual token compression from semantic abstraction in multimodal large language models, leading to improved efficiency and performance by simplifying the visual processing pipeline.

Contribution

The paper proposes a novel approach called DeCo that separates visual token compression from semantic abstraction, enhancing training efficiency and model performance.
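To make the idea of compressing visual tokens without any learned semantic abstraction concrete, here is a minimal sketch. It assumes the compressor is a parameter-free 2D average pool over the patch grid; this summary does not specify the operator, and the names `pool_patch_grid`, `patches`, and `tokens` are hypothetical. The sketch only reduces the token count at the patch level and leaves any semantic interpretation to the language model.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] is one patch embedding


def pool_patch_grid(grid: Grid, out_size: int) -> Grid:
    """Average-pool an H x W grid of patch embeddings to out_size x out_size.

    Assumes H and W are both divisible by out_size, so each output token is
    the mean of one non-overlapping window of neighbouring patches.
    """
    height, width = len(grid), len(grid[0])
    if height % out_size or width % out_size:
        raise ValueError("grid size must be divisible by out_size")
    win_h, win_w = height // out_size, width // out_size
    dim = len(grid[0][0])

    pooled: Grid = []
    for out_r in range(out_size):
        row: List[Vector] = []
        for out_c in range(out_size):
            acc = [0.0] * dim
            # Sum every patch embedding that falls inside this window.
            for r in range(out_r * win_h, (out_r + 1) * win_h):
                for c in range(out_c * win_w, (out_c + 1) * win_w):
                    for d in range(dim):
                        acc[d] += grid[r][c][d]
            row.append([v / (win_h * win_w) for v in acc])
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 grid of 2-dimensional patch embeddings.
patches = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
tokens = pool_patch_grid(patches, out_size=2)
num_tokens = sum(len(row) for row in tokens)  # 16 patches become 4 tokens
```

Because the pooling has no trainable weights, the spatial layout of the patches is preserved and only a simple projection after it would need training, which is consistent with the reduced parameter count the page reports.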

Findings

01. DeCo outperforms traditional projectors in accuracy and efficiency.

02. DeCo achieves up to 7.1% performance improvement on benchmarks.

03. DeCo reduces trainable parameters and speeds up convergence.

Abstract

The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, is a crucial component of MLLMs. However, measuring how effectively projectors align vision and language remains under-explored; at present it can only be inferred from the performance of MLLMs on downstream tasks. Motivated by this problem, this study examines the projector module by interpreting the vision-language semantic flow within MLLMs. Specifically, we trace the semantic relevance flow back from generated language tokens to raw visual encoder patches and to the intermediate outputs produced by projectors. Our findings reveal that compressive projectors (e.g., QFormer) abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a 'double abstraction' phenomenon. This involves a first visual semantic…

Code & Models

Repositories

yaolinli/deco (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training