TokenPacker: Efficient Visual Projector for Multimodal LLM

Wentong Li; Yuqian Yuan; Jian Liu; Dongqi Tang; Song Wang; Jie Qin; Jianke Zhu; Lei Zhang

arXiv:2407.02392·cs.CV·August 29, 2024

TL;DR

TokenPacker introduces a coarse-to-fine visual projection method that significantly reduces visual token redundancy in multimodal LLMs, enhancing efficiency without sacrificing reasoning performance.

Contribution

It proposes a novel coarse-to-fine visual projector that condenses visual tokens by 75-89% using multi-level region cues, improving efficiency while maintaining or improving performance.
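The 75-89% range can be checked with simple arithmetic. The sketch below assumes a 24 × 24 grid of 576 visual tokens (as a ViT-L/14 encoder produces at 336-pixel input) and assumes each spatial side is shrunk by an integer factor; neither the grid size nor the function names come from this page, so treat them as illustrative.

```python
def condensed_token_count(grid_side: int, scale: int) -> int:
    """Number of tokens left after shrinking each side of the grid by `scale`."""
    if grid_side % scale != 0:
        raise ValueError("grid_side must be divisible by scale")
    return (grid_side // scale) ** 2


def reduction_ratio(grid_side: int, scale: int) -> float:
    """Fraction of the original tokens removed by the downsampling."""
    full = grid_side ** 2
    return 1.0 - condensed_token_count(grid_side, scale) / full


# Assumed 24 x 24 = 576 input tokens (hypothetical setting, see lead-in).
for scale in (2, 3):
    kept = condensed_token_count(24, scale)
    print(f"scale {scale}: {kept} tokens kept, "
          f"{reduction_ratio(24, scale):.1%} removed")
```

Under these assumptions a factor of 2 keeps 144 tokens (75% removed) and a factor of 3 keeps 64 tokens (about 89% removed), which matches the range stated above.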

Findings

01. Reduces visual tokens by up to 89%.

02. Achieves comparable or better performance on benchmarks.

03. Enhances MLLM efficiency significantly.

Abstract

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP that preserves all visual context via a one-to-one transformation. However, the visual tokens are redundant, and their number grows considerably with high-resolution images, which significantly impairs the efficiency of MLLMs. Some recent works have introduced a resampler or an abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics and generate condensed visual tokens. Specifically, we first interpolate the visual features as a low-resolution point query, providing the overall…
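The first step named in the abstract, turning the visual feature grid into a low-resolution set of point queries, might look like the sketch below. It is an illustration only: block averaging stands in for whatever interpolation the authors actually use, the function name and the `[height][width][channels]` layout are hypothetical, and the later steps are not sketched because the abstract is truncated here. Plain Python lists are used so the example runs without dependencies.

```python
from typing import List

Grid = List[List[List[float]]]  # layout: [height][width][channels]


def downsample_to_point_queries(features: Grid, scale: int) -> Grid:
    """Average each scale x scale block of features into one coarse query.

    Block averaging is an assumed stand-in for the interpolation step.
    """
    height, width = len(features), len(features[0])
    if height % scale or width % scale:
        raise ValueError("grid size must be divisible by scale")
    channels = len(features[0][0])
    queries: Grid = []
    for top in range(0, height, scale):
        row = []
        for left in range(0, width, scale):
            total = [0.0] * channels
            # Sum every feature vector inside the current block.
            for y in range(top, top + scale):
                for x in range(left, left + scale):
                    for c in range(channels):
                        total[c] += features[y][x][c]
            row.append([v / (scale * scale) for v in total])
        queries.append(row)
    return queries


# Toy 4 x 4 grid with one channel holding the values 0..15.
features = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
queries = downsample_to_point_queries(features, scale=2)
print(queries)
```

Each coarse query summarizes one region of the original grid, so a 4 × 4 grid with `scale=2` yields a 2 × 2 grid of queries, a quarter of the original token count.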

Code & Models

Repositories

circleradon/tokenpacker (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies