DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Lingchen Meng; Jianwei Yang; Rui Tian; Xiyang Dai; Zuxuan Wu; Jianfeng Gao; Yu-Gang Jiang

arXiv:2406.04334·cs.CV·June 7, 2024

TL;DR

DeepStack introduces a simple yet effective method of stacking visual tokens across transformer layers in large multimodal models, significantly improving performance with minimal additional computational cost.

Contribution

The paper proposes DeepStack, a novel architecture that stacks visual tokens into multiple groups across layers, enhancing interaction modeling in LMMs with minimal extra cost.

Findings

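01. With the same context length, DeepStack improves LMM performance by 2.7 to 2.9 points on average across 9 benchmarks.

02. Using only one-fifth of the context length, DeepStack achieves results comparable to counterparts that use the full context length.

03. The gains are most pronounced on high-resolution tasks such as TextVQA, DocVQA, and InfoVQA.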

Abstract

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering $N$ layers in the language and vision transformers of LMMs, we stack the visual tokens into $N$ groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same…
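
The stacking idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`split_into_groups`, `add_tokens`, `deepstack_forward`), the strided grouping, and the choice to merge each later group by element-wise addition at the visual token positions are all assumptions made for illustration, and the "layers" are arbitrary callables rather than real transformer blocks. What it shows is that only one group occupies sequence positions, so the context length stays at one group plus the text tokens.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]
Layer = Callable[[Tokens], Tokens]


def split_into_groups(visual_tokens: Tokens, num_groups: int) -> List[Tokens]:
    """Split visual tokens into num_groups equal groups by strided sampling.

    Strided sampling is an illustrative assumption, not taken from the paper.
    """
    if num_groups <= 0 or len(visual_tokens) % num_groups != 0:
        raise ValueError("token count must be divisible by the number of groups")
    return [visual_tokens[g::num_groups] for g in range(num_groups)]


def add_tokens(hidden: Tokens, extra: Tokens) -> Tokens:
    """Element-wise sum of two equally shaped token lists."""
    return [[h + e for h, e in zip(hv, ev)] for hv, ev in zip(hidden, extra)]


def deepstack_forward(visual_tokens: Tokens, text_tokens: Tokens,
                      layers: List[Layer]) -> Tokens:
    """Feed one group of visual tokens to each layer, from bottom to top."""
    groups = split_into_groups(visual_tokens, len(layers))
    num_visual = len(groups[0])

    # Only the first group is placed in the input sequence.
    hidden = groups[0] + text_tokens

    for i, layer in enumerate(layers):
        if i > 0:
            # Assumption: later groups are merged by residual addition at the
            # visual positions, so the sequence length never grows.
            hidden = add_tokens(hidden[:num_visual], groups[i]) + hidden[num_visual:]
        hidden = layer(hidden)
    return hidden


if __name__ == "__main__":
    visual = [[float(i), 0.0] for i in range(8)]   # 8 toy visual tokens
    text = [[100.0, 100.0]]                        # 1 toy text token
    identity_layers = [lambda h: h] * 4            # stand-ins for 4 layers
    out = deepstack_forward(visual, text, identity_layers)
    print(len(out))  # 3 positions instead of 9 if all tokens were fed at the input
```

With 8 visual tokens and 4 layers, the sequence holds 2 visual positions plus the text token, while all 8 visual tokens still contribute to the hidden states.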

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer