DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang

TL;DR
DeepStack introduces a simple yet effective method of stacking visual tokens across transformer layers in large multimodal models, significantly improving performance with minimal additional computational cost.
Contribution
The paper proposes DeepStack, a novel architecture that stacks visual tokens into multiple groups across layers, enhancing interaction modeling in LMMs with minimal extra cost.
Findings
DeepStack improves LMM performance by 2.7 to 2.9 points on average across 9 benchmarks.
It achieves comparable results with only one-fifth of the context length.
It yields significant gains on high-resolution tasks such as TextVQA, DocVQA, and InfoVQA.
Abstract
Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering the layers in the language and vision transformers of LMMs, we stack the visual tokens into groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same…
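The stacking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the first group of visual tokens enters as the usual input sequence and that each later group is added residually to the visual-token positions just before its aligned layer; the abstract only says each group is fed to its aligned layer, so the residual addition, the strided grouping, and the names `split_into_groups` and `deepstack_forward` are all hypothetical choices made for illustration. Layers are plain Python callables standing in for transformer blocks.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]                 # one feature vector per token
Layer = Callable[[Tokens], Tokens]    # stand-in for a transformer block


def split_into_groups(visual_tokens: Tokens, num_groups: int) -> List[Tokens]:
    """Strided split (an illustrative assumption): group g gets tokens g, g+G, g+2G, ..."""
    if num_groups < 1 or len(visual_tokens) % num_groups != 0:
        raise ValueError("token count must be divisible by num_groups")
    return [visual_tokens[g::num_groups] for g in range(num_groups)]


def add_tokens(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum of two equally shaped token lists."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


def deepstack_forward(
    text_tokens: Tokens,
    visual_tokens: Tokens,
    layers: List[Layer],
    num_groups: int,
) -> Tokens:
    """Feed visual-token groups to successive layers from bottom to top."""
    if num_groups > len(layers):
        raise ValueError("need at least one layer per group")
    groups = split_into_groups(visual_tokens, num_groups)
    n_vis = len(groups[0])

    # Group 0 is the only group that occupies positions in the input sequence.
    hidden = groups[0] + text_tokens

    for i, layer in enumerate(layers):
        if 1 <= i < num_groups:
            # Assumed injection rule: add group i onto the visual positions.
            hidden = add_tokens(hidden[:n_vis], groups[i]) + hidden[n_vis:]
        hidden = layer(hidden)
    return hidden


if __name__ == "__main__":
    identity: Layer = lambda h: h
    out = deepstack_forward(
        text_tokens=[[10.0]],
        visual_tokens=[[1.0], [2.0], [3.0], [4.0]],
        layers=[identity, identity],
        num_groups=2,
    )
    print(len(out), out)
```

In this sketch the sequence seen by every layer has length N / G plus the text length, where N is the number of visual tokens and G the number of groups, rather than N plus the text length. That is one way to read how stacking keeps the context short while still exposing all visual tokens to the model.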
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer
