Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling

Georgios Pantazopoulos; Malvina Nikandrou; Alessandro Suglia; Oliver Lemon; Arash Eshghi

arXiv:2409.05395·cs.CV·October 2, 2024

TL;DR

This paper compares Transformers and structured state space models (Mamba) in vision and language tasks, finding Mamba excels in captioning and comprehension but lags in visual grounding and retrieval, highlighting task-dependent strengths.

Contribution

It replaces the Transformer backbone of VLMs with Mamba, a structured state space model, and systematically compares the two architectures across multiple tasks under controlled conditions at scales up to 3B parameters.

Findings

1. Mamba outperforms Transformers in captioning, question answering, and reading comprehension.

2. Transformers perform better in visual grounding and in-context multimodal retrieval.

3. Task-aware encoding has minimal impact on grounding performance.

Abstract

This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly…

Code & Models

Repositories

gpantaz/vl_mamba (PyTorch, official)

Videos

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling · underline

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces