Mamba Fusion: Learning Actions Through Questioning

Zhikang Dong; Apoorva Beedu; Jason Sheinkopf; Irfan Essa

arXiv:2409.11513·cs.CV·February 3, 2025


TL;DR

MambaVL is a new vision-language model that efficiently captures long-range dependencies and improves action recognition and anticipation by using selective modality fusion and question-guided learning.

Contribution

It introduces MambaVL, a model leveraging state space modality fusion and question-answering tasks to enhance action understanding in videos.

Findings

01. Achieves state-of-the-art action recognition on Epic-Kitchens-100

02. Outperforms baselines in action anticipation

03. Efficiently models long-range dependencies with reduced computational complexity

Abstract

Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges such as quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide…
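The idea of a state transition matrix shared across modalities can be illustrated with a small sketch. The code below is not the authors' implementation (see the official repository for that); it is a minimal, pure-Python selective state space scan written under assumptions. The names `selective_scan`, `make_params`, `A_shared`, `W_delta`, `W_B`, and `W_C`, the diagonal form of the transition, and the toy dimensions are all hypothetical choices made for illustration. What it shows is that one matrix `A_shared` drives the recurrence for both vision and language tokens, while the input-dependent step size and the input/output projections stay modality-specific, and that the cost grows linearly with sequence length because each token is visited once.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def matvec(w, x):
    # Plain matrix-vector product on nested lists.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def selective_scan(xs, A, params):
    """Diagonal selective SSM scan (illustrative sketch, assumed form).

    xs:     list of tokens, each a list of d floats
    A:      d x n transition parameters with negative entries (shared)
    params: modality-specific projections W_delta (d x d), W_B, W_C (n x d)
    """
    d, n = len(A), len(A[0])
    h = [[0.0] * n for _ in range(d)]  # hidden state, one row per channel
    ys = []
    for x in xs:  # one pass over the sequence: linear in its length
        # Input-dependent ("selective") step size and projections.
        delta = [softplus(v) for v in matvec(params["W_delta"], x)]
        B = matvec(params["W_B"], x)
        C = matvec(params["W_C"], x)
        for c in range(d):
            for j in range(n):
                decay = math.exp(delta[c] * A[c][j])  # discretized transition
                h[c][j] = decay * h[c][j] + delta[c] * B[j] * x[c]
        ys.append([sum(C[j] * h[c][j] for j in range(n)) for c in range(d)])
    return ys


def make_params(d, n, rng):
    # Hypothetical random initialization for one modality's projections.
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W_delta": rand(d, d), "W_B": rand(n, d), "W_C": rand(n, d)}


rng = random.Random(0)
d, n = 4, 3  # toy channel and state sizes (assumptions)

# One transition matrix shared by both modalities; negative entries keep
# the recurrence stable.
A_shared = [[-(j + 1.0) for j in range(n)] for _ in range(d)]

vision_params = make_params(d, n, rng)
language_params = make_params(d, n, rng)

vision_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]
language_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]

y_vis = selective_scan(vision_tokens, A_shared, vision_params)
y_lang = selective_scan(language_tokens, A_shared, language_params)
print(len(y_vis), len(y_lang))
```

Sharing `A_shared` means both token streams evolve under the same state dynamics, which is one plausible reading of how a joint representation could be encouraged; how the actual model fuses the two streams is described in the paper and repository rather than here.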


Code & Models

Repositories

dongzhikang/mambavl (PyTorch, official)


Taxonomy

Topics: Education and Technology Integration