Mamba Fusion: Learning Actions Through Questioning
Zhikang Dong, Apoorva Beedu, Jason Sheinkopf, Irfan Essa

TL;DR
MambaVL is a new vision-language model that efficiently captures long-range dependencies and improves action recognition and anticipation by using selective modality fusion and question-guided learning.
Contribution
It introduces MambaVL, a model leveraging state space modality fusion and question-answering tasks to enhance action understanding in videos.
Findings
Achieves state-of-the-art action recognition on Epic-Kitchens-100
Outperforms baselines in action anticipation
Efficiently models long-range dependencies with reduced computational complexity
Abstract
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide…
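To make the idea of a shared state transition matrix more concrete, here is a minimal NumPy sketch of a selective state space scan in which two modalities use the same matrix `A_shared` but separate input-dependent projections. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the matrix is shared or how the outputs are fused, and every name here (`selective_scan`, `A_shared`, `make_params`, the dimensions) is hypothetical. The discretization follows the common simplification of exp(Δ·A) for the state matrix and Δ·B for the input matrix.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus; keeps the step size delta positive.
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_delta, W_B, W_C):
    """Sequential selective SSM scan (illustrative, not the paper's code).

    x: (T, D) token features; A: (D, N) diagonal state matrix with
    negative entries; W_delta: (D, D); W_B, W_C: (D, N).
    Returns y: (T, D). Cost grows linearly with sequence length T.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.empty((T, D))
    for t in range(T):
        delta = softplus(x[t] @ W_delta)        # (D,) input-dependent step
        B_t = x[t] @ W_B                        # (N,) input-dependent B
        C_t = x[t] @ W_C                        # (N,) input-dependent C
        A_bar = np.exp(delta[:, None] * A)      # (D, N), entries in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]   # (D, N)
        h = A_bar * h + B_bar * x[t][:, None]   # state update
        y[t] = h @ C_t                          # (D,) readout
    return y

rng = np.random.default_rng(0)
D, N = 8, 4  # assumed feature and state sizes

# Assumption: one state matrix reused by both modalities.
A_shared = -np.exp(rng.normal(size=(D, N)))

def make_params():
    # Hypothetical per-modality projections for delta, B, and C.
    return (rng.normal(scale=0.1, size=(D, D)),
            rng.normal(scale=0.1, size=(D, N)),
            rng.normal(scale=0.1, size=(D, N)))

vision_params = make_params()
text_params = make_params()
vision_tokens = rng.normal(size=(16, D))  # e.g. 16 frame features
text_tokens = rng.normal(size=(6, D))     # e.g. 6 word embeddings

y_vision = selective_scan(vision_tokens, A_shared, *vision_params)
y_text = selective_scan(text_tokens, A_shared, *text_params)
```

Each token updates a fixed-size state once, so compute scales linearly with sequence length rather than quadratically as in full self-attention, which is the efficiency argument the abstract makes.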
Taxonomy
Topics: Education and Technology Integration
