Snakes and Ladders: Two Steps Up for VideoMamba
Hui Lu, Albert Ali Salah, Ronald Poppe

TL;DR
This paper introduces VideoMambaPro, an improved video understanding model that addresses limitations of Mamba in vision tasks, achieving higher accuracy and efficiency in video classification benchmarks.
Contribution
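It provides a theoretical analysis of how Mamba differs from self-attention, identifies two limitations in Mamba's token processing (historical decay and element contradiction), and proposes VideoMambaPro, which adds masked backward computation and elemental residual connections to a VideoMamba backbone and outperforms VideoMamba on Kinetics-400 and Something-Something V2.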
Findings
VideoMambaPro surpasses VideoMamba by 1.6-2.8% top-1 on Kinetics-400, depending on model size
VideoMambaPro exceeds VideoMamba by 1.1-1.9% top-1 on Something-Something V2, depending on model size
Models perform well even without extensive pre-training
Abstract
Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. Differently sized VideoMambaPro models surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1 on Kinetics-400 and Something-Something V2, respectively. Even without…
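The contrast the abstract draws between self-attention and Mamba can be made concrete with a toy example. The sketch below is not the paper's formulation: it assumes a scalar, time-invariant linear recurrence (h_t = a*h_{t-1} + b*x_t, y_t = c*h_t) in place of Mamba's input-dependent parameters, and the function names are hypothetical. Under those assumptions, unrolling the recurrence gives a lower-triangular token-mixing matrix whose entries shrink geometrically with distance when |a| < 1, which is one plausible reading of what the authors call historical decay.

```python
import numpy as np


def ssm_token_weights(a: float, b: float, c: float, length: int) -> np.ndarray:
    """Unrolled mixing matrix M of a scalar linear recurrence (an assumption,
    not Mamba's selective scan): M[i, j] = c * a**(i - j) * b for j <= i, else 0."""
    idx = np.arange(length)
    diff = idx[:, None] - idx[None, :]          # i - j for every token pair
    causal = diff >= 0                          # token i only sees tokens j <= i
    powers = float(a) ** np.where(causal, diff, 0)
    return np.where(causal, c * powers * b, 0.0)


def ssm_scan(a: float, b: float, c: float, x: np.ndarray) -> np.ndarray:
    """Run the same recurrence step by step, for comparison with M @ x."""
    h = 0.0
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t                     # state update
        y[t] = c * h                            # readout
    return y


if __name__ == "__main__":
    M = ssm_token_weights(a=0.5, b=1.0, c=1.0, length=4)
    # Last row: weight of each earlier token on the final output.
    print(M[-1])
```

In this toy setting the last row of the matrix shows that the first token contributes far less to the final output than the most recent one, whereas a self-attention matrix has no such built-in structure and its weights depend only on the learned query-key similarity.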
Taxonomy
Topics: Cinema and Media Studies
