Snakes and Ladders: Two Steps Up for VideoMamba

Hui Lu; Albert Ali Salah; Ronald Poppe

arXiv:2406.19006·cs.CV·November 14, 2024

TL;DR

This paper introduces VideoMambaPro, an improved video understanding model that addresses limitations of Mamba in vision tasks, achieving higher accuracy and efficiency in video classification benchmarks.

Contribution

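It provides a theoretical analysis of two limitations in Mamba's token processing, historical decay and element contradiction, and proposes VideoMambaPro, which adds masked backward computation and elemental residual connections to a VideoMamba backbone and outperforms VideoMamba on Kinetics-400 and Something-Something V2.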

Findings

01 VideoMambaPro surpasses VideoMamba by 1.6-2.8% top-1 on Kinetics-400

02 VideoMambaPro exceeds VideoMamba by 1.1-1.9% top-1 on Something-Something V2

03 Models perform well even without extensive pre-training

Abstract

Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative to transformers. However, Mamba's successes do not trivially extend to vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP), which addresses the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. Differently sized VideoMambaPro models surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1 on Kinetics-400 and Something-Something V2, respectively. Even without…
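
The abstract names masked backward computation without describing it. The toy sketch below shows one possible reading, offered as an assumption rather than the authors' implementation: if a forward (causal) pass and a backward (anti-causal) pass are summed, each token's interaction with itself is counted twice unless the diagonal of the backward pass is masked. The function names, the shared `scores` matrix, and the plain-list representation are all hypothetical and chosen only for illustration; in an actual Mamba block the two passes would come from separately parameterized state space scans.

```python
# Toy illustration only: `scores[i][j]` stands in for how strongly token i
# mixes in token j. This is NOT the VideoMambaPro code; all names are
# hypothetical and the single shared score matrix is a simplifying assumption.

def forward_matrix(scores):
    """Causal pass: token i sees tokens j <= i (lower triangle incl. diagonal)."""
    n = len(scores)
    return [[scores[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def backward_matrix(scores, mask_diagonal):
    """Anti-causal pass: token i sees tokens j >= i.

    With mask_diagonal=True the self-interaction (j == i) is zeroed so that it
    is contributed by the forward pass only.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            keep = j > i or (j == i and not mask_diagonal)
            row.append(scores[i][j] if keep else 0.0)
        out.append(row)
    return out


def bidirectional(scores, mask_diagonal=True):
    """Sum of the forward and backward passes."""
    fwd = forward_matrix(scores)
    bwd = backward_matrix(scores, mask_diagonal)
    n = len(scores)
    return [[fwd[i][j] + bwd[i][j] for j in range(n)] for i in range(n)]


scores = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]

naive = bidirectional(scores, mask_diagonal=False)   # diagonal counted twice
masked = bidirectional(scores, mask_diagonal=True)   # diagonal counted once
```

In this toy setting the unmasked sum doubles every diagonal entry, while the masked sum reproduces `scores` exactly, which is the double-counting effect the mask is assumed to remove.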

Code & Models

Repositories

hotfinda/videomambapro
PyTorch · Official

Taxonomy

Topics: Cinema and Media Studies