Long Movie Clip Classification with State-Space Video Models

Md Mohaiminul Islam; Gedas Bertasius

arXiv:2204.01692 · cs.CV · January 5, 2023

TL;DR

ViS4mer is an efficient long-range video model that combines Transformer and S4 layers. It classifies long movie clips with less computation and GPU memory than pure self-attention models, and achieves state-of-the-art results on 6 of 9 long video classification tasks.

Contribution

The paper introduces ViS4mer, a novel model that integrates Transformer and structured state-space layers for scalable long-range video understanding.

Findings

01 ViS4mer is 2.63x faster than the corresponding pure self-attention model.

02 ViS4mer requires 8x less GPU memory than that model.

03 ViS4mer achieves state-of-the-art results on 6 of 9 long video classification tasks.

Abstract

Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer. Our model uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. By progressively reducing the spatiotemporal feature…
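The scaling argument in the abstract can be illustrated with a small sketch. It is not the ViS4mer or S4 implementation: `attention_score_count` and `ssm_scan` are hypothetical helper names, and the scalar recurrence is a toy stand-in for a state-space layer, chosen only to show that its work grows linearly with sequence length while a full self-attention map grows quadratically.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of pairwise scores in one full self-attention map (seq_len x seq_len)."""
    return seq_len * seq_len


def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space recurrence (an assumption, not S4's parameterization):
    x_k = a * x_{k-1} + b * u_k,  y_k = c * x_k.
    One constant-cost update per input, so total work is linear in len(inputs)."""
    state = 0.0
    outputs = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs


if __name__ == "__main__":
    # Doubling the sequence length quadruples the attention map size.
    for seq_len in (1_000, 2_000, 4_000):
        print(seq_len, attention_score_count(seq_len))
    # The recurrence emits one output per input token.
    print(len(ssm_scan([1.0] * 4_000)))
```

Under these assumptions, going from 1,000 to 4,000 tokens multiplies the attention score count by 16, while the recurrence performs only 4 times as many updates. This is the kind of gap that motivates using a state-space layer for the long-range part of the model.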

Code & Models

Repositories

md-mohaiminul/ViS4mer (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections