TL;DR
ViS4mer is an efficient long-range video model combining Transformer and S4 layers, enabling better long movie classification with less computation and memory, achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper introduces ViS4mer, a novel model that integrates Transformer and structured state-space layers for scalable long-range video understanding.
Findings
ViS4mer is 2.63x faster than the corresponding pure self-attention model.
It requires 8x less GPU memory than that model.
It achieves state-of-the-art results on 6 out of 9 long-form movie classification tasks on the Long Video Understanding (LVU) benchmark.
Abstract
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length). Thus, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often costly and impractical to use. Instead, we propose ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer. Our model uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. By progressively reducing the spatiotemporal feature…
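To make the cost argument in the abstract concrete, here is a minimal sketch in Python with NumPy. It contrasts the number of pairwise scores a self-attention head computes, which grows quadratically with sequence length, with a generic linear state-space recurrence that visits each time step once. This is not the paper's S4 layer: the real S4 uses a structured state matrix and an efficient kernel computation, and the helper names and the tiny 2x2 matrices below are assumptions chosen purely for illustration.

```python
import numpy as np


def attention_score_entries(seq_len: int) -> int:
    """Pairwise scores computed by one self-attention head: seq_len squared."""
    return seq_len * seq_len


def ssm_scan(A: np.ndarray, B: np.ndarray, C: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Run a discrete linear state-space model over a 1-D input sequence.

    x_k = A x_{k-1} + B u_k
    y_k = C x_k

    The loop touches each time step once, so the work is linear in len(u).
    """
    x = np.zeros(A.shape[0])
    outputs = []
    for u_k in u:
        x = A @ x + B * u_k   # state update
        outputs.append(C @ x)  # scalar readout
    return np.array(outputs)


# Hypothetical toy parameters (illustrative only, not taken from the paper).
A = np.array([[0.5, 0.0],
              [0.0, 0.9]])
B = np.array([1.0, 1.0])
C = np.array([1.0, 1.0])

# An impulse at the first step: the state carries its decaying trace forward.
u = np.array([1.0, 0.0, 0.0])
y = ssm_scan(A, B, C, u)

# Doubling the sequence length quadruples the attention score count.
ratio = attention_score_entries(2048) / attention_score_entries(1024)
```

In this toy setup the impulse at step 0 still influences later outputs through the state, which is the sense in which a state-space layer can carry long-range information without comparing every pair of time steps.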
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections
