Efficient Movie Scene Detection using State-Space Transformers
Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

TL;DR
This paper introduces TranS4mer, a novel state-space transformer model that efficiently captures long-range dependencies for accurate movie scene detection, outperforming prior methods while being faster and more memory-efficient.
Contribution
The paper presents TranS4mer, a new model combining structured state-space and self-attention layers for improved long-range video analysis in scene detection.
Findings
Outperforms prior methods on MovieNet, BBC, and OVSD datasets.
Runs about twice as fast as standard Transformer models and uses roughly one-third of the GPU memory.
Effectively captures both intra- and inter-shot dependencies in long movie videos.
Abstract
The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot…
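The two-stage idea described in the abstract (self-attention inside each shot, then a state-space layer across shots) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes fixed-length shots (`shot_len` is a hypothetical parameter), omits learned query/key/value projections, and replaces the S4 layer with a toy one-directional linear recurrence (`state_space_scan`, with a hypothetical `decay` constant) that only mimics how a state-space layer carries information across the whole sequence. All function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def intra_shot_attention(frames, shot_len):
    """Dot-product self-attention restricted to frames of the same shot.

    Assumption: every shot holds exactly `shot_len` consecutive frames, and
    the frame features act as queries, keys, and values (no learned weights).
    """
    dim = len(frames[0])
    out = []
    for start in range(0, len(frames), shot_len):
        shot = frames[start:start + shot_len]
        for q in shot:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in shot]
            weights = softmax(scores)
            out.append([sum(w * v[d] for w, v in zip(weights, shot)) for d in range(dim)])
    return out


def state_space_scan(seq, decay=0.9):
    """Toy stand-in for an S4 layer (not the real S4 parameterization).

    Runs h_t = decay * h_{t-1} + (1 - decay) * x_t over the full sequence,
    so each output mixes in information from all earlier shots.
    """
    h = [0.0] * len(seq[0])
    out = []
    for x in seq:
        h = [decay * hi + (1 - decay) * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out


def s4a_block_sketch(frames, shot_len, decay=0.9):
    """Short-range intra-shot attention followed by a long-range sequence scan."""
    attended = intra_shot_attention(frames, shot_len)
    return state_space_scan(attended, decay)


# Two shots of two frames each, with 2-dimensional frame features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
features = s4a_block_sketch(frames, shot_len=2)
```

In this sketch, the attention cost grows with the square of the shot length rather than the square of the full sequence length, while the scan is linear in sequence length. That is the intuition behind pairing a local attention step with a cheaper long-range layer, though the cost profile of the real model depends on details not shown here.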
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Adam · Layer Normalization · Softmax · Byte Pair Encoding · Residual Connection · Label Smoothing
