Scene Consistency Representation Learning for Video Scene Segmentation
Haoqian Wu, Keyu Chen, Yanan Luo, Ruizhi Qiao, Bo Ren, Haozhe Liu, Weicheng Xie, Linlin Shen

TL;DR
This paper introduces a self-supervised learning framework that enhances shot representations for long-term video scene segmentation, achieving state-of-the-art results without relying on explicit boundary annotations.
Contribution
It proposes a novel SSL scheme for scene consistency, utilizing data augmentation and a less biased temporal model, along with a new benchmark for fair evaluation.
Findings
Achieved state-of-the-art performance on video scene segmentation
Introduced a self-supervised approach for shot representation learning
Provided a fairer benchmark for evaluating segmentation methods
Abstract
A long-term video, such as a movie or TV show, is composed of various scenes, each of which represents a series of shots sharing the same semantic story. Spotting the correct scene boundary from the long-term video is a challenging task, since a model must understand the storyline of the video to figure out where a scene starts and ends. To this end, we propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from unlabeled long-term videos. More specifically, we present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability. Instead of explicitly learning the scene boundary features as in the previous methods, we introduce a vanilla temporal model with less inductive bias to verify the quality of the shot features. Our method achieves the…
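To make the idea of scene consistency concrete, here is a minimal sketch of a generic contrastive objective over shot embeddings. It is not the authors' implementation. The abstract does not say how positive pairs are chosen or which loss is used, so the nearest-neighbour selection in `pick_positive`, the InfoNCE loss in `info_nce`, the `window` and `temperature` parameters, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_positive(shots, i, window=2):
    """Hypothetical positive selection: the shot within `window`
    positions of shot i that is most similar to it."""
    lo, hi = max(0, i - window), min(len(shots), i + window + 1)
    candidates = [j for j in range(lo, hi) if j != i]
    return max(candidates, key=lambda j: cosine(shots[i], shots[j]))


def info_nce(shots, i, pos, temperature=0.1):
    """InfoNCE loss for query shot i with positive shot `pos`.
    Every other shot in the list acts as a negative."""
    logits = [
        cosine(shots[i], shots[j]) / temperature
        for j in range(len(shots))
        if j != i
    ]
    pos_logit = cosine(shots[i], shots[pos]) / temperature
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Toy embeddings (assumed): three shots from one scene, two from another.
shots = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],  # scene A
    [0.0, 1.0], [0.1, 0.9],                # scene B
]
pos = pick_positive(shots, 1)
loss = info_nce(shots, 1, pos)
```

In this toy setup the loss is smaller when the positive comes from the same scene as the query than when it comes from the other scene, which is the behaviour a scene-consistency objective is meant to reward.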
Taxonomy
Topics: Video Analysis and Summarization · Image and Video Quality Assessment · Human Pose and Action Recognition
