Self-Supervised Video Representation Learning by Video Incoherence Detection

Haozhi Cao; Yuecong Xu; Jianfei Yang; Kezhi Mao; Lihua Xie; Jianxiong Yin; Simon See

arXiv:2109.12493·cs.CV·September 28, 2021·1 cites

Open Access

TL;DR

This paper presents a self-supervised video representation learning approach that detects incoherence within videos, enabling the model to understand high-level video semantics and improve performance on action recognition and retrieval tasks.

Contribution

It introduces a novel incoherence detection framework combined with intra-video contrastive learning for enhanced self-supervised video representation learning.
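The intra-video contrastive term can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss, a common way to maximize a lower bound on mutual information; the summary does not state the exact loss used in the paper. The function names, the cosine similarity, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (assumed loss form).

    anchor, positive: embeddings of two incoherent clips from the same video.
    negatives: embeddings of clips drawn from other videos.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy 2-D embeddings: the loss is small when the positive matches the anchor.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the loss falls as the two clips from the same video move closer together in embedding space relative to clips from other videos.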

Findings

01. Achieves state-of-the-art results on multiple datasets.

02. Outperforms previous coherence-based methods.

03. Effective across various backbone networks.

Abstract

This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video, with various lengths of incoherence between them. The network is trained to learn high-level representations by predicting the location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval utilizing…
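The construction of an incoherent clip and its two prediction targets can be sketched as follows. This is a simplified illustration, not the paper's sampling procedure: it assumes only two sub-clips with a single gap, whereas the abstract describes multiple hierarchically sampled sub-clips. The function name and the parameters `clip_len` and `max_gap` are hypothetical.

```python
import random


def make_incoherent_clip(num_frames, clip_len=16, max_gap=4, rng=None):
    """Return (frame_indices, location, gap) for one incoherent clip.

    Simplifying assumption: two sub-clips of consecutive frames taken from
    the same video, with `gap` frames skipped between them.
    location: index within the clip of the first frame after the jump.
    gap: number of skipped frames (the length of incoherence).
    """
    rng = rng or random.Random()
    gap = rng.randint(1, max_gap)
    location = rng.randint(1, clip_len - 1)
    span = clip_len + gap  # raw frames covered, including the skipped ones
    if num_frames < span:
        raise ValueError("video is too short for the requested clip")
    start = rng.randint(0, num_frames - span)
    first = list(range(start, start + location))
    second_start = start + location + gap
    second = list(range(second_start, second_start + clip_len - location))
    return first + second, location, gap


# The pair (location, gap) would serve as the self-supervised labels.
indices, location, gap = make_incoherent_clip(64, rng=random.Random(0))
```

Under these assumptions, a network would receive the frames at `indices` and be trained to predict `location` and `gap`, so no manual annotation is needed.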

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Contrastive Learning