Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Zhiwu Qing; Shiwei Zhang; Ziyuan Huang; Yi Xu; Xiang Wang; Mingqian Tang; Changxin Gao; Rong Jin; Nong Sang

arXiv:2204.03017·cs.CV·April 8, 2022

Open Access

TL;DR

This paper introduces HiCo, a hierarchical consistency learning framework that learns from untrimmed videos by capturing visual consistency between clips separated by a short time span and topical consistency between clips separated by a long time span, yielding stronger video representations than standard contrastive learning.

Contribution

The paper proposes a novel hierarchical consistency learning framework, HiCo, that effectively utilizes untrimmed videos for self-supervised representation learning, surpassing existing trimmed-video-based approaches.

Findings

01 HiCo produces stronger video representations from untrimmed videos.

02 It improves representation quality when applied to trimmed videos.

03 Hierarchical consistency learning outperforms standard contrastive methods.

Abstract

Natural videos provide rich visual contents for self-supervised learning. Yet most existing approaches for learning spatio-temporal representations rely on manually trimmed videos, leading to limited diversity in visual patterns and limited performance gain. In this work, we aim to learn representations by leveraging more abundant information in untrimmed videos. To this end, we propose to learn a hierarchy of consistencies in videos, i.e., visual consistency and topical consistency, corresponding respectively to clip pairs that tend to be visually similar when separated by a short time span and share similar topics when separated by a long time span. Specifically, a hierarchical consistency learning framework HiCo is presented, where the visually consistent pairs are encouraged to have the same representation through contrastive learning, while the topically consistent pairs are…
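The visual-consistency half of the abstract (clip pairs separated by a short time span are pulled together through contrastive learning) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration rather than the authors' implementation: the helper names `sample_clip_pair` and `info_nce`, the frame-based `max_gap` threshold standing in for a "short time span", the use of InfoNCE as the contrastive loss, the temperature of 0.1, and the toy 2-D embeddings. The topical-consistency branch is not sketched because the abstract is truncated before it is described.

```python
import math
import random


def sample_clip_pair(num_frames, clip_len, max_gap, rng):
    """Return start frames of two clips whose starts are at most max_gap apart.

    Hypothetical helper: max_gap stands in for the "short time span" of a
    visually consistent pair; the value used by the paper is not given here.
    """
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    first = rng.randint(0, last_start)
    low = max(0, first - max_gap)
    high = min(last_start, first + max_gap)
    second = rng.randint(low, high)
    return first, second


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    positives acts as a negative for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


# Usage: sample one nearby clip pair from a 3000-frame untrimmed video.
rng = random.Random(0)
first, second = sample_clip_pair(num_frames=3000, clip_len=16, max_gap=30, rng=rng)

# Toy embeddings: the loss is small when each anchor matches its own positive
# and large when the positives are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])
```

In a real pipeline the embeddings would come from a video encoder applied to the sampled clips, and the loss would be computed with an autodiff framework; plain Python is used here only to keep the sketch dependency-free.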

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research · Multimodal Machine Learning Applications

Methods: Contrastive Learning · Contrastive Language-Image Pre-training