Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Daizong Liu; Xiaoye Qu; Yinzhen Wang; Xing Di; Kai Zou; Yu Cheng; Zichuan Xu; Pan Zhou

arXiv:2201.05307·cs.CV·January 17, 2022

TL;DR

This paper introduces an unsupervised deep learning approach for temporal video grounding that leverages semantic clustering to localize video segments without relying on paired annotations.

Contribution

It presents what the authors describe as the first unsupervised model for TVG, using semantic mining and aggregation modules to make use of unpaired query data.

Findings

01

Achieves competitive results on ActivityNet Captions and Charades-STA datasets.

02

Outperforms most weakly-supervised methods despite training without paired annotations.

03

Demonstrates the feasibility of unsupervised video grounding with semantic clustering.

Abstract

Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although prior work has made substantial progress on this task, it relies heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization