Multi-scale 2D Representation Learning for weakly-supervised moment retrieval
Ding Li, Rui Wu, Yongqiang Tang, Zhizhong Zhang, Wensheng Zhang

TL;DR
This paper introduces a multi-scale 2D representation learning approach for weakly supervised video moment retrieval. It captures temporal dependencies between moment candidates across scales and improves retrieval accuracy without requiring temporal boundary annotations.
Contribution
It proposes a novel 2D map construction for temporal dependencies and a candidate selection method, advancing weakly-supervised video retrieval techniques.
Findings
Achieves superior performance on Charades-STA and ActivityNet Captions datasets.
Effectively models temporal dependencies across multiple scales.
Outperforms state-of-the-art weakly-supervised methods.
Abstract
Video moment retrieval aims to find the moment most relevant to a given language query. However, most existing methods require temporal boundary annotations, which are expensive and time-consuming to label. Hence, weakly supervised methods that use only coarse video-level labels have recently been put forward. Despite their effectiveness, these methods usually process moment candidates independently, ignoring a critical issue: the natural temporal dependencies between candidates at different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. The two dimensions of this map indicate the start and end time points of these candidates. Then,…
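The 2D map described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean pooling of clip features, and the way scales are formed (splitting the video into a different number of clips per scale) are all assumptions made for illustration. The only part taken from the abstract is the layout, where one axis indexes a candidate's start and the other its end, with one map per temporal scale. Plain Python lists are used so the sketch runs without dependencies.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def split_into_clips(frame_feats, num_clips):
    """Average frames into num_clips contiguous clips.

    Assumes len(frame_feats) >= num_clips so that no clip is empty.
    """
    total = len(frame_feats)
    bounds = [k * total // num_clips for k in range(num_clips + 1)]
    return [mean_pool(frame_feats[bounds[k]:bounds[k + 1]])
            for k in range(num_clips)]


def build_2d_map(clip_feats):
    """Build an N x N grid where grid[start][end] holds the feature of the
    candidate moment spanning clips start..end (inclusive).

    Cells with start > end are not valid moments and stay None.
    Mean pooling is an assumption chosen for this sketch.
    """
    n = len(clip_feats)
    grid = [[None] * n for _ in range(n)]
    for start in range(n):
        for end in range(start, n):
            grid[start][end] = mean_pool(clip_feats[start:end + 1])
    return grid


def multi_scale_maps(frame_feats, scales=(4, 8)):
    """One 2D map per temporal scale (hypothetical choice: one scale per
    clip count)."""
    return {n: build_2d_map(split_into_clips(frame_feats, n)) for n in scales}
```

In this layout, candidates that share a start or an end sit in the same row or column, and candidates with similar boundaries are neighbours in the grid, which is what makes the dependencies between overlapping candidates accessible to a model operating on the map.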
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
