Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval
Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide

TL;DR
This paper introduces MCMT, a weakly-supervised method for Video Moment Retrieval (VMR) that generates multiple proposals, derives learnable Gaussian masks from them, and uses multi-task training to improve accuracy and stability.
Contribution
The paper proposes a new method combining multi-proposal generation, Gaussian masks, and multi-task training to enhance weakly-supervised VMR performance and stability.
Findings
Outperforms existing methods on standard benchmarks
Produces high-quality temporal proposals and stable retrieval results
Effectively distinguishes relevant video segments without temporal annotations
Abstract
This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create…
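As a rough illustration of the "proposal to Gaussian mask" step mentioned in the abstract, the sketch below turns each proposal, described by a normalized center and width, into a soft weight vector over clip positions. This is not the authors' implementation: the function name `gaussian_masks`, the `(center, width)` parameterization, and the `sigma` scaling factor are all assumptions made for illustration. In a real model the centers and widths would presumably be predicted by a network so that the masks are learnable. How the masks are subsequently combined is not shown, because the abstract is truncated at that point.

```python
import math

def gaussian_masks(centers, widths, num_clips, sigma=9.0):
    """Return one Gaussian weight vector over clip positions per proposal.

    Hypothetical helper, not taken from the paper:
    - centers, widths: proposal centers and widths, normalized to [0, 1]
    - num_clips: number of video clips (must be >= 2)
    - sigma: assumed scaling factor; a larger value gives a sharper mask
    """
    # Clip positions spread evenly over the normalized video timeline.
    positions = [i / (num_clips - 1) for i in range(num_clips)]
    masks = []
    for c, w in zip(centers, widths):
        # Guard against a zero width so the division stays well defined.
        std = max(w, 1e-2) / sigma
        masks.append([math.exp(-0.5 * ((t - c) / std) ** 2) for t in positions])
    return masks

# Two proposals over a five-clip video.
masks = gaussian_masks(centers=[0.25, 0.5], widths=[0.2, 0.4], num_clips=5)
for row in masks:
    print([round(v, 3) for v in row])
```

Each mask peaks at the clip nearest its proposal center and decays smoothly away from it, which is what would let gradients flow to the center and width parameters in a trainable setting.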
