Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

Bolin Zhang; Chao Yang; Bin Jiang; Takahiro Komamizu; Ichiro Ide

arXiv:2605.14838·cs.CV·May 15, 2026

TL;DR

This paper introduces MCMT (Multi-proposal Collaboration and Multi-task Training), a weakly-supervised method for Video Moment Retrieval (VMR) that generates multiple temporal proposals, derives learnable Gaussian masks from them, and uses multi-task training to improve accuracy and stability.

Contribution

The paper proposes a new method combining multi-proposal generation, Gaussian masks, and multi-task training to enhance weakly-supervised VMR performance and stability.

Findings

01 Outperforms existing methods on standard benchmarks

02 Produces high-quality temporal proposals and stable retrieval results

03 Effectively distinguishes relevant video segments without temporal annotations

Abstract

This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create…
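
The step of deriving Gaussian masks from multiple proposals can be illustrated with a minimal sketch. It assumes each proposal is a normalized (center, width) pair and that the mask is a Gaussian over clip positions. The function name `gaussian_masks`, the `sigma_scale` parameter, and the example values are hypothetical and not taken from the paper. In an actual model the centers and widths would presumably be predicted by the network, so the masks stay differentiable with respect to them. How the masks are combined is not shown, since the abstract is truncated at that point.

```python
import numpy as np


def gaussian_masks(centers, widths, num_clips, sigma_scale=9.0):
    """Build one Gaussian mask per proposal over `num_clips` clip positions.

    centers, widths: sequences of length K, normalized to [0, 1].
    sigma_scale: hypothetical hyperparameter; larger values give sharper masks.
    Returns an array of shape (K, num_clips) with values in (0, 1].
    """
    centers = np.asarray(centers, dtype=np.float64)[:, None]  # (K, 1)
    widths = np.asarray(widths, dtype=np.float64)[:, None]    # (K, 1)
    t = np.linspace(0.0, 1.0, num_clips)[None, :]             # (1, T)
    # Clamp widths so the standard deviation never collapses to zero.
    sigma = np.clip(widths, 1e-2, None) / sigma_scale
    return np.exp(-0.5 * ((t - centers) / sigma) ** 2)


# Three illustrative proposals over a video split into 9 clips.
masks = gaussian_masks(
    centers=[0.25, 0.5, 0.75],
    widths=[0.2, 0.4, 0.2],
    num_clips=9,
)
```

Each row of `masks` peaks at the clip closest to its proposal center and decays smoothly on both sides, which is what lets a soft mask stand in for a hard temporal boundary when no temporal annotations are available.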
