ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

Ji-Hyeon Kim; Ho-Joong Kim; Seong-Whan Lee

arXiv:2604.27591 · cs.CV · May 1, 2026

TL;DR

ClipTBP is a novel framework for video moment retrieval that enhances boundary prediction accuracy and semantic alignment by considering relationships between multiple answer segments and employing boundary-aware learning.

Contribution

It introduces clip-level alignment loss and auxiliary boundary loss, improving robustness and performance over existing models in ambiguous query scenarios.

Findings

01. Consistently improves performance across various models.

02. Demonstrates more robust boundary prediction in ambiguous scenarios.

03. Enhances semantic relationship learning between answer segments.

Abstract

Video moment retrieval is the task of retrieving the segments of a video that correspond to a given text query. Recent studies have improved multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, existing models compute similarity at the snippet level and ignore the relationships between multiple answer segments that match a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on…
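The limitation of snippet-level similarity described in the abstract can be illustrated with a toy sketch. This is not ClipTBP's method or any baseline's actual code: the two-dimensional feature vectors, the threshold of 0.8, and the helper names `cosine`, `snippet_scores`, and `segments_above` are all hypothetical, chosen only to show how scoring each snippet independently lets a visually similar distractor pass as an answer segment.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def snippet_scores(query, snippets):
    """Score every snippet against the query independently (snippet-level)."""
    return [cosine(query, s) for s in snippets]

def segments_above(scores, threshold):
    """Group consecutive snippets scoring >= threshold into inclusive (start, end) spans."""
    segments, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments

# Toy features (assumed, not from the paper): the first axis loosely stands for
# "matches the query", the second for "unrelated content".
query = [1.0, 0.0]
snippets = [
    [0.0, 1.0],  # 0: background
    [1.0, 0.1],  # 1: answer segment A
    [1.0, 0.0],  # 2: answer segment A
    [0.1, 1.0],  # 3: background
    [0.9, 0.5],  # 4: visually similar distractor
    [0.0, 1.0],  # 5: background
    [1.0, 0.2],  # 6: answer segment B
]

scores = snippet_scores(query, snippets)
segments = segments_above(scores, threshold=0.8)
print(segments)
```

In this toy setup the distractor at index 4 clears the threshold and is returned alongside the two true answer segments, because each snippet is scored in isolation and nothing relates the candidate segments to one another.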
