ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval
Ji-Hyeon Kim, Ho-Joong Kim, Seong-Whan Lee

TL;DR
ClipTBP is a novel framework for video moment retrieval that enhances boundary prediction accuracy and semantic alignment by considering relationships between multiple answer segments and employing boundary-aware learning.
Contribution
ClipTBP introduces a clip-level alignment loss and an auxiliary boundary loss, improving robustness and performance over existing models in ambiguous query scenarios.
Findings
Consistently improves performance across various models.
Demonstrates more robust boundary prediction in ambiguous scenarios.
Enhances semantic relationship learning between answer segments.
Abstract
Video moment retrieval is the task of retrieving the specific segments of a video that correspond to a given text query. Recent studies improve multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, existing models compute similarity at the snippet level and ignore the relationships between multiple answer segments that correspond to a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on…
