MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Seojeong Park; Jiho Choi; Kyungjune Baek; Hyunjung Shim

arXiv:2412.20816·cs.CV·February 27, 2026

TL;DR

This paper introduces MomentMix augmentation and a length-aware decoder to better localize short moments in video moment retrieval, significantly improving the performance of DETR-based models.

Contribution

It proposes MomentMix augmentation strategies and a length-aware decoder to address short moment localization challenges in video retrieval.

Findings

01 Outperforms state-of-the-art DETR-based methods on benchmark datasets.

02 Achieves a 9.62% gain in R1@… on QVHighlights.

03 Improves mAP by 16.9% on QVHighlights.
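
For readers unfamiliar with the recall figure quoted above: in moment retrieval, R1 at an IoU threshold is conventionally the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU at or above that threshold. The following is a minimal sketch of that convention, not the paper's evaluation code. The function names, the `(start, end)` span representation, and the simplification to a single ground-truth moment per query are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: a moment is a (start, end) pair in seconds, with start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: exactly one ground-truth span per query.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one top-1 prediction per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Hypothetical data: the first prediction is exact, the second overlaps by a third.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    truth = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(preds, truth, iou_threshold=0.5))
```

Short moments are penalized heavily under this metric because a small absolute error in the predicted center or length removes a large share of the overlap, which is consistent with the abstract's focus on center and length prediction.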

Abstract

Video Moment Retrieval (MR) aims to localize moments within a video based on a given natural language query. Given the prevalent use of platforms like YouTube for information retrieval, the demand for MR techniques is significantly growing. Recent DETR-based models have made notable advances in performance but still struggle with accurately localizing short moments. Through data analysis, we identified limited feature diversity in short moments, which motivated the development of MomentMix. MomentMix generates new short-moment samples by employing two augmentation strategies: ForegroundMix and BackgroundMix, each enhancing the ability to understand the query-relevant and irrelevant frames, respectively. Additionally, our analysis of prediction bias revealed that short moments particularly struggle with accurately predicting their center positions and length of moments. To address this,…

Code & Models

Repositories

sjpark5800/la-detr (Official)

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications