Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Yunzhuo Sun; Xinyue Liu; Yanyang Li; Nanding Wu; Yifang Xu; Linlin Zong; Xianchao Zhang; Wenxin Liang

arXiv:2603.22121 · cs.CV · March 24, 2026

TL;DR

Mamba-VMR introduces a two-stage multimodal framework that leverages generated videos and subtitle context to improve the accuracy and efficiency of text-driven video moment retrieval in long sequences.

Contribution

The paper proposes a novel two-stage approach combining subtitle-guided auxiliary video generation and a multi-modal network for enhanced temporal grounding, addressing limitations of previous methods.

Findings

01. Significant improvement in recall on the TVR benchmark.

02. Reduced computational costs compared to existing models.

03. Effective integration of generated motion priors and subtitle context.

Abstract

Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively. We therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, which are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba…
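The first stage described in the abstract (match subtitles to the query, then fuse them into a prompt for a text-to-video model) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses an LLM for matching, whereas the sketch substitutes a simple Jaccard token-overlap score, and it stops at building the prompt string rather than generating any video. The names `Subtitle`, `match_subtitles`, and `build_t2v_prompt`, the `top_k` parameter, and the `Context:` prompt format are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def tokens(text: str) -> Set[str]:
    """Lowercased word set. A crude stand-in for LLM-based relevance (assumption)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def match_subtitles(query: str, subtitles: List[Subtitle], top_k: int = 2) -> List[Subtitle]:
    """Rank subtitles by Jaccard overlap with the query; keep up to top_k with nonzero overlap."""
    q = tokens(query)
    scored = []
    for sub in subtitles:
        s = tokens(sub.text)
        union = q | s
        score = len(q & s) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, sub))
    # Stable sort on score only, so ties keep their original subtitle order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sub for _, sub in scored[:top_k]]


def build_t2v_prompt(query: str, matched: List[Subtitle]) -> str:
    """Fuse the query with matched subtitle text (hypothetical prompt format)."""
    if not matched:
        return query
    context = " ".join(sub.text for sub in matched)
    return f"{query} Context: {context}"


subs = [
    Subtitle(0.0, 2.0, "I left my keys in the kitchen"),
    Subtitle(5.0, 7.0, "Nice weather today"),
    Subtitle(10.0, 12.0, "She picks up the keys"),
]
query = "The woman picks up her keys"
matched = match_subtitles(query, subs, top_k=1)
prompt = build_t2v_prompt(query, matched)
```

In a pipeline of the kind the abstract outlines, a string like `prompt` would be passed to a text-to-video model, and the resulting short clip would accompany the original query into the second-stage grounding network.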

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques