MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding

Ran Ran, Jiwei Wei, Shuchang Zhou, Yitong Qin, Shiyuan He, Zeyu Ma, Yuyang Zhou, and Yang Yang

arXiv:2605.03398 · cs.CV · May 6, 2026

TL;DR

MASRA is a training framework that uses a multimodal large language model (MLLM), at training time only, to improve video temporal grounding (VTG). It aligns semantic and relational information so that temporal semantics become more discriminable and more consistent.

Contribution

The paper introduces MASRA, a training-time MLLM-assisted framework for VTG that strengthens semantic and relational alignment without requiring the MLLM at inference.

Findings

01 MASRA outperforms existing VTG methods in experiments.

02 Ablation studies confirm the effectiveness of each component.

03 MASRA improves span-level separability and temporal consistency.

Abstract

Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a…
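To make the two textual priors named in the abstract concrete, here is a minimal sketch of how event-level descriptions with temporal spans and clip-level captions might be represented, and how an event span could be mapped onto fixed-length clips. Everything in it is an assumption made for illustration: the `EventPrior` schema, the `span_to_clip_mask` helper, the 2-second clip length, and the example captions do not come from the paper, and the overlap rule is a common VTG convention rather than MASRA's stated procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventPrior:
    """Event-level description with a temporal span in seconds (hypothetical schema)."""
    description: str
    start: float
    end: float


def span_to_clip_mask(event: EventPrior, num_clips: int, clip_len: float) -> List[int]:
    """Return 1 for each fixed-length clip whose time window overlaps the event span."""
    mask = []
    for i in range(num_clips):
        clip_start, clip_end = i * clip_len, (i + 1) * clip_len
        # Two intervals overlap when each one starts before the other ends.
        overlaps = clip_start < event.end and clip_end > event.start
        mask.append(1 if overlaps else 0)
    return mask


# Hypothetical priors for a 10-second video split into five 2-second clips.
events = [EventPrior("a person opens the fridge", 2.0, 5.0)]
clip_captions = [
    "an empty kitchen",
    "a hand on the fridge door",
    "the fridge door opens",
    "a person looks inside",
    "the door closes",
]
masks = [
    span_to_clip_mask(e, num_clips=len(clip_captions), clip_len=2.0)
    for e in events
]
```

A mask like this is one plausible way to pair each event description with the clips inside its span, which is the kind of correspondence a span-level alignment objective would need as supervision.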
