MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding
Ran Ran, Jiwei Wei, Shuchang Zhou, Yitong Qin, Shiyuan He, Zeyu Ma, Yuyang Zhou, and Yang Yang

TL;DR
MASRA is a training-time framework that uses a multimodal large language model (MLLM) to improve video temporal grounding (VTG). It aligns semantic and relational information between the query and the video, which improves the discriminability and consistency of temporal semantics.
Contribution
The paper introduces MASRA, a training-time MLLM-assisted framework for VTG that strengthens semantic and relational alignment without requiring the MLLM at inference.
Findings
MASRA outperforms existing VTG methods in experiments.
Ablation studies confirm the effectiveness of each component.
MASRA improves span-level separability and temporal consistency.
Abstract
Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a…
