MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang

TL;DR
MLVTG introduces a novel multi-modal video temporal grounding framework using Mamba-based feature alignment and LLM-driven semantic purification, achieving state-of-the-art results on multiple datasets.
Contribution
The paper proposes MLVTG, combining MambaAligner and LLMRefiner modules to improve multi-modal alignment and temporal localization without extensive fine-tuning.
Findings
Achieves state-of-the-art performance on QVHighlights, Charades-STA, and TVSum datasets.
Outperforms existing baselines significantly in video temporal grounding tasks.
Demonstrates effective integration of structured state-space models and pre-trained LLMs.
Abstract
Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual…
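To make the phrase "structured state-space dynamics" concrete, here is a minimal sketch of the linear recurrence that state-space sequence models build on. It is not MambaAligner or a Vision Mamba block: it is scalar and non-selective (Mamba makes its parameters depend on the input), and the function name ssm_scan, the parameters a, b, c, and the clip feature values are all hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    Assumption: fixed scalar parameters; this is a toy, not the paper's model.
    """
    h = 0.0  # hidden state starts at zero
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # carry past context forward, mix in the new input
        ys.append(c * h)
    return ys


# Hypothetical per-clip feature values (one scalar per video clip).
clip_features = [1.0, 0.0, 0.0, 2.0]
outputs = ssm_scan(clip_features, a=0.5, b=1.0, c=1.0)
print(outputs)  # [1.0, 0.5, 0.25, 2.125]
```

Each step touches the state once, so the cost grows linearly with the number of clips, whereas full self-attention compares every pair of clips.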
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
