MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

Zhiyi Zhu; Xiaoyu Wu; Zihao Liu; Linlin Yang

arXiv:2506.08512·cs.CV·January 28, 2026

TL;DR

MLVTG is a multi-modal video temporal grounding framework that combines Mamba-based feature alignment with LLM-driven semantic purification, reporting state-of-the-art results on QVHighlights, Charades-STA, and TVSum.

Contribution

The paper proposes MLVTG, which combines two modules, MambaAligner and LLMRefiner, to improve multi-modal alignment and temporal localization. LLMRefiner relies on a frozen layer of a pre-trained LLM, so the LLM itself is not fine-tuned.

Findings

01. Achieves state-of-the-art performance on the QVHighlights, Charades-STA, and TVSum datasets.

02. Outperforms existing baselines significantly in video temporal grounding tasks.

03. Demonstrates effective integration of structured state-space models and pre-trained LLMs.

Abstract

Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual…
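
To make the phrase "structured state-space dynamics" more concrete, the following is a minimal sketch of a selective state-space recurrence of the kind that Mamba-style blocks build on. It is not the authors' MambaAligner: the function name `selective_scan`, the scalar state, and the parameters `w_delta`, `w_b`, `w_c`, and `a` are all assumptions chosen for illustration, and only the step size is made input-dependent here.

```python
import math


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0, a=-1.0):
    """Scalar selective state-space recurrence (illustrative only).

    For each time step t:
        delta_t = softplus(w_delta * x_t)   # input-dependent step size
        a_bar_t = exp(delta_t * a)          # discretized decay, in (0, 1) for a < 0
        h_t     = a_bar_t * h_{t-1} + delta_t * w_b * x_t
        y_t     = w_c * h_t
    All parameter names are hypothetical, not taken from the paper.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)
        a_bar = math.exp(delta * a)
        h = a_bar * h + delta * w_b * x
        ys.append(w_c * h)
    return ys


if __name__ == "__main__":
    # A single impulse followed by zeros: the hidden state decays over time.
    print(selective_scan([1.0, 0.0, 0.0]))
```

The loop visits each time step once, so the cost grows linearly with sequence length. A real Mamba block uses vector-valued states, input-dependent input and output projections, and a parallel scan, none of which is shown here.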

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces