A Survey on Video Temporal Grounding with Multimodal Large Language Model

Jianlong Wu; Wei Liu; Ye Liu; Meng Liu; Liqiang Nie; Zhouchen Lin; Chang Wen Chen

arXiv:2508.10922 · cs.CV · August 18, 2025

TL;DR

This survey reviews recent work on video temporal grounding with multimodal large language models, covering their architecture, training strategies, and feature processing, and highlights their strong zero-shot and multi-task performance.

Contribution

The survey provides a comprehensive taxonomy and analysis of VTG-MLLMs, fills a gap left by broader video-language surveys, and outlines future research directions.

Findings

01

VTG-MLLMs are gradually surpassing traditional fine-tuned methods and generalize well across zero-shot, multi-task, and multi-domain settings.

02

The survey summarizes current benchmarks and evaluation protocols.

03

The survey identifies limitations and proposes future research avenues.

Abstract

The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task…
