TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Jun Zhang; Teng Wang; Yuying Ge; Yixiao Ge; Xinhao Li; Ying Shan; Limin Wang

arXiv:2512.14698 · cs.CV · March 27, 2026

Open Access · 2 Models · 2 Datasets

TL;DR

TimeLens systematically improves video temporal grounding (VTG) by addressing data quality issues and exploring algorithmic design, yielding models that outperform existing solutions and set a new standard for open-source VTG performance.

Contribution

The paper introduces TimeLens, a systematic investigation that strengthens VTG by re-annotating three popular benchmarks (TimeLens-Bench), creating high-quality training data, and developing effective algorithmic strategies.

Findings

01. TimeLens models achieve state-of-the-art VTG performance.

02. Re-annotated benchmarks reveal the unreliability of previous evaluations.

03. High-quality training data significantly improve model accuracy.

Abstract

This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy…
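As background for the benchmark discussion: the abstract excerpt does not say which metrics TimeLens-Bench reports. VTG evaluations commonly score a predicted segment against the annotated one using temporal intersection-over-union (IoU), then report the share of queries that clear an IoU threshold. The sketch below illustrates that common convention only. The names `temporal_iou` and `recall_at_iou` are illustrative and are not taken from the paper or its code.

```python
from typing import Sequence, Tuple

# A segment is (start_sec, end_sec) with start_sec <= end_sec (assumed).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Span], gts: Sequence[Span], threshold: float
) -> float:
    """Fraction of queries whose single prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Hypothetical predictions and annotations, in seconds.
    predictions = [(0.0, 10.0), (0.0, 5.0)]
    annotations = [(0.0, 10.0), (5.0, 10.0)]
    print(recall_at_iou(predictions, annotations, threshold=0.5))
```

Because the score depends directly on the annotated boundaries, shifting or correcting a ground-truth segment can change which predictions count as hits, which is one way re-annotation can reorder models.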

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization