DisTime: Distribution-based Time Representation for Video Large Language Models

Yingsen Zeng; Zepeng Huang; Yujie Zhong; Chengjian Feng; Jie Hu; Lin Ma; Yang Liu

arXiv:2505.24329·cs.CV·August 1, 2025

TL;DR

DisTime introduces a novel distribution-based temporal representation framework for Video-LLMs, improving temporal localization and understanding by using continuous embeddings and a new dataset with extensive temporally grounded annotations.

Contribution

The paper presents DisTime, a lightweight framework with a distribution-based time decoder and encoder, along with a large annotated dataset, to enhance temporal comprehension in Video-LLMs.

Findings

01. Achieves state-of-the-art results on multiple temporal tasks.
02. Creates InternVid-TG, a dataset with 1.25 million temporally grounded events.
03. Maintains competitive performance in Video QA tasks.

Abstract

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time…
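The abstract describes the decoder and encoder only at a high level, so the following is a minimal sketch of the general idea rather than the paper's implementation. It assumes time is normalized to [0, 1], that the decoder outputs logits over a fixed set of evenly spaced anchor points, and that a continuous timestamp is read out as the expected value of the resulting distribution. The names `decode_time`, `encode_time`, and `expected_time` and the choice of anchors are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anchors(num_bins):
    """Evenly spaced anchor points on normalized time [0, 1] (an assumption)."""
    return [i / (num_bins - 1) for i in range(num_bins)]


def expected_time(dist):
    """Continuous timestamp as the expectation of a distribution over anchors."""
    return sum(p * a for p, a in zip(dist, anchors(len(dist))))


def decode_time(logits):
    """Hypothetical decoder readout: logits -> distribution -> expected timestamp."""
    return expected_time(softmax(logits))


def encode_time(t, num_bins):
    """Hypothetical encoder target: spread a timestamp over its two nearest
    anchors so that the expectation of the distribution equals t."""
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be a normalized time in [0, 1]")
    pos = t * (num_bins - 1)
    lo = min(int(math.floor(pos)), num_bins - 2)
    frac = pos - lo
    dist = [0.0] * num_bins
    dist[lo] = 1.0 - frac
    dist[lo + 1] = frac
    return dist


if __name__ == "__main__":
    duration_s = 120.0  # illustrative video length in seconds
    dist = encode_time(0.37, num_bins=32)
    print(expected_time(dist) * duration_s)  # timestamp recovered in seconds
```

Reading the timestamp as an expectation keeps the output continuous even though the distribution is defined over a finite set of anchors, which is one way a distribution can express boundary uncertainty without committing to a single discrete token.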

Code & Models

Repositories

josephzpng/distime (PyTorch, official)

Models

Datasets

yingsen/internvid-tg
dataset · 982 downloads

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Attentive Walk-Aggregating Graph Neural Network