When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Pengcheng Fang; Yuxia Chen; Rui Guo

arXiv:2508.15641 · cs.CV · August 22, 2025

TL;DR

This paper introduces Grounded VideoDiT, a Video LLM that improves temporal perception and entity grounding in long videos through diffusion-based temporal encoding, explicit object-grounded representations, and discrete timestamp tokens. It reports state-of-the-art results on Charades-STA, NExT-GQA, and VideoQA benchmarks.

Contribution

The paper proposes three components for temporal and entity-aware video understanding: a Diffusion Temporal Latent (DTL) encoder, object-grounded representations, and a mixed token scheme with discrete temporal tokens.

Findings

01

Achieves state-of-the-art results on Charades-STA, NExT-GQA, and VideoQA benchmarks.

02

Demonstrates improved temporal boundary detection and entity alignment.

03

Enhances fine-grained temporal reasoning in long videos.

Abstract

Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides…
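
To make the idea of discrete temporal tokens concrete, here is a minimal Python sketch of one way timestamps could be quantized into a small token vocabulary and interleaved with per-frame tokens. It is an illustration only, not the paper's implementation: the bin count, the `<T_k>` token format, and the helper names `time_to_token`, `token_to_time`, and `mix_tokens` are all assumptions introduced here.

```python
def time_to_token(t_sec, duration_sec, num_bins=100):
    """Quantize a timestamp into a discrete token (hypothetical '<T_k>' format)."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    # Normalize to [0, 1] so the same vocabulary covers videos of any length.
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    # Clamp so that t_sec == duration_sec falls into the last bin.
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<T_{idx}>"


def token_to_time(token, duration_sec, num_bins=100):
    """Decode a '<T_k>' token back to the center of its time bin, in seconds."""
    idx = int(token[3:-1])  # strip the leading '<T_' and trailing '>'
    return (idx + 0.5) / num_bins * duration_sec


def mix_tokens(frame_tokens, frame_times, duration_sec, num_bins=100):
    """Interleave each frame's placeholder token with its temporal token."""
    mixed = []
    for tok, t in zip(frame_tokens, frame_times):
        mixed.append(time_to_token(t, duration_sec, num_bins))
        mixed.append(tok)
    return mixed
```

Under these assumptions, a predicted segment is read off by decoding a start token and an end token with `token_to_time`. The decoded time is accurate to within half a bin, so a larger `num_bins` trades vocabulary size for temporal resolution.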

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition