When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
Pengcheng Fang, Yuxia Chen, Rui Guo

TL;DR
This paper introduces Grounded VideoDiT, a Video LLM that improves temporal perception and entity grounding in long videos through diffusion-based encoding, explicit entity representation, and timestamp modeling, achieving state-of-the-art results.
Contribution
The paper proposes three components for temporal and entity-aware video understanding: a Diffusion Temporal Latent (DTL) encoder, object grounded representations, and a mixed token scheme with discrete temporal tokens.
Findings
Achieves state-of-the-art results on Charades STA, NExT GQA, and VideoQA benchmarks.
Demonstrates improved temporal boundary detection and entity alignment.
Enhances fine-grained temporal reasoning in long videos.
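For context on how temporal boundary detection is usually scored on grounding benchmarks such as Charades STA, here is a minimal sketch of temporal intersection-over-union (IoU) and recall at an IoU threshold. The function names and the 0.5 default threshold are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, both in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose top prediction reaches the IoU threshold.

    `threshold=0.5` is an assumed default for illustration only.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

A predicted segment of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15 second union, so it would not count as a hit at a 0.5 threshold.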
Abstract
Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak at capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides…
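The abstract is cut off before it describes the mixed token scheme, so the paper's actual design is not shown here. As a generic illustration of what discrete temporal tokens can look like, the sketch below quantizes timestamps into a fixed number of bins and interleaves the resulting tokens with frame placeholders. The `<T_k>` and `<frame_i>` token formats, the 100-bin default, and all function names are hypothetical.

```python
def time_to_token(t, duration, num_bins=100):
    """Map a timestamp in seconds to a discrete token such as '<T_50>'.

    The token format and bin count are assumptions for illustration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    frac = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<T_{idx}>"


def token_to_time(token, duration, num_bins=100):
    """Decode a '<T_k>' token back to the centre of its time bin, in seconds."""
    idx = int(token[3:-1])
    return (idx + 0.5) * duration / num_bins


def interleave(frame_times, duration, num_bins=100):
    """Build a mixed sequence: one time token before each frame placeholder."""
    seq = []
    for i, t in enumerate(frame_times):
        seq.append(time_to_token(t, duration, num_bins))
        seq.append(f"<frame_{i}>")  # hypothetical stand-in for visual tokens
    return seq
```

With explicit tokens of this kind, a model can emit a start and end token as its answer to a grounding query, and the decoding step recovers timestamps at a resolution of `duration / num_bins` seconds.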
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
