Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi

TL;DR
This paper introduces STEMO-Bench, a new benchmark for evaluating spatio-temporal reasoning in video large language models, and proposes STEMO-Track, an object-centric framework that enhances reasoning accuracy and reduces hallucinations.
Contribution
The paper presents a novel benchmark for detailed spatio-temporal reasoning and an object-centric framework that improves model accuracy and consistency in video understanding tasks.
Findings
STEMO-Track significantly reduces hallucinated answers.
Models show improved spatio-temporal reasoning on STEMO-Bench.
Object-centric reasoning enhances video understanding accuracy.
Abstract
While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons…
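The distinction the abstract draws between genuine temporal understanding and coincidental correctness can be illustrated with a small scoring sketch. This is not the paper's metric, which the excerpt above does not specify. It assumes a hypothetical record per query (`QueryResult`, with a final-answer flag and a list of sub-question outcomes) and simply separates final answers backed by fully correct sub-questions from those that are not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    """Hypothetical record for one benchmark query (not from the paper)."""
    final_correct: bool        # was the final answer correct?
    sub_correct: List[bool]    # correctness of each decomposed sub-question


def final_accuracy(results: List[QueryResult]) -> float:
    """Fraction of queries whose final answer is correct."""
    if not results:
        return 0.0
    return sum(r.final_correct for r in results) / len(results)


def grounded_accuracy(results: List[QueryResult]) -> float:
    """Fraction of queries with a correct final answer AND all sub-questions correct."""
    if not results:
        return 0.0
    grounded = sum(r.final_correct and all(r.sub_correct) for r in results)
    return grounded / len(results)


def coincidental_rate(results: List[QueryResult]) -> float:
    """Fraction of queries answered correctly despite at least one failed sub-question."""
    if not results:
        return 0.0
    lucky = sum(r.final_correct and not all(r.sub_correct) for r in results)
    return lucky / len(results)


# Illustrative data only: four made-up queries.
example = [
    QueryResult(True, [True, True]),     # correct and grounded
    QueryResult(True, [True, False]),    # correct final answer, failed sub-question
    QueryResult(False, [True, True]),    # wrong final answer
    QueryResult(True, [True, True, True]),
]
```

Under this assumed scheme, a large gap between `final_accuracy` and `grounded_accuracy` would indicate that many final answers are reached without the intermediate facts being right, which is the kind of deficit a single final-answer evaluation would hide.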
