TL;DR
TOC-Bench is a new benchmark designed to evaluate and diagnose the ability of Video-LLMs to maintain temporal object consistency across complex scenarios, revealing key weaknesses in current models.
Contribution
The paper introduces TOC-Bench, a structured, human-verified benchmark for assessing temporal object consistency in Video-LLMs, with a novel filtering protocol to ensure temporal dependency.
Findings
Current Video-LLMs struggle with object identity and event ordering.
Temporal object consistency is a major unresolved challenge for Video-LLMs.
TOC-Bench reveals weaknesses in event counting, ordering, and hallucination-aware reasoning.
Abstract
Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer…
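To make "object-track grounded" concrete, the sketch below shows one way a per-frame trajectory could be turned into a temporal event timeline. It is an illustration only, not the benchmark's actual schema: the names `ObjectTrack`, `TrackEvent`, and `timeline`, the box format, and the event labels (`appear`, `disappear`, `reappear`) are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class TrackEvent:
    frame: int  # frame index at which the event is first observed
    kind: str   # illustrative labels: "appear", "disappear", "reappear"


@dataclass
class ObjectTrack:
    object_id: str
    # Frame index -> box. A frame with no entry means the object is not visible.
    boxes: Dict[int, Box] = field(default_factory=dict)

    def timeline(self, num_frames: int) -> List[TrackEvent]:
        """Derive visibility events by scanning the frames in temporal order."""
        events: List[TrackEvent] = []
        visible_prev = False
        seen_before = False
        for frame in range(num_frames):
            visible = frame in self.boxes
            if visible and not visible_prev:
                # First sighting is "appear"; later sightings are "reappear".
                kind = "reappear" if seen_before else "appear"
                events.append(TrackEvent(frame, kind))
                seen_before = True
            elif not visible and visible_prev:
                events.append(TrackEvent(frame, "disappear"))
            visible_prev = visible
        return events


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    # The object is visible in frames 0-1, hidden in 2-3, visible again in 4-5.
    track = ObjectTrack("cup_1", {0: box, 1: box, 4: box, 5: box})
    for event in track.timeline(num_frames=6):
        print(event.frame, event.kind)
```

Because the events are derived by scanning frames in order, a question such as "how many times did the object reappear?" cannot be answered from a single frame or from shuffled frames, which is the kind of temporal dependency the abstract describes.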
