ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
Wenliang Guo and Yu Kong

TL;DR
ProcObject-10K is a new benchmark for evaluating object-centric reasoning and temporal grounding in instructional videos, revealing significant gaps in current models' understanding of object dynamics.
Contribution
This work introduces the first benchmark focused on object-centric procedural reasoning in videos, along with a fine-tuning baseline that improves model performance and transferability.
Findings
Models produce plausible answers but fail to localize the supporting evidence (mIoU < 45%)
Fine-tuning on ProcObject-10K enhances performance on related tasks
Benchmark exposes reliance on linguistic priors over object dynamics
Abstract
Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than…
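The grounding score quoted above (mIoU) is commonly computed as the mean temporal intersection-over-union between predicted and annotated evidence segments. The sketch below illustrates that common definition only; it assumes each question has a single ground-truth interval in seconds, and the function names `temporal_iou` and `mean_iou` are hypothetical rather than taken from the benchmark's evaluation code, which may differ in detail.

```python
from typing import Sequence, Tuple

# A temporal segment as (start_seconds, end_seconds). Assumed format, not the benchmark's.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground-truth segments."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # One exact match and one complete miss average to 0.5.
    print(mean_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 10.0), (20.0, 30.0)]))
```

Under this definition, a model can answer a question correctly while still scoring near zero on grounding if the segment it points to does not overlap the annotated evidence, which is the kind of gap the abstract describes.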
