ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

Wenliang Guo and Yu Kong

arXiv:2512.03479 · cs.CV · May 11, 2026

TL;DR

ProcObject-10K is a new benchmark for evaluating object-centric reasoning and temporal grounding in instructional videos, revealing significant gaps in current models' understanding of object dynamics.

Contribution

This work introduces the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, along with a fine-tuning baseline that improves model performance and transfers to related tasks.

Findings

01

Models produce plausible answers but poorly localize supporting evidence (mIoU < 45%)

02

Fine-tuning on ProcObject-10K enhances performance on related tasks

03

Benchmark exposes reliance on linguistic priors over object dynamics

Abstract

Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than…
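The grounding score quoted in the abstract, mIoU, can be illustrated with a small sketch. The excerpt does not define the metric, so this assumes the common temporal-grounding convention: each prediction and each ground-truth annotation is a single (start, end) interval in seconds, and mIoU is the average intersection-over-union across QA pairs. The function names `temporal_iou` and `mean_iou` are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple

# An interval is (start_sec, end_sec). This is an assumed format; the
# benchmark's actual annotation schema is not given in the excerpt.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length, clamped to zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Toy values for illustration only; these are not benchmark data.
    predictions = [(0.0, 10.0), (0.0, 4.0)]
    ground_truth = [(5.0, 15.0), (0.0, 4.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

Under this definition, an answer can be textually correct while its predicted interval barely overlaps the annotated evidence, which is the kind of answering-grounding gap the abstract describes.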
