Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding
Jinghan Zhao, Yifei Huang, Feng Lu

TL;DR
This paper introduces a novel hierarchical framework for learning procedural video representations by grounding abstract tasks and steps in observable object states, improving performance on various recognition and prediction tasks.
Contribution
The paper proposes the Task-Step-State (TSS) framework and a progressive pre-training strategy to better align video representations with concrete visual states, advancing procedural understanding in videos.
Findings
The method outperforms baseline models on the COIN and CrossTask datasets.
State supervision significantly improves task and step recognition.
Progressive pre-training outperforms standard joint training.
Abstract
Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, 'task' and 'step' descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce 'states', i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive…
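The hierarchy described in the abstract (tasks achieved via steps, with each step driving a transition between observable states) can be pictured as a small data structure. The sketch below is only an illustration under assumptions: the class names, field names, the `state_chain` helper, and the lemon example are all hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    """Textual snapshot of an object configuration (hypothetical field name)."""
    description: str


@dataclass
class Step:
    """A step that drives a transition from one observable state to another."""
    description: str
    before: State
    after: State


@dataclass
class Task:
    """A task achieved via an ordered sequence of steps."""
    description: str
    steps: List[Step] = field(default_factory=list)

    def state_chain(self) -> List[str]:
        """List the state descriptions visited in order, merging a step's
        start state with the previous step's end state when they coincide."""
        chain: List[str] = []
        for step in self.steps:
            if not chain or chain[-1] != step.before.description:
                chain.append(step.before.description)
            chain.append(step.after.description)
        return chain


# Illustrative data invented for this sketch (not from the paper's datasets).
whole = State("whole lemon on the cutting board")
halved = State("lemon cut into two halves")
juiced = State("lemon juice collected in a glass")

task = Task(
    "make lemon juice",
    [
        Step("cut the lemon", whole, halved),
        Step("squeeze the lemon", halved, juiced),
    ],
)

for description in task.state_chain():
    print(description)
```

Here the abstract task and step labels sit above a chain of concrete state descriptions, which is the layer the abstract argues is easier to align with what is visible in the video.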
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
