Joint Discovery of Object States and Manipulation Actions
Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

TL;DR
This paper presents a joint model that automatically discovers object states and manipulation actions from videos, improving understanding of object transformations without extra supervision.
Contribution
It introduces a novel joint learning framework that simultaneously identifies object states and actions in videos, leveraging temporal order constraints and new optimization techniques.
Findings
The method discovers seven manipulation actions and their corresponding object states on a new dataset of real-life object manipulation videos.
Joint modeling improves the accuracy of both object state discovery and action recognition.
The approach needs no additional supervision; it relies on an assumed consistent temporal order of object state changes and manipulation actions.
Abstract
Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations.…
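The temporal order assumption in the abstract can be illustrated with a small sketch. This is not the paper's discriminative clustering cost or its optimization procedure; it is a simplified stand-in that assumes per-time-step scores for the initial state, the action, and the final state are already available, and picks one time step for each so that the initial state comes before the action and the final state comes after it. The function name `best_ordered_triple` and the score lists are hypothetical, introduced only for illustration.

```python
def best_ordered_triple(state1_scores, action_scores, state2_scores):
    """Pick indices t1 < ta < t2 maximizing
    state1_scores[t1] + action_scores[ta] + state2_scores[t2].

    Hypothetical helper: a toy version of an ordering constraint
    (initial state, then action, then final state), not the paper's model.
    """
    n = len(action_scores)
    if not (len(state1_scores) == len(state2_scores) == n) or n < 3:
        raise ValueError("need three equal-length score lists with at least 3 entries")

    # prefix_best[t] = index of the highest state-1 score among positions 0..t
    prefix_best = [0] * n
    for t in range(1, n):
        prev = prefix_best[t - 1]
        prefix_best[t] = t if state1_scores[t] > state1_scores[prev] else prev

    # suffix_best[t] = index of the highest state-2 score among positions t..n-1
    suffix_best = [n - 1] * n
    for t in range(n - 2, -1, -1):
        nxt = suffix_best[t + 1]
        suffix_best[t] = t if state2_scores[t] > state2_scores[nxt] else nxt

    # Try every action position that leaves room for a state on each side.
    best = None
    for ta in range(1, n - 1):
        t1 = prefix_best[ta - 1]   # best initial state strictly before the action
        t2 = suffix_best[ta + 1]   # best final state strictly after the action
        total = state1_scores[t1] + action_scores[ta] + state2_scores[t2]
        if best is None or total > best[0]:
            best = (total, t1, ta, t2)
    return best
```

The function returns a tuple `(total, t1, ta, t2)`. The ordering matters when the individually best scores are out of order: if the highest final-state score occurs before the highest initial-state score, an unconstrained per-label argmax would pick an impossible sequence, while the constrained search falls back to the best consistent one.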
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
