Anticipating Object State Changes in Long Procedural Videos

Victoria Manousaki; Konstantinos Bacharidis; Filippos Gouidis; Konstantinos Papoutsakis; Dimitris Plexousakis; Antonis Argyros

arXiv:2405.12789 · cs.CV · December 3, 2024

Open Access

TL;DR

This paper introduces the new problem of anticipating object state changes in procedural videos, provides curated annotation data, and proposes a framework that integrates visual and language features to predict future object states.

Contribution

The paper presents the first method for object state change anticipation, extends the Ego4D dataset with new annotations, and demonstrates the effectiveness of combining visual and natural-language cues.

Findings

01

Proposed method accurately predicts future object state changes.

02

Integration of visual and language features improves prediction performance.

03

New annotated dataset (Ego4D-OSCA) supports future research.

Abstract

In this work, we introduce (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) new curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem. Solutions to this new task have important implications in vision-based scene understanding, automated monitoring systems, and action planning. The proposed novel framework predicts object state changes that will occur in the near future due to yet-unseen human actions by integrating learned visual features that represent recent visual information with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person perspective videos across numerous interaction…
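
As a rough illustration of the idea of combining recent visual features with language features that summarize past state changes and actions, the following minimal sketch concatenates the two feature vectors and applies a single linear layer with a softmax over candidate state-change classes. It is not the authors' architecture: the function names, feature dimensions, weights, and the number of classes are all hypothetical and chosen only to make the example runnable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_predict(visual_feat, language_feat, weights, bias):
    """Hypothetical late fusion: concatenate both feature vectors,
    apply one linear layer, and return class probabilities."""
    fused = list(visual_feat) + list(language_feat)
    for row in weights:
        if len(row) != len(fused):
            raise ValueError("each weight row must match the fused feature length")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy inputs (assumed values, not taken from the paper).
visual_feat = [0.2, 0.8]          # stands in for learned visual features
language_feat = [1.0, 0.0, 0.5]   # stands in for embedded past actions/state changes
weights = [
    [0.5, -0.2, 0.1, 0.0, 0.3],
    [-0.1, 0.4, 0.0, 0.2, -0.3],
    [0.0, 0.0, 0.2, -0.1, 0.1],
]
bias = [0.0, 0.1, -0.1]

probs = fuse_and_predict(visual_feat, language_feat, weights, bias)
# Index of the most probable (hypothetical) state-change class.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In a real system the two inputs would come from trained video and text encoders and the fusion layer would be learned; the sketch only shows how the two modalities can feed a single prediction.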

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection