ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

Shailaja Keyur Sampat; Yezhou Yang; Chitta Baral

arXiv:2410.13662·cs.CV·October 18, 2024

TL;DR

ActionCOMET is a zero-shot framework that enables machines to infer detailed commonsense knowledge about actions in images, such as goals, effects, and sequences, using a new multi-modal dataset and language models.

Contribution

The paper introduces a novel multi-modal dataset of images and inferences about actions, and proposes ActionCOMET, a zero-shot approach leveraging language models for action understanding.

Findings

1. ActionCOMET outperforms existing VQA methods on the dataset.
2. The dataset contains 8.5k images and 59.3k inferences.
3. Baseline results demonstrate the effectiveness of the zero-shot approach.

Abstract

Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potatoes being golden and crispy after frying), high-level goals associated with the action (e.g. beat the eggs to make an omelet) and reasoning about actions that possibly precede or follow the current action (e.g. cracking eggs before whisking or draining pasta after boiling). Similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn aforementioned concepts about actions being performed in…

Code & Models

Repositories

shailaja183/actionconceptlearning (PyTorch, official)

Taxonomy

Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications