ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions
Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral

TL;DR
ActionCOMET is a zero-shot framework that enables machines to infer detailed commonsense knowledge about actions in images, such as goals, effects, and sequences, using a new multi-modal dataset and language models.
Contribution
The paper introduces a novel multi-modal dataset of images and inferences about actions, and proposes ActionCOMET, a zero-shot approach leveraging language models for action understanding.
Findings
ActionCOMET outperforms existing VQA methods on the dataset.
The dataset contains 8.5k images and 59.3k inferences.
Baseline results demonstrate the effectiveness of the zero-shot approach.
Abstract
Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about them beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potatoes being golden and crispy after frying), identifying high-level goals associated with the action (e.g. beating the eggs to make an omelet), and reasoning about actions that possibly precede or follow the current action (e.g. cracking eggs before whisking or draining pasta after boiling). A similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn the aforementioned concepts about actions being performed in…
Taxonomy
Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
