CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination
Hyounghun Kim, Abhay Zala, Mohit Bansal

TL;DR
This paper introduces CoSIm, a new dataset and task for evaluating whether AI systems can imagine a counterfactual change to a scene and reason about its consequences. Baseline results show that the task is challenging for current models and leave clear room for improvement.
Contribution
The paper presents a novel dataset and task for counterfactual scene reasoning, together with baseline models and an analysis of human versus model performance.
Findings
Large human-model performance gap identified
Baseline vision-language Transformer achieves limited accuracy
Dataset includes diverse complex scene change scenarios
Abstract
As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question…
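To make the task setup concrete, the following is a minimal sketch of how one instance and an accuracy check might be represented. It is not the authors' data format or evaluation code: the class name, field names, file name, and example text are all hypothetical, and framing prediction as a choice among candidate responses is an assumption made here for illustration, since the abstract excerpt above does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class CounterfactualInstance:
    """One illustrative task instance (all field names are hypothetical)."""
    image_path: str                 # the image the question is about
    question: str                   # initial commonsense question
    initial_response: str           # response before the scene change
    scene_change: str               # counterfactual change, given as text
    candidate_responses: List[str]  # assumed: candidate new responses
    answer_index: int               # index of the correct new response


def accuracy(
    instances: Sequence[CounterfactualInstance],
    predict: Callable[[CounterfactualInstance], int],
) -> float:
    """Fraction of instances where predict() returns the correct index."""
    if not instances:
        return 0.0
    correct = sum(1 for inst in instances if predict(inst) == inst.answer_index)
    return correct / len(instances)


# Illustrative instance loosely based on the abstract's rain-cloud example.
example = CounterfactualInstance(
    image_path="street.jpg",  # hypothetical file name
    question="Will the street stay dry?",
    initial_response="Yes, it is a sunny day.",
    scene_change="The sun is covered by rain clouds.",
    candidate_responses=[
        "Yes, it is a sunny day.",
        "No, the street will get wet.",
    ],
    answer_index=1,
)


def always_first(inst: CounterfactualInstance) -> int:
    """Trivial baseline that ignores the scene change entirely."""
    return 0


if __name__ == "__main__":
    print(accuracy([example], always_first))
```

A model that ignores the scene change and repeats the initial response, like `always_first` here, gets this instance wrong, which is the behavior the task is meant to expose.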
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
Methods: Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Attention Is All You Need · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Adam
