What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Letian Zhang; Xiaotong Zhai; Zhongkai Zhao; Yongshuo Zong; Xin Wen; Bingchen Zhao

arXiv:2310.06627 · cs.CL · April 17, 2024

TL;DR

This paper investigates the counterfactual reasoning abilities of multi-modal language models using a new dataset, revealing significant performance gaps compared to human reasoning and providing a benchmark for future improvements.

Contribution

Introduces the C-VQA dataset to evaluate counterfactual reasoning in vision-language models and demonstrates current models' substantial performance drops on this benchmark.

Findings

01. Models show a performance decrease of up to 40% on counterfactual questions.

02. Current models lag significantly behind human-level reasoning.

03. The dataset provides a new benchmark for evaluating counterfactual reasoning in multi-modal models.
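
The performance decrease in the first finding can be illustrated with a minimal sketch. It assumes the drop is measured as the difference in exact-match accuracy between original questions and their counterfactual variants; this page does not say whether the paper reports an absolute or a relative drop, so that choice is an assumption. The record fields, function names, and toy questions below are hypothetical and are not taken from C-VQA or its repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One hypothetical evaluation record (field names are assumptions)."""
    question: str
    prediction: str
    answer: str
    counterfactual: bool  # True if the question carries a counterfactual presupposition


def accuracy(items: List[Item]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        i.prediction.strip().lower() == i.answer.strip().lower() for i in items
    )
    return correct / len(items)


def performance_drop(items: List[Item]) -> Tuple[float, float, float]:
    """Return (original accuracy, counterfactual accuracy, absolute drop)."""
    original = [i for i in items if not i.counterfactual]
    modified = [i for i in items if i.counterfactual]
    acc_orig = accuracy(original)
    acc_cf = accuracy(modified)
    return acc_orig, acc_cf, acc_orig - acc_cf


# Made-up toy records for illustration only; not real C-VQA data or model outputs.
toy_items = [
    Item("How many chairs are in the image?", "4", "4", False),
    Item("Is the TV on?", "yes", "yes", False),
    Item("How many chairs would there be if two were removed?", "4", "2", True),
    Item("Would the TV be on if it were unplugged?", "no", "no", True),
]

acc_orig, acc_cf, drop = performance_drop(toy_items)
print(f"original={acc_orig:.2f} counterfactual={acc_cf:.2f} drop={drop:.2f}")
```

Scoring the two question sets separately keeps the comparison paired: the same scoring rule is applied to both, so the difference reflects only the added counterfactual presupposition.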

Abstract

Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models, we explore their effectiveness in counterfactual reasoning. To facilitate this investigation, we introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions, spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data, representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial…

Code & Models

Repositories

letian2003/c-vqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection