PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

Yew Ken Chia; Vernon Toh Yan Han; Deepanway Ghosal; Lidong Bing; Soujanya Poria

arXiv:2403.13315·cs.CV·August 20, 2024·1 cites

TL;DR

PuzzleVQA introduces a challenging dataset of abstract pattern puzzles to evaluate and diagnose the reasoning and perception capabilities of large multimodal models, revealing significant limitations in their generalization and reasoning skills.

Contribution

The paper presents PuzzleVQA, a novel dataset of 2000 abstract pattern puzzles designed to evaluate and analyze the reasoning abilities of large multimodal models.

Findings

01. State-of-the-art models struggle with abstract pattern generalization.

02. GPT-4V scores only 46.4% on single-concept puzzles.

03. Main bottlenecks are visual perception and inductive reasoning abilities.

Abstract

Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, GPT-4V achieves a score of 46.4% on single-concept puzzles, which shows that state-of-the-art models struggle on our dataset. To diagnose the reasoning challenges in large…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications