Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee; Jiaxin Ge; Tsung-Han Wu; Minwoo Kang; Trevor Darrell; David M. Chan

arXiv:2505.23759 · cs.CL · September 18, 2025

TL;DR

This paper evaluates how well current vision-language models (VLMs) interpret rebus puzzles. The models decode simple visual clues reasonably well but struggle significantly with abstract reasoning and visual metaphors.

Contribution

The paper introduces a hand-generated, annotated benchmark of diverse English-language rebus puzzles and analyzes how different VLMs perform on it, highlighting their limitations in complex multi-modal reasoning tasks.

Findings

01. VLMs perform well on simple visual clues.

02. VLMs struggle with abstract reasoning and metaphors.

03. The benchmark reveals gaps in current models' capabilities.

Abstract

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral…

Code & Models

Repositories

kyunnilee/visual_puzzles (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Neurobiology of Language and Bilingualism