TL;DR
RebusBench is a new benchmark that evaluates whether vision-language models can carry out the complex, multi-step reasoning required to solve rebus puzzles, and it highlights current models' deficiencies in this area.
Contribution
The paper introduces RebusBench, a benchmark of 1,164 puzzles, and evaluates state-of-the-art models, revealing significant gaps in their cognitive visual reasoning capabilities.
Findings
The evaluated models score below 10% exact match on RebusBench.
Scaling models and in-context learning do not significantly improve performance.
Current models lack the cognitive reasoning ability to connect perception and knowledge.
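The exact-match figure above can be made concrete with a minimal sketch. The summary does not say how answers are normalized before comparison, so the `normalize` helper below (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption, and the function names and sample answers are hypothetical illustrations rather than the paper's evaluation code or data.

```python
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and gold answers, for illustration only.
    preds = ["Piece of cake!", "under the weather"]
    golds = ["piece of cake", "over the moon"]
    print(exact_match(preds, golds))  # 0.5
```

Exact match gives no partial credit: a prediction that recovers half of an idiom scores the same as an unrelated answer, which is one reason scores on this kind of task can sit very low.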
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our…
