$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S

TL;DR
This paper introduces a large, diverse benchmark for evaluating vision-language models on Rebus Puzzles, and proposes a reasoning framework that improves model performance by 2.1-4.1% for closed-source models and 20-30% for open-source models.
Contribution
The paper presents $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a comprehensive Rebus Puzzle benchmark, and introduces RebusDescProgICE, a reasoning framework that enhances model accuracy on this task.
Findings
Benchmark contains 1,333 puzzles across 18 categories.
RebusDescProgICE improves model performance by 2.1-4.1% (closed-source) and 20-30% (open-source).
Models show improved understanding of complex, multi-step reasoning tasks.
Abstract
Understanding Rebus Puzzles (puzzles that use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of…
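The abstract names three ingredients: an unstructured description, code-based structured reasoning, and reasoning-based in-context example selection. The Python sketch below is one way these might fit together when assembling a few-shot prompt. It is not the authors' implementation: the `SolvedPuzzle` type, the token-overlap (Jaccard) similarity, the example pool, and the prompt layout are all assumptions made for illustration, and the paper's actual selection criterion and prompt format may differ.

```python
from dataclasses import dataclass


@dataclass
class SolvedPuzzle:
    """Hypothetical record for one solved puzzle in the example pool."""
    description: str  # unstructured, free-text description of the image
    reasoning: str    # code-like, structured reasoning steps
    answer: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings, in [0.0, 1.0]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_examples(query_description, pool, k=2):
    """Pick the k pool entries whose description and reasoning best match the query.

    Assumption: similarity is plain token overlap; this is a stand-in,
    not the selection method described in the paper.
    """
    ranked = sorted(
        pool,
        key=lambda p: jaccard(query_description, p.description + " " + p.reasoning),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_description, examples):
    """Assemble a few-shot prompt: solved examples first, then the open query."""
    parts = [
        f"Description: {ex.description}\nReasoning:\n{ex.reasoning}\nAnswer: {ex.answer}\n"
        for ex in examples
    ]
    parts.append(f"Description: {query_description}\nReasoning:\n")
    return "\n".join(parts)


# Illustrative pool of classic rebus puzzles (not taken from the benchmark).
POOL = [
    SolvedPuzzle(
        description="the word HEAD written above the word HEELS",
        reasoning="top = 'HEAD'\nbottom = 'HEELS'\nanswer = f'{top} over {bottom}'",
        answer="head over heels",
    ),
    SolvedPuzzle(
        description="the word TOWN written vertically downwards",
        reasoning="text = 'TOWN'\ndirection = 'down'\nanswer = direction + text.lower()",
        answer="downtown",
    ),
    SolvedPuzzle(
        description="a drawing of an eye next to a heart and a capital U",
        reasoning="symbols = ['eye', 'heart', 'U']\nanswer = 'I love you'",
        answer="I love you",
    ),
]

query = "the word MIND written above the word MATTER"
prompt = build_prompt(query, select_examples(query, POOL, k=2))
```

The resulting `prompt` string would then be sent, together with the puzzle image, to whichever vision-language model is under evaluation; that call is omitted here because it depends on the model's API.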
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
