Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning

Ali Najar; Alireza Mirrokni; Arshia Izadyari; Sadegh Mohammadian; Amir Homayoon Sharifizade; Asal Meskin; Mobin Bagherian; Ehsaneddin Asgari

arXiv:2601.03400 · cs.CV · January 8, 2026

Open Access

TL;DR

Eye-Q introduces a challenging multilingual benchmark with visual word puzzles that test complex reasoning, abstraction, and cross-lingual understanding in vision-language models, exposing significant gaps in current capabilities.

Contribution

This paper presents Eye-Q, a novel multilingual benchmark for visual word puzzle solving that emphasizes reasoning and inference beyond surface recognition.

Findings

01. State-of-the-art models achieve only 60.27% accuracy on Eye-Q.

02. Models struggle with abstract and cross-lingual puzzles.

03. Current models lack flexible conceptual reasoning for image-to-phrase tasks.

Abstract

Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Neurobiology of Language and Bilingualism