ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Jonathan Roberts; Mohammad Reza Taesiri; Ansh Sharma; Akash Gupta; Samuel Roberts; Ioana Croitoru; Simion-Vlad Bogolin; Jialu Tang; Florian Langer; Vyas Raina; Vatsal Raina; Hanyi Xiong; Vishaal Udandarao; Jingyi Lu; Shiyang Chen; Sam Purkis; Tianshuo Yan; Wenye Lin; Gyungin Shin; Qiaochu Yang; Anh Totti Nguyen; David I. Atkinson; Aaditya Baranwal; Alexandru Coca; Mikah Dang; Sebastian Dziadzio; Jakob D. Kunz; Kaiqu Liang; Alexander Lo; Brian Pulfer; Steven Walton; Charig Yang; Kai Han; Samuel Albanie

arXiv:2502.09696·cs.CV·March 7, 2025

TL;DR

ZeroBench is a challenging visual reasoning benchmark designed to be impossible for current large multimodal models, highlighting their limitations and encouraging future improvements in visual understanding.

Contribution

We introduce ZeroBench, a novel lightweight benchmark whose questions are impossible for current LMMs, and provide a comprehensive evaluation and analysis of their failures.

Findings

01. All evaluated LMMs scored 0.0% on ZeroBench.

02. ZeroBench remains impossible despite ongoing model progress.

03. The benchmark is publicly available to foster future research.

Abstract

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
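The abstract reports a single headline score per model but does not spell out the scoring rule. As a minimal sketch, assuming answers are compared by exact match after trimming whitespace and lowercasing (an assumption, not something stated here), a percentage score could be computed as follows. The function name `exact_match_accuracy` and the toy data are hypothetical and are not taken from ZeroBench.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions equal to the reference answer.

    Assumption: matching is exact after stripping whitespace and lowercasing.
    The benchmark's real scoring rule may differ.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy data for illustration only, not ZeroBench questions.
toy_answers = ["42", "7", "blue"]
toy_predictions = ["41", "8", "red"]
print(exact_match_accuracy(toy_predictions, toy_answers))  # 0.0
```

Under this kind of all-or-nothing rule, a model that gets every final answer wrong scores 0.0% regardless of how close its intermediate reasoning was, which is one reason a set of easier subquestions can be useful for finer-grained comparison.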

Code & Models

Datasets

jonathan-roberts1/zerobench
dataset · 921 downloads

Taxonomy

Topics: Semantic Web and Ontologies