OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che

TL;DR
OMIBench is a new benchmark that evaluates large vision-language models on multi-image reasoning problems drawn from science Olympiads, and it exposes clear performance gaps in current models.
Contribution
It introduces a benchmark of multi-image reasoning problems from biology, chemistry, mathematics, and physics Olympiads, with manually annotated rationales and evaluation protocols for both exact and semantic answer matching.
Findings
Even the strongest LVLMs, such as Gemini-3-Pro, reach only about 50% on OMIBench.
OMIBench reveals significant performance gaps in current models.
Benchmark covers biology, chemistry, mathematics, and physics Olympiad problems.
Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying…
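The abstract mentions evaluation protocols for both exact and semantic answer matching but does not describe them here. The sketch below is only an illustration of how such a two-stage check is commonly structured, not the authors' protocol. All names (normalize, exact_match, token_f1, is_correct, accuracy) and the 0.75 threshold are hypothetical. The fallback uses token-overlap F1, which is a lexical stand-in rather than true semantic matching; a real semantic check would more likely plug in a model-based judge through the optional judge argument.

```python
import re
import unicodedata
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def normalize(answer: str) -> str:
    """Unify Unicode forms, lowercase, collapse whitespace, trim edge punctuation."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .,;:")


def exact_match(pred: str, gold: str) -> bool:
    """Exact matching after normalization (hypothetical protocol)."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1: a crude lexical proxy, NOT real semantic matching."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return 0.0
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def is_correct(
    pred: str,
    gold: str,
    judge: Optional[Callable[[str, str], bool]] = None,
    threshold: float = 0.75,  # assumed cutoff, chosen only for illustration
) -> bool:
    """Try exact matching first, then fall back to a judge or the lexical proxy."""
    if exact_match(pred, gold):
        return True
    if judge is not None:
        return judge(pred, gold)  # e.g. a model-based equivalence check
    return token_f1(pred, gold) >= threshold


def accuracy(pairs: Iterable[Tuple[str, str]], judge=None) -> float:
    """Fraction of (prediction, reference) pairs judged correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, g, judge) for p, g in pairs) / len(pairs)
```

Under these assumptions, accuracy([("42", "42"), ("41", "42")]) counts one of two answers as correct. The normalization does not handle numeric equivalence such as "3.5" versus "3.50", or unit conversion, which Olympiad answers would likely need.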
