JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Atsuyuki Miyai; Shota Onohara; Jeonghun Baek; Kiyoharu Aizawa

arXiv:2512.14620 · cs.CL · December 17, 2025

TL;DR

JMMMU-Pro is a Japanese multimodal benchmark that composes each question's image and text into a single image, so models must read the question through visual perception. Open-source LMMs perform poorly on it.

Contribution

The paper introduces JMMMU-Pro, a novel high-quality Japanese multimodal benchmark constructed via a scalable human-in-the-loop image generation process.

Findings

01. Open-source LMMs perform poorly on JMMMU-Pro

02. Vibe Benchmark Construction enables efficient benchmark creation

03. JMMMU-Pro provides a rigorous evaluation of Japanese multimodal understanding

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range…
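The generate, verify, and regenerate loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: `Candidate`, `build_visual_question`, `generate`, `review`, and `max_rounds` are hypothetical names, and no real image-model API is called. The image generator and the human reviewer are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Candidate:
    """One generated visual question (hypothetical container)."""
    question_id: str
    prompt: str
    image: bytes


# Hypothetical signatures: the generator maps a prompt to image bytes;
# the reviewer returns (accepted, adjusted_prompt_or_None).
Generator = Callable[[str], bytes]
Reviewer = Callable[[Candidate], Tuple[bool, Optional[str]]]


def build_visual_question(
    question_id: str,
    base_prompt: str,
    generate: Generator,
    review: Reviewer,
    max_rounds: int = 3,  # assumed retry budget, not from the paper
) -> Optional[Candidate]:
    """Generate a candidate, let a human verify it, and regenerate
    with the reviewer's adjusted prompt until it is accepted."""
    prompt = base_prompt
    for _ in range(max_rounds):
        candidate = Candidate(question_id, prompt, generate(prompt))
        accepted, adjusted_prompt = review(candidate)
        if accepted:
            return candidate
        # Keep the old prompt if the reviewer gave no adjustment.
        prompt = adjusted_prompt or prompt
    return None  # budget exhausted; the item needs manual handling


def demo_generate(prompt: str) -> bytes:
    # Stand-in for an image model: just echoes the prompt as bytes.
    return prompt.encode("utf-8")


def demo_review(candidate: Candidate) -> Tuple[bool, Optional[str]]:
    # Stand-in for a human check: accept once the prompt asks for legible text.
    if b"legible" in candidate.image:
        return True, None
    return False, candidate.prompt + " legible"
```

With the stand-in callables, `build_visual_question("q1", "draw", demo_generate, demo_review)` is rejected once and accepted on the second round with the adjusted prompt. Injecting the generator and reviewer keeps the loop independent of any particular image model or annotation tool.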

Code & Models

Datasets

JMMMU/JMMMU-Pro
dataset · 348 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques