JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

Shota Onohara; Atsuyuki Miyai; Yuki Imajuku; Kazuki Egashira; Jeonghun Baek; Xiang Yue; Graham Neubig; Kiyoharu Aizawa

arXiv:2410.17250·cs.CL·March 20, 2025

TL;DR

JMMMU is a Japanese benchmark that evaluates large multimodal models (LMMs) on both Japanese-language ability and Japanese cultural understanding, exposing gaps in current models on each.

Contribution

This work introduces the first large-scale Japanese multimodal benchmark for expert-level tasks. It is split into a culture-agnostic subset and a culture-specific subset, so that gaps in language ability and gaps in cultural understanding can be examined separately.

Findings

01

Many LMMs score lower on the culture-agnostic subset when evaluated in Japanese than on its English counterpart, MMMU

02

Inadequate Japanese cultural understanding revealed by culture-specific subset

03

Some models handle the Japanese language well but perform poorly on culture-specific tasks

Abstract

Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, where culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation. Using the CS subset,…
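The one-to-one comparison that the CA subset enables can be illustrated with a minimal sketch. It assumes each translated Japanese question shares an identifier with its English original; the `Result` record, the `language_gap` helper, and the toy data are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical record: one model answer on one benchmark question.
    question_id: str  # assumed to be shared by an English item and its translation
    correct: bool


def accuracy(results):
    """Fraction of correctly answered questions."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.correct for r in results) / len(results)


def language_gap(english, japanese):
    """English accuracy minus Japanese accuracy over the paired questions only."""
    en = {r.question_id: r for r in english}
    ja = {r.question_id: r for r in japanese}
    shared = sorted(en.keys() & ja.keys())
    if not shared:
        raise ValueError("no paired questions")
    en_acc = accuracy([en[q] for q in shared])
    ja_acc = accuracy([ja[q] for q in shared])
    return en_acc - ja_acc


# Toy data for illustration only; these are not results from the paper.
english = [Result("q1", True), Result("q2", True), Result("q3", False), Result("q4", True)]
japanese = [Result("q1", True), Result("q2", False), Result("q3", False), Result("q4", True)]

gap = language_gap(english, japanese)
print(f"accuracy drop: {gap:.2f}")  # prints "accuracy drop: 0.25"
```

Restricting the comparison to paired questions keeps the question content fixed across the two languages, which is what allows a drop to be attributed to language rather than to a difference in content.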

Code & Models

Models

sbintuitions/sarashina2.2-vision-3b
model · 824 dl · ♡ 16

Datasets

JMMMU/JMMMU
dataset · 1.5k dl

Videos

JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

Taxonomy

Topics: EFL/ESL Teaching and Learning