Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

Kyosuke Takami; Yuka Tateisi; Satoshi Sekine; Yusuke Miyao

arXiv:2605.11663·cs.CL·May 13, 2026

TL;DR

This paper introduces a large-scale, authentic multimodal educational dataset from Japan's national assessments, enabling evaluation of multimodal language models in real exam scenarios.

Contribution

It provides a unique, real-world benchmark with aggregated student responses, preserving authentic exam layouts and Japanese educational content for multimodal model evaluation.

Findings

01 Substantial variation in model accuracy across subjects.

02 Strong sensitivity of models to visual reasoning demands.

03 Human evaluation supports the reliability of automatic scoring.

Abstract

Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and…
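The abstract names exact-match accuracy and character-level F1 as the scoring metrics but does not define them here. The following is a minimal sketch of one common way to compute both, not the authors' implementation. It assumes that answers are compared after stripping surrounding whitespace and that character-level F1 is computed from multiset character overlap; the function names `exact_match` and `char_f1` are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the two answers are identical after trimming whitespace.

    Assumption: no further normalization (e.g. full-width/half-width folding)
    is applied; the paper may normalize differently.
    """
    return prediction.strip() == reference.strip()


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 based on multiset overlap of characters.

    Assumption: each Unicode character counts as one token, which suits
    Japanese text because no word segmentation is needed.
    """
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_chars or not ref_chars:
        return float(pred_chars == ref_chars)

    # Counter intersection keeps the minimum count of each shared character.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)
```

Character-level scoring gives partial credit to open-ended answers that share wording with the reference, whereas exact match only rewards identical strings, so the two metrics are usually reported side by side.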

Code & Models

Repositories

KyosukeTakami/gakucho-benchmark (GitHub)
