JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Issa Sugiura; Koki Maeda; Shuhei Kurita; Yusuke Oda; Daisuke Kawahara; Naoaki Okazaki

arXiv:2604.00909 · cs.CV · April 7, 2026

TL;DR

JAMMEval is a systematically refined collection of Japanese vision-language benchmarks designed to improve evaluation reliability and better reflect model capabilities.

Contribution

The paper introduces a refinement process for Japanese VQA benchmarks: seven existing datasets are systematically revised through two rounds of human annotation, improving both data quality and evaluation reliability.

Findings

01. Refined benchmarks yield more accurate model evaluation scores.

02. Evaluation scores show lower variance and better distinguish model capabilities.

03. The dataset and code are publicly released for community use.

Abstract

Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese…
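
One of the issues the abstract names is instances that can be solved without visual grounding. The sketch below illustrates the idea with an automated check: an item is flagged when a text-only answerer reproduces the gold answer from the question alone. This is not the authors' procedure (the paper relies on human annotation). The function names, the item fields (`id`, `question`, `answer`), the stub model, and the toy items are all hypothetical and are not taken from JAMMEval.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and strip whitespace so trivially different strings compare equal."""
    return text.strip().lower()


def flag_blind_solvable(
    items: Iterable[dict],
    answer_without_image: Callable[[str], str],
) -> list[str]:
    """Return ids of items whose gold answer is reproduced from the question text alone."""
    flagged = []
    for item in items:
        prediction = answer_without_image(item["question"])
        if normalize(prediction) == normalize(item["answer"]):
            flagged.append(item["id"])
    return flagged


# Invented toy items for illustration only (assumed schema, not JAMMEval data).
toy_items = [
    {"id": "q1", "question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"id": "q2", "question": "What color is the sign in the image?", "answer": "Red"},
]


def stub_text_only_model(question: str) -> str:
    # Hypothetical stand-in for a text-only model: knows one fact, otherwise abstains.
    return "tokyo" if "capital of Japan" in question else "unknown"


flagged_ids = flag_blind_solvable(toy_items, stub_text_only_model)
print(flagged_ids)  # ['q1']
```

Exact-match comparison is the simplest possible criterion; free-form answers would need a more tolerant matcher, and a flagged item is only a candidate for review, not proof that the image is unnecessary.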

Code & Models

Models

llm-jp/llm-jp-4-vl-9b-beta
model · 2.4k downloads · 12 likes

Datasets

llm-jp/JAMMEval
dataset · 754 downloads
