JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation
Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki

TL;DR
JAMMEval is a systematically refined collection of Japanese vision-language benchmarks designed to improve evaluation reliability and better reflect model capabilities.
Contribution
It refines seven existing Japanese VQA benchmark datasets through two rounds of human annotation, improving data quality and evaluation reliability.
Findings
Refined benchmarks yield more accurate model evaluation scores.
Evaluation scores show lower variance and better distinguish model capabilities.
The dataset and code are publicly released for community use.
Abstract
Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese…
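One of the issues named in the abstract, instances that can be solved without visual grounding, can be illustrated with a simple "blind" check: ask a text-only model the question without the image and flag items it still answers correctly. The sketch below is a minimal illustration under stated assumptions, not the authors' annotation procedure. `VQAItem`, `flag_blind_solvable`, and `toy_text_only_model` are hypothetical names, and exact match after whitespace normalization is an assumed scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VQAItem:
    """Hypothetical record for one VQA instance."""
    question: str
    answer: str
    image_path: str  # deliberately unused by the blind check


def normalize(text: str) -> str:
    """Strip all whitespace and lowercase (assumed matching rule)."""
    return "".join(text.split()).lower()


def flag_blind_solvable(
    items: List[VQAItem],
    text_only_answer: Callable[[str], str],
) -> List[VQAItem]:
    """Return items whose gold answer is reproduced from the question alone."""
    return [
        item
        for item in items
        if normalize(text_only_answer(item.question)) == normalize(item.answer)
    ]


def toy_text_only_model(question: str) -> str:
    """Toy stand-in for a text-only model; a real check would call an LLM."""
    known = {"日本の首都はどこですか？": "東京"}
    return known.get(question, "わかりません")


items = [
    # Answerable from general knowledge, so the image is not needed.
    VQAItem("日本の首都はどこですか？", "東京", "img_001.jpg"),
    # Requires reading the sign in the image.
    VQAItem("この看板には何と書かれていますか？", "営業中", "img_002.jpg"),
]

flagged = flag_blind_solvable(items, toy_text_only_model)
for item in flagged:
    print("candidate for human review:", item.image_path)
```

Flagged items are only candidates for review: a text-only model can also guess correctly by chance, so a human annotator would still need to confirm whether the image is actually required.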
