JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Yue Xun; Junyu Liu; Qian Niu; Xinyi Wang; Zheng Yuan; Zirui Li; Zequn Zhang; Bowen Zhao; Shujun Wang; Irene Li; Kan Hatakeyama-Sato; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2605.22080 · cs.CV · May 22, 2026

TL;DR

JMed48k is a Japanese medical licensing benchmark for evaluating vision-language models across multiple healthcare professions. It combines exam questions and images with a paired image-removal audit that tests whether models actually use the visual evidence.

Contribution

The paper introduces JMed48k, a benchmark of 48,862 questions and 20,142 images drawn from 11 Japanese national healthcare licensing examinations, together with a recent five-year evaluation subset (JMed48k-Eval) and a paired image-removal audit for vision-language models.

Findings

01

Models benefit substantially from visual content, and proprietary models benefit the most.

02

Medical-specific models make limited use of visual evidence when answering.

03

The impact of image removal varies across professions, ranging from +5.7 to +39.8 points.

Abstract

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four…
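The paired comparison described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the record type `PairedResult`, the function `paired_audit`, and the profession labels are hypothetical names chosen for illustration, and the sketch assumes each image question has already been graded twice for the same model, once with its image and once with the image removed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One image question graded twice for the same model (hypothetical record)."""
    question_id: str
    profession: str               # placeholder label, not a name from the paper
    correct_with_image: bool      # graded with the original image attached
    correct_without_image: bool   # graded after the image was removed


def paired_audit(results):
    """Per profession: accuracy (%) with and without images, and the gap in points."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r.profession].append(r)

    summary = {}
    for profession, rows in buckets.items():
        n = len(rows)
        acc_with = 100.0 * sum(r.correct_with_image for r in rows) / n
        acc_without = 100.0 * sum(r.correct_without_image for r in rows) / n
        summary[profession] = {
            "n": n,
            "with_image": acc_with,
            "without_image": acc_without,
            # Positive gap: the model scored higher when the image was present.
            "gap": acc_with - acc_without,
        }
    return summary


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    demo = [
        PairedResult("q1", "profession_a", True, False),
        PairedResult("q2", "profession_a", True, True),
        PairedResult("q3", "profession_b", False, False),
        PairedResult("q4", "profession_b", True, False),
    ]
    for name, stats in paired_audit(demo).items():
        print(name, stats)
```

Because both accuracies are computed over the same questions, the gap isolates the effect of the image rather than differences in question difficulty, which is the reason the abstract gives for not comparing the text-only and with-image subsets directly.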
