Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

TL;DR
This study benchmarks GPT-4, ChatGPT, and GPT-3 on Japanese national medical licensing exams. GPT-4 performs best, but the evaluation also exposes limitations: the models sometimes select prohibited answer choices, and Japanese text is tokenized inefficiently.
Contribution
The paper introduces a new benchmark for evaluating LLMs on Japanese medical licensing exams and reports on their performance and limitations in a language other than English.
Findings
GPT-4 passes all six years of exams
LLMs sometimes suggest prohibited medical choices
Japanese tokenization affects API costs and context size
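The tokenization finding can be illustrated with a rough proxy. This is a minimal sketch, not the paper's measurement: it assumes a byte-level tokenizer, for which the UTF-8 byte count of a string is an upper bound on its token count, and the helper name `utf8_bytes_per_char` is hypothetical. Actual token counts depend on the tokenizer's learned merges and will usually be lower than the byte count.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`.

    Assumption: used only as a crude proxy for the token overhead of a
    byte-level tokenizer; it is not a real token count.
    """
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


english = "chest pain"
japanese = "胸痛"  # "chest pain" in Japanese

print(utf8_bytes_per_char(english))   # 1.0 (ASCII: one byte per character)
print(utf8_bytes_per_char(japanese))  # 3.0 (these kanji take three bytes each)
```

Under this assumption, the same prompt budget covers less Japanese text than English text, which is one way the cost and context-size effect listed above could arise.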
Abstract
As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Cosine Annealing · Dropout · Dense Connections
