Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Jungo Kasai; Yuhei Kasai; Keisuke Sakaguchi; Yutaro Yamada; Dragomir Radev

arXiv:2303.18027·cs.CL·April 6, 2023·50 cites

TL;DR

This study benchmarks GPT-4, ChatGPT, and GPT-3 on Japanese medical licensing exams, revealing GPT-4's superior performance and highlighting limitations like prohibited answer choices and tokenization issues in Japanese.

Contribution

It introduces a new benchmark for evaluating LLMs on Japanese medical exams and provides insights into their performance and limitations in a non-English language.

Findings

01. GPT-4 passes all six years of exams

02. LLMs sometimes suggest prohibited medical choices

03. Japanese tokenization affects API costs and context size
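The tokenization finding can be illustrated with a rough proxy. The sketch below is not the paper's measurement and does not call any real tokenizer; it only compares UTF-8 bytes per character, on the assumption that the tokenizer starts from UTF-8 bytes (as byte-level BPE tokenizers do). The function name `utf8_bytes_per_char` and the sample strings are hypothetical and chosen for illustration. Actual token counts depend on the tokenizer's learned merges, so this shows only why Japanese text can start from a less compact representation.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings for illustration; not taken from the exams.
english = "chest pain"
japanese = "胸痛"

# ASCII characters take 1 byte each; these kanji take 3 bytes each in UTF-8.
print(utf8_bytes_per_char(english))   # 1.0
print(utf8_bytes_per_char(japanese))  # 3.0
```

A higher byte count per character means more base units before any merges are applied, which is one plausible reason the same content may use more tokens, and therefore more API cost and context, in Japanese than in English.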

Abstract

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as…

Code & Models

Repositories

jungokasai/igakuqa (Official)

Models

🤗 EQUES/MedLLama3-JP-v2
model · 141 dl · ♡ 2

Datasets

Coldog2333/JMedBench
dataset · 886 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Cosine Annealing · Dropout · Dense Connections