Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan

Yuta Nagamori; Mikoto Kosai; Yuji Kawai; Haruka Marumo; Misaki Shibuya; Tatsuya Negishi; Masaki Imanishi; Yasumasa Ikeda; Koichiro Tsuchiya; Asuka Sawai; Licht Miyamoto

arXiv:2508.10011·cs.CL·August 15, 2025

TL;DR

This study evaluates GPT-based AI models as study aids for Japan's dietitian licensure exam, finding limited accuracy and consistency, with some models marginally surpassing passing thresholds but overall requiring further improvements.

Contribution

The study provides a comprehensive assessment of how current LLM-based generative AI models perform as aids for nutrition exam preparation, highlighting their strengths and limitations.

Findings

01

Bing-Precise and Bing-Creative exceeded the 60% passing threshold.

02

Models showed inconsistent answers across repeated attempts.

03

Prompt engineering had limited impact on improving accuracy.

Abstract

Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, its performance in nutritional education, especially on the Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential…
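To make the evaluation criteria concrete, the following is a minimal sketch of how accuracy and consistency across repeated attempts might be scored. The metric definitions, the function names (`accuracy`, `consistency`), and the toy answer data are all assumptions for illustration and are not taken from the paper; only the 60% passing threshold comes from the findings above.

```python
# Hypothetical scoring sketch; metric definitions are assumptions, not the paper's.

PASS_THRESHOLD = 0.60  # passing threshold cited in the findings


def accuracy(responses, answer_key):
    """Fraction of questions answered correctly in a single attempt."""
    correct = sum(1 for q, a in answer_key.items() if responses.get(q) == a)
    return correct / len(answer_key)


def consistency(attempts):
    """Fraction of questions that received the same answer in every attempt."""
    questions = list(attempts[0].keys())
    same = sum(
        1 for q in questions if len({attempt.get(q) for attempt in attempts}) == 1
    )
    return same / len(questions)


# Toy data (invented): five multiple-choice questions, three independent sessions.
answer_key = {"Q1": 2, "Q2": 4, "Q3": 1, "Q4": 5, "Q5": 3}
attempts = [
    {"Q1": 2, "Q2": 4, "Q3": 1, "Q4": 2, "Q5": 3},
    {"Q1": 2, "Q2": 4, "Q3": 3, "Q4": 2, "Q5": 3},
    {"Q1": 2, "Q2": 1, "Q3": 1, "Q4": 2, "Q5": 3},
]

per_attempt_accuracy = [accuracy(a, answer_key) for a in attempts]
mean_accuracy = sum(per_attempt_accuracy) / len(per_attempt_accuracy)
consistency_rate = consistency(attempts)
exceeds_threshold = mean_accuracy > PASS_THRESHOLD

print(per_attempt_accuracy, round(mean_accuracy, 3), consistency_rate, exceeds_threshold)
```

Note that under these definitions a model can be consistent and still wrong: in the toy data, Q4 receives the same incorrect answer in every session, so it counts toward consistency but not accuracy.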
