# Evaluating Chat Generative Pretrained Transformer (GPT-4o) Problem-Solving Performance in the Japan Certificate Examination for Biomedical Engineering Class 1

**Authors:** Kai Ishida

PMC · DOI: 10.7759/cureus.81029 · 2025-03-23

## TL;DR

This study tested ChatGPT (GPT-4o) on 171 questions from Japan's Certificate Examination for Biomedical Engineering Class 1 and found accuracy of roughly 58% to 68% across question types, with the 60% passing threshold reached in only one of the three examinations.

## Contribution

The study evaluates ChatGPT (GPT-4o) on a specific biomedical engineering certification exam (CEBM1), using the Japanese questions without prompt optimization, and characterizes the limits of its accuracy and knowledge depth.

## Key findings

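- ChatGPT answered 68.4% of fundamental-knowledge, 57.9% of applied-knowledge, and 59.6% of problem-solving questions correctly, with no statistically significant difference among the three groups.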
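- Over 80% of incorrect answers were attributed to a lack of knowledge or incorrect knowledge; in certain cases, answers to questions about causes and countermeasures were hallucinated.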
- The passing criterion of 60% or higher was met only for the 28th examination, one of the three examinations tested.

## Abstract

Introduction

Chat generative pretrained transformer (ChatGPT; OpenAI, San Francisco, CA) has developed rapidly and is used in various fields, including medical engineering. Japan’s Certificate Examination for Biomedical Engineering class 1 (CEBM1) assesses comprehensive specialized knowledge and skills centered on the maintenance and safety management of medical devices, systems, and related equipment. This study evaluated the performance of ChatGPT (GPT-4o) on the CEBM1 and compared it with human-level expectations.

Methods

We targeted 171 questions from the 26th to 28th CEBM1s, covering fundamental knowledge, applied knowledge, and problem-solving ability. We entered the Japanese version of each question into ChatGPT (GPT-4o) without any prompt optimization, compared its responses with the correct answers, and evaluated performance by question difficulty.
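The comparison of responses against the answer key can be sketched as a per-category tally. This is a minimal illustration only: the function name `score_by_category`, the question IDs, and the toy data are hypothetical and are not taken from the paper, which does not describe its scoring code.

```python
from collections import defaultdict


def score_by_category(responses, answer_key, categories):
    """Return {category: (n_correct, n_total)} for multiple-choice answers.

    responses, answer_key: dicts mapping question ID -> chosen/correct option.
    categories: dict mapping question ID -> category label.
    """
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for qid, correct_option in answer_key.items():
        category = categories[qid]
        tally[category][1] += 1
        # A missing response counts as incorrect.
        if responses.get(qid) == correct_option:
            tally[category][0] += 1
    return {category: tuple(counts) for category, counts in tally.items()}


# Hypothetical toy data (not from the paper).
answer_key = {"q1": 3, "q2": 1, "q3": 5, "q4": 2}
categories = {
    "q1": "fundamental",
    "q2": "fundamental",
    "q3": "applied",
    "q4": "problem-solving",
}
responses = {"q1": 3, "q2": 4, "q3": 5, "q4": 2}

scores = score_by_category(responses, answer_key, categories)
for category, (n_correct, n_total) in scores.items():
    print(f"{category}: {n_correct}/{n_total} = {100 * n_correct / n_total:.1f}%")
```

Keeping the tally keyed by category makes it straightforward to report one accuracy figure per question type, as the abstract does.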

Results

The number of correct answers was 39 (68.4±10.5%) for questions testing fundamental knowledge, 33 (57.9±5.3%) for questions testing applied knowledge, and 34 (59.6±8.0%) for questions testing problem-solving ability. There was no statistically significant difference among the three groups. The passing criterion of 60% or higher was achieved only for the 28th examination. Over 80% of the incorrect answers were attributable to a lack of knowledge or incorrect knowledge. When asked about the background causes of, and specific countermeasures for, problems related to medical devices, ChatGPT misunderstood the questions and, in certain cases, generated hallucinated answers.
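As a quick arithmetic check, the reported percentages can be converted back to answer counts. The sketch below assumes the 171 questions split evenly into 57 per category; the abstract does not state the split, so this is an assumption, and the variable and function names are illustrative.

```python
# Assumption: 171 questions divided evenly across the three categories.
QUESTIONS_PER_CATEGORY = 171 // 3  # 57
PASS_THRESHOLD_PCT = 60.0  # passing criterion stated in the abstract

# Mean accuracies reported in the abstract (percent).
reported_pct = {
    "fundamental": 68.4,
    "applied": 57.9,
    "problem-solving": 59.6,
}


def implied_count(pct, total=QUESTIONS_PER_CATEGORY):
    """Number of correct answers implied by a percentage of `total`."""
    return round(pct / 100 * total)


implied = {name: implied_count(pct) for name, pct in reported_pct.items()}
print(implied)
# -> {'fundamental': 39, 'applied': 33, 'problem-solving': 34}

# Round-trip: each implied count reproduces the reported percentage.
for name, count in implied.items():
    assert round(100 * count / QUESTIONS_PER_CATEGORY, 1) == reported_pct[name]

# Categories whose mean accuracy reaches the 60% criterion.
at_or_above = [n for n, pct in reported_pct.items() if pct >= PASS_THRESHOLD_PCT]
print(at_or_above)
# -> ['fundamental']
```

Note that the passing criterion in the study is applied per examination, not per category, so the last step only shows which category means sit above the 60% line.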

Conclusions

Currently, ChatGPT possesses a certain level of knowledge in medical engineering; however, it cannot be considered universally accurate in solving all possible problems.

## Full-text entities

- **Diseases:** hallucinations (MESH:D006212)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12013459/full.md

---
Source: https://tomesphere.com/paper/PMC12013459