MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

Perry E. Radau

arXiv:2605.05175 · eess.IV · May 7, 2026

TL;DR

MRI-Eval introduces a tiered benchmark for evaluating large language models on MRI physics and GE scanner operational knowledge, revealing high multiple-choice accuracy but weaker free-text recall, especially for vendor-specific details.

Contribution

This work presents an MRI-focused benchmark organized into nine categories and three difficulty tiers, and it highlights the limitations of current LLMs in vendor-specific operational knowledge.

Findings

01. High overall MCQ accuracy (93.2%-97.1%) across models.

02. GE scanner operations scored lowest among categories.

03. Stem-only accuracy drops significantly, especially for GE-specific questions.

Abstract

Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice.

Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses.

Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used…
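To make the MCQ and stem-only conditions concrete, here is a minimal Python sketch of how a single item could be rendered under each condition and how MCQ accuracy could be tallied per category. It is an illustration under stated assumptions, not the paper's evaluation harness. The `Item` schema, the function names, and the toy data are all hypothetical. The sketch scores only the MCQ condition, because the abstract excerpt is truncated before it says how stem-only answers were scored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical item schema; field names are illustrative assumptions."""
    category: str
    tier: int
    stem: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # correct option letter


def mcq_prompt(item: Item) -> str:
    """Render the stem followed by the lettered options."""
    lines = [item.stem]
    lines += [f"{letter}. {text}" for letter, text in sorted(item.options.items())]
    return "\n".join(lines)


def stem_only_prompt(item: Item) -> str:
    """Render the stem with the answer options removed."""
    return item.stem


def mcq_accuracy_by_category(items, predictions):
    """Fraction of items per category whose predicted letter matches the key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.category] += 1
        if predicted.strip().upper() == item.answer:
            correct[item.category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data for illustration only; these are not items from MRI-Eval.
items = [
    Item(
        category="MRI physics",
        tier=1,
        stem="Which parameter is the time between successive excitation pulses?",
        options={"A": "TE", "B": "TR", "C": "TI", "D": "Flip angle"},
        answer="B",
    ),
    Item(
        category="GE scanner operations",
        tier=3,
        stem="(placeholder stem for a vendor-specific question)",
        options={"A": "option 1", "B": "option 2", "C": "option 3", "D": "option 4"},
        answer="C",
    ),
]

print(mcq_prompt(items[0]))
print(stem_only_prompt(items[0]))
print(mcq_accuracy_by_category(items, ["B", "a"]))
```

Keeping the two prompt renderers separate means the same item pool can feed both conditions, so any accuracy gap between them reflects the removal of the options and not a change in the questions.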
