MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge
Perry E. Radau

TL;DR
MRI-Eval introduces a tiered benchmark for evaluating large language models on MRI physics and GE scanner operational knowledge, revealing high multiple-choice accuracy but weaker free-text recall, especially for vendor-specific details.
Contribution
This work presents an MRI-focused benchmark spanning nine categories and three difficulty tiers, and highlights the limitations of current LLMs in vendor-specific operational knowledge.
Findings
Overall MCQ accuracy is high across models, ranging from 93.2% to 97.1%.
GE scanner operations is the lowest-scoring category.
Accuracy drops substantially in the stem-only condition (answer options removed), especially for GE-specific questions.
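The comparison behind these findings can be illustrated with a minimal sketch of per-category accuracy and the drop between the MCQ and stem-only conditions. This is not the paper's scoring code: the record layout (`category`, `correct`), the function names, and the toy data are all assumptions made for illustration, and the numbers are invented rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction correct} for a list of scored items.

    Each record is a hypothetical dict with a "category" string and a
    boolean "correct" flag; this layout is assumed, not from the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        correct[rec["category"]] += int(rec["correct"])
    return {cat: correct[cat] / total[cat] for cat in total}


def accuracy_drop(mcq_records, stem_records):
    """Per-category accuracy difference: MCQ minus stem-only."""
    mcq = accuracy_by_category(mcq_records)
    stem = accuracy_by_category(stem_records)
    # Only compare categories that appear under both conditions.
    return {cat: mcq[cat] - stem[cat] for cat in mcq if cat in stem}


# Toy, invented records (not benchmark data).
mcq_records = [
    {"category": "physics", "correct": True},
    {"category": "physics", "correct": True},
    {"category": "ge_operations", "correct": True},
    {"category": "ge_operations", "correct": False},
]
stem_records = [
    {"category": "physics", "correct": True},
    {"category": "physics", "correct": False},
    {"category": "ge_operations", "correct": False},
    {"category": "ge_operations", "correct": False},
]

drops = accuracy_drop(mcq_records, stem_records)
for category, drop in sorted(drops.items()):
    print(f"{category}: drop of {drop:.1%}")
```

Free-text (stem-only) answers would first need to be graded as correct or incorrect; the sketch assumes that grading has already been done and only covers the aggregation step.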
Abstract
Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice.
Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses.
Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used…
