Performance of Large Language Models in Technical MRI Question Answering: A Comparative Study
Alan B McMillan

TL;DR
This study evaluates the accuracy of open-source and closed-source large language models in answering technical MRI questions. The closed-source o1 Preview reached 94% accuracy and the open-source Phi 3.5 Mini reached 78%, suggesting potential to improve MRI practice and education.
Contribution
The paper provides a comprehensive comparison of multiple LLMs' performance on MRI-related questions, highlighting their potential in clinical and educational settings.
Findings
Closed-source o1 Preview achieved 94% accuracy.
Open-source Phi 3.5 Mini achieved 78% accuracy.
Models performed best in Basic Principles and Instrumentation.
Abstract
Background: Advances in artificial intelligence, particularly large language models (LLMs), have the potential to enhance technical expertise in magnetic resonance imaging (MRI), regardless of operator skill or geographic location. Methods: We assessed the accuracy of several LLMs in answering 570 technical MRI questions derived from a standardized review book. The questions spanned nine MRI topics, including Basic Principles, Image Production, and Safety. Closed-source models (e.g., OpenAI's o1 Preview, GPT-4o, GPT-4 Turbo, and Claude 3.5 Haiku) and open-source models (e.g., Phi 3.5 Mini, Llama 3.1, smolLM2) were tested. Models were queried using standardized prompts via the LangChain framework, and responses were graded against correct answers using an automated scoring protocol. Accuracy, defined as the proportion of correct answers, was the primary outcome. Results: The…
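The scoring step described in the Methods can be illustrated with a minimal sketch. It assumes multiple-choice questions whose correct answer is a single option letter, and a model instructed to reply with that letter first; the abstract states neither detail. The names `normalize` and `grade`, the tuple layout, and the sample data are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def normalize(reply: str) -> str:
    """Reduce a raw reply such as ' b) ' or 'C.' to its first character, uppercased.

    Assumption: the prompt asks the model to answer with the option letter first.
    """
    return reply.strip().upper()[:1]


def grade(responses, answer_key):
    """Return (overall_accuracy, per_topic_accuracy).

    responses  : iterable of (question_id, topic, raw_reply) tuples
    answer_key : dict mapping question_id to the correct option letter
    Accuracy is the proportion of correct answers, as defined in the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_id, topic, reply in responses:
        total[topic] += 1
        if normalize(reply) == answer_key[question_id]:
            correct[topic] += 1
    per_topic = {topic: correct[topic] / total[topic] for topic in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_topic


# Hypothetical data for illustration only; not results from the study.
sample_responses = [
    (1, "Safety", "B"),
    (2, "Safety", " c)"),
    (3, "Basic Principles", "a"),
]
sample_key = {1: "B", 2: "D", 3: "A"}

overall, per_topic = grade(sample_responses, sample_key)
print(f"overall accuracy: {overall:.2f}")
for topic, accuracy in per_topic.items():
    print(f"{topic}: {accuracy:.2f}")
```

Grouping by topic makes it possible to report per-topic accuracy alongside the overall figure, which is the kind of breakdown the Findings refer to when they single out Basic Principles and Instrumentation.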
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems
