Cardiology knowledge assessment of retrieval-augmented open versus proprietary large language models

Constantine Tarabanis; Shaan Khurshid; Areti Karamanou; Rodo Piperaki; Lucas A. Mavromatis; Aris Hatzimemos; Dimitrios Tachmatzidis; Constantinos Bakogiannis; Vassilios Vassilikos; Patrick T. Ellinor; Lior Jankelson; Evangelos Kalampokis

PMC · DOI:10.1371/journal.pdig.0001029·March 12, 2026

Cardiology knowledge assessment of retrieval-augmented open versus proprietary large language models

Constantine Tarabanis, Shaan Khurshid, Areti Karamanou, Rodo Piperaki, Lucas A. Mavromatis, Aris Hatzimemos, Dimitrios Tachmatzidis, Constantinos Bakogiannis, Vassilios Vassilikos, Patrick T. Ellinor, Lior Jankelson, Evangelos Kalampokis

PDF

Open Access

TL;DR

Open-weight AI models performed as well or better than commercial models in cardiology tests, especially when given access to medical references.

Contribution

Demonstrated that open-weight models with retrieval-augmented generation can outperform proprietary models in cardiovascular medicine assessments.

Findings

01

The open-weight model DeepSeek R1 outperformed all proprietary models and the human average in cardiology board-style questions.

02

Retrieval-Augmented Generation improved performance across all models, with the most significant gains for smaller models.

03

Open-weight models offer a viable, lower-cost alternative to proprietary models for clinical applications due to their transparency and configurability.

Abstract

To evaluate the performance of open-weight and proprietary LLMs, with and without Retrieval-Augmented Generation (RAG), on cardiology board-style questions and benchmark them against the human average. We tested 14 LLMs (6 open-weight, 8 proprietary) on 449 multiple-choice questions from the American College of Cardiology Self-Assessment Program (ACCSAP). Accuracy was measured as percent correct. RAG was implemented using a knowledge base of 123 guideline and textbook documents. The open-weight model DeepSeek R1 achieved the highest accuracy at 86.9% (95% CI: 83.4–89.7%), outperforming proprietary models and the human average of 78%. GPT 4o (80.9%, 95% CI: 77.0–84.2%) and the commercial platform OpenEvidence (81.3%, 95% CI: 77.4–84.7%) demonstrated similar performance. A positive correlation between model size and performance was observed within model families, but across families,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Figures3

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsArtificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare