Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

Seyed Amir Ahmad Safavi-Naini; Shuhaib Ali; Omer Shahab; Zahra Shahhoseini; Thomas Savage; Sara Rafiee; Jamil S Samaan; Reem Al Shabeeb; Farah Ladak; Jamie O Yang; Juan Echavarria; Sumbal Babar; Aasma Shaukat; Samuel Margolis; Nicholas P Tatonetti; Girish Nadkarni; Bara El Kurdi; Ali Soroush

arXiv:2409.00084·cs.CL·April 10, 2026·5 cites

TL;DR

This study evaluates the performance of various large language and vision-language models in gastroenterology, highlighting the strengths of proprietary models and challenges in visual data integration.

Contribution

It systematically compares proprietary, open-source, and quantized LLMs and VLMs on gastroenterology board exam-style multiple-choice questions, with and without images.

Findings

01

Among proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy.

02

Quantized open-source models performed comparably to full-precision models.

03

VLM performance on image-containing questions did not improve when images or captions were supplied, but did improve with human descriptions of the images.

Abstract

Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision-language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations, parameters, and prompt engineering strategies using GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved…
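The abstract mentions a semiautomated accuracy pipeline without describing it here. The following is only a minimal sketch of what such scoring could look like, not the authors' method. The answer format ("Answer: B"), the option range A-E, and the names `ANSWER_RE`, `extract_choice`, and `score` are all assumptions made for illustration. Responses with no single unambiguous answer are flagged for manual review rather than guessed.

```python
import re
from typing import List, Optional, Tuple

# Assumed (hypothetical) answer format: "Answer: B" or "answer is (C)"; options A-E.
ANSWER_RE = re.compile(r"[Aa]nswer(?:\s+is|:)?\s*\(?([A-E])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter if exactly one distinct answer is stated, else None."""
    found = set(ANSWER_RE.findall(response))
    return found.pop() if len(found) == 1 else None


def score(responses: List[str], key: List[str]) -> Tuple[float, List[int]]:
    """Return (accuracy over all questions, indices that need manual review).

    Unresolved responses are not counted as correct; a reviewer would
    settle them by hand, which is the "semi" part of this sketch.
    """
    correct = 0
    needs_review: List[int] = []
    for i, (resp, gold) in enumerate(zip(responses, key)):
        choice = extract_choice(resp)
        if choice is None:
            needs_review.append(i)
        elif choice == gold:
            correct += 1
    return correct / len(key), needs_review


# Toy data, invented for illustration only (not from the study).
responses = ["Answer: B", "The answer is (C).", "Could be A or D.", "Answer: A"]
key = ["B", "C", "D", "B"]
accuracy, needs_review = score(responses, key)
```

In this toy run, the third response has no parseable answer and is routed to manual review, while the other three are scored automatically.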
