Benchmarking and Adapting On-Device LLMs for Clinical Decision Support
Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, and Bo Wang

TL;DR
This study benchmarks on-device open-source large language models for clinical decision support, demonstrating their competitive performance and adaptability, which supports privacy-preserving healthcare applications.
Contribution
The paper compares open-source on-device LLMs with proprietary models across three clinical tasks and shows that fine-tuning improves their diagnostic accuracy.
Findings
On-device models perform comparably to or better than some proprietary models.
Fine-tuning significantly improves the diagnostic accuracy of on-device models, bringing it close to proprietary model performance.
Most diagnostic errors are clinically plausible, not off-topic.
Abstract
Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often have large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20b, 120b), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and…
