Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis

Qiuhong Wei; Ying Cui; Mengwei Ding; Yanqin Wang; Lingling Xiang; Zhengxiong Yao; Ceran Chen; Ying Long; Zhezhen Jin; Ximing Xu

arXiv:2407.15862·cs.LG·July 24, 2024·1 cites

Open Access

TL;DR

This study compares lightweight open-source LLMs with a proprietary model in pediatric healthcare consultations and finds that, although the lightweight models show promise, they remain less accurate than the larger proprietary ChatGPT-3.5.

Contribution

It provides a comparative analysis of lightweight open-source LLMs versus a large proprietary model in pediatric medical question answering, highlighting current performance gaps.

Findings

01 ChatGLM3-6B outperforms Vicuna models in accuracy and completeness.

02 ChatGPT-3.5 significantly outperforms all lightweight models in accuracy and completeness.

03 All models maintain high safety standards (>98.4%).

Abstract

Large language models (LLMs) have demonstrated potential applications in medicine, yet data privacy and computational burden limit their deployment in healthcare institutions. Open-source and lightweight versions of LLMs emerge as potential solutions, but their performance, particularly in pediatric settings, remains underexplored. In this cross-sectional study, 250 patient consultation questions were randomly selected from a public online medical forum, with 10 questions from each of 25 pediatric departments, spanning from December 1, 2022, to October 30, 2023. Two lightweight open-source LLMs, ChatGLM3-6B and Vicuna-7B, along with a larger-scale model, Vicuna-13B, and the widely used proprietary ChatGPT-3.5, independently answered these questions in Chinese between November 1, 2023, and November 7, 2023. To assess reproducibility, each inquiry was replicated once. We found that…
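The sampling design described in the abstract (10 randomly selected questions from each of 25 pediatric departments, 250 in total) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_per_department`, the `(department, question_text)` record layout, the fixed seed, and the synthetic data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_department(questions, per_department=10, seed=0):
    """Draw a fixed number of questions at random from each department.

    `questions` is an iterable of (department, question_text) pairs.
    This record layout is a hypothetical stand-in for the forum data.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group question texts by department.
    by_department = defaultdict(list)
    for department, text in questions:
        by_department[department].append(text)

    # Sample without replacement within each department.
    sample = {}
    for department in sorted(by_department):
        pool = by_department[department]
        if len(pool) < per_department:
            raise ValueError(f"{department}: only {len(pool)} questions available")
        sample[department] = rng.sample(pool, per_department)
    return sample


# Synthetic placeholder data: 25 departments with 40 questions each.
toy_questions = [
    (f"dept_{d:02d}", f"question {d}-{i}") for d in range(25) for i in range(40)
]
selected = sample_per_department(toy_questions)
total_selected = sum(len(qs) for qs in selected.values())
print(len(selected), total_selected)  # prints "25 250"
```

Sampling within each department, rather than from the pooled questions, guarantees equal representation of every department regardless of how many questions each one receives on the forum.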

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Adolescent and Pediatric Healthcare