Closing the gap between open-source and commercial large language models for medical evidence summarization

Gongbo Zhang; Qiao Jin; Yiliang Zhou; Song Wang; Betina R. Idnay; Yiming Luo; Elizabeth Park; Jordan G. Nestor; Matthew E. Spotnitz; Ali Soroush; Thomas Campion; Zhiyong Lu; Chunhua Weng; Yifan Peng

arXiv:2408.00588 · cs.CL · August 2, 2024 · 2 cites

Open Access

TL;DR

This study demonstrates that fine-tuning open-source large language models significantly narrows the performance gap with proprietary models in medical evidence summarization, offering more transparent and customizable options.

Contribution

The study investigates how far fine-tuning improves open-source LLMs such as PRIMERA, LongT5, and Llama-2 on medical evidence summarization, and shows that the fine-tuned models perform close to proprietary ones.

Findings

01

Overall, fine-tuning increased ROUGE-L by 9.89 and METEOR by 13.21, and also improved CHRF scores.

02

Fine-tuned LongT5 approaches the performance of GPT-3.5 used in a zero-shot setting.

03

Smaller fine-tuned models sometimes outperform larger zero-shot models.

Abstract

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37),…
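For readers unfamiliar with the ROUGE-L metric named in the abstract, the following is a minimal sketch of how a sentence-level ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of reference and candidate tokens. It is an illustration only, not the paper's evaluation code: the whitespace tokenization, lowercasing, equal weighting of precision and recall, and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for this example. The sketch returns a value between 0 and 1, whereas the gains quoted in the abstract appear to be on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 (assumed: whitespace tokens, lowercase, beta = 1)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from MedReview.
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(round(rouge_l_f1(reference, candidate), 4))
```

In the usage example the two sentences share five of six tokens in order, so precision and recall are both 5/6. Published evaluations typically rely on an established package with stemming and its own tokenizer, so scores from this sketch will not match reported numbers exactly.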

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Multi-Head Attention · Cosine Annealing · Weight Decay