CareBot: A Pioneering Full-Process Open-Source Medical Language Model

Lulu Zhao; Weihao Zeng; Xiaofeng Shi; Hua Zhou

arXiv:2412.15236·cs.CL·December 24, 2024

TL;DR

CareBot is a bilingual (Chinese and English) medical language model that combines a two-stage continuous pre-training method with model-based data quality assessment to improve performance on medical consultation and education tasks.

Contribution

The paper introduces three components for open-source medical language models: a two-stage continuous pre-training method (Stable CPT followed by Boost CPT), DataRater, a model that assesses data quality during CPT, and ConFilter, which targets dialogue quality.

Findings

01. Outperforms existing models on Chinese and English medical benchmarks.

02. Enhances multi-turn dialogue quality in medical conversations.

03. Demonstrates effective application in medical consultation and education.

Abstract

Recently, both closed-source LLMs and open-source communities have made significant strides, outperforming humans in various general domains. However, their performance in specific professional domains such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. In this paper, we propose CareBot, a bilingual medical LLM, which leverages a comprehensive approach integrating continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Our novel two-stage CPT method, comprising Stable CPT and Boost CPT, effectively bridges the gap between general and domain-specific data, facilitating a smooth transition from pre-training to fine-tuning and enhancing domain knowledge progressively. We also introduce DataRater, a model designed to assess data quality during CPT, ensuring…

Taxonomy

Topics: Semantic Web and Ontologies · Biomedical Text Mining and Ontologies · Electronic Health Records Systems

Methods: Sparse Evolutionary Training · Shrink and Fine-Tune