ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling
Yuanhe Tian, Junjie Liu, Zhizhou Kou, Yuxiang Li, Yan Song

TL;DR
ChiMed 2.0 is a comprehensive Chinese medical dataset designed to enhance large language models through pre-training, fine-tuning, and reinforcement learning, significantly improving performance on medical benchmarks.
Contribution
The paper introduces ChiMed 2.0, a large-scale Chinese medical dataset that supports pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF), addressing the limited size and narrow domain coverage of previous datasets.
Findings
Models trained on ChiMed 2.0 show improved performance on medical benchmarks across model scales.
The dataset effectively supports pre-training, fine-tuning, and RLHF.
The results validate the dataset's applicability to Chinese medical LLMs.
Abstract
Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs…
