ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling

Yuanhe Tian; Junjie Liu; Zhizhou Kou; Yuxiang Li; Yan Song

arXiv:2507.15275 · cs.CL · July 22, 2025

TL;DR

ChiMed 2.0 is a Chinese medical dataset that supports pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF) for large language models; models trained on it perform better on medical benchmarks.

Contribution

The paper introduces ChiMed 2.0, a large-scale Chinese medical dataset supporting pre-training, fine-tuning, and RLHF, addressing limitations of previous datasets in size and domain coverage.

Findings

01

Performance improvements on medical benchmarks across model scales.

02

Effective support for pre-training, fine-tuning, and RLHF tasks.

03

Validation of the dataset's applicability to Chinese medical LLMs.

Abstract

Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs…
