UltraMedical: Building Specialized Generalists in Biomedicine

Kaiyan Zhang; Sihang Zeng; Ermo Hua; Ning Ding; Zhang-Ren Chen; Zhiyuan Ma; Haoxin Li; Ganqu Cui; Biqing Qi; Xuekai Zhu; Xingtai Lv; Hu Jinfang; Zhiyuan Liu; Bowen Zhou

arXiv:2406.03949·cs.CL·October 30, 2024·6 cites

TL;DR

This paper introduces UltraMedical, a collection of high-quality biomedical datasets used to fine-tune Llama-3-based models. The resulting open models advance specialized medical language understanding and offer an alternative to proprietary models, whose use in biomedicine raises privacy and security concerns.

Contribution

The paper releases the UltraMedical datasets and shows that they are effective for fine-tuning biomedical LLMs, improving performance on medical benchmarks, and training reward models for preference learning.

Findings

01 Fine-tuned models show state-of-the-art performance on medical benchmarks.

02 UltraMedical datasets enable effective preference learning in biomedical LLMs.

03 Reward models improve online preference learning in biomedical AI.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, but they have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning, reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open-source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple…
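
The abstract names direct preference optimization (DPO) as one of the techniques applied to preference-annotated data. As a rough illustration only, the widely used DPO objective for a single (chosen, rejected) response pair can be sketched as below. This page does not state the exact objective or hyperparameters the authors used, so the function name `dpo_loss`, its arguments, and the default `beta` are assumptions for illustration, not the paper's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain

    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the chosen response and lowers the rejected one
# relative to the reference: positive margin, smaller loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior preference annotations are meant to encourage.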

Code & Models

Repositories

tsinghuac3i/ultramedical
Official

Models

Datasets

TsinghuaC3I/UltraMedical-Preference
dataset · 130 downloads

Videos

UltraMedical: Building Specialized Generalists in Biomedicine · SlidesLive

Taxonomy

Topics: Clinical Reasoning and Diagnostic Skills · Medical Coding and Health Information · Interdisciplinary Research and Collaboration

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention