MedINST: Meta Dataset of Biomedical Instructions

Wenhan Han; Meng Fang; Zihan Zhang; Yu Yin; Zirui Song; Ling Chen; Mykola Pechenizkiy; Qingyu Chen

arXiv:2410.13458 · cs.CL · October 18, 2024

TL;DR

MedINST is a comprehensive biomedical instruction dataset with 133 tasks and over 7 million samples, designed to improve large language models' generalization in biomedical NLP through multi-task training and benchmarking.

Contribution

We introduce MedINST, the largest multi-domain biomedical instruction dataset, and create MedINST32, a benchmark to evaluate LLMs' generalization in biomedical NLP.

Findings

01. Fine-tuning LLMs on MedINST improves cross-task performance.

02. MedINST32 presents diverse challenges for evaluating biomedical LLMs.

03. The fine-tuned models demonstrate enhanced generalization capabilities on the benchmark.

Abstract

The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and…

Code & Models

Repositories

aialt/medinst (official)

Taxonomy

Topics: Health Sciences Research and Education