Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Haonan He, Yuchen Ren, Yining Tang, Ziyang Xu, Junxian Li, Minghao Yang, Di Zhang, Dong Yuan, Tao Chen, Shufei Zhang, Yuqiang Li, Nanqing Dong, Wanli Ouyang, Dongzhan Zhou, Peng Ye

TL;DR
This paper introduces Biology-Instructions, a large-scale dataset and benchmark for training and evaluating large language models on multi-omics biological sequences, aiming to improve their understanding and reasoning in complex biological tasks.
Contribution
It provides the first instruction-tuning dataset for multi-omics sequences and proposes ChatMultiOmics, a novel training pipeline that enhances LLMs' biological understanding.
Findings
Current LLMs perform poorly on multi-omics tasks without specialized training.
ChatMultiOmics achieves superior performance on biological understanding tasks.
The dataset and baseline are publicly available for future research.
Abstract
Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBioinformatics and Genomic Networks · Biomedical Text Mining and Ontologies · Genetics, Bioinformatics, and Biomedical Research
