Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, Huajun Chen

TL;DR
Mol-Instructions is a large-scale dataset designed to improve large language models' understanding and prediction abilities in biomolecular research by providing specialized instructions across molecules, proteins, and biomolecular texts.
Contribution
We introduce Mol-Instructions, a comprehensive biomolecular instruction dataset that enhances LLMs' capabilities in the biomolecular domain through extensive instruction tuning.
Findings
Improved LLM performance on biomolecular tasks.
Demonstrated effectiveness of instruction tuning with Mol-Instructions.
Dataset is publicly available for research and updates.
Abstract
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering…
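To make the three-component structure described in the abstract concrete, here is a minimal sketch of how one instruction record might be represented and rendered as a prompt for instruction tuning. Only the three component names come from the abstract. The field names (`instruction`, `input`, `output`), the prompt template, and the example record are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass

# The three component names follow the abstract; everything else in this
# sketch (field names, prompt template, example record) is an assumption.
COMPONENTS = ("molecule", "protein", "biomolecular_text")


@dataclass
class InstructionRecord:
    component: str    # one of COMPONENTS
    instruction: str  # natural-language task description
    input: str        # task input, e.g. a SMILES string or a protein sequence
    output: str       # expected response (the training target)

    def __post_init__(self):
        # Reject records that do not belong to one of the three components.
        if self.component not in COMPONENTS:
            raise ValueError(f"unknown component: {self.component!r}")


def to_prompt(record: InstructionRecord) -> str:
    """Render a record as a prompt; the model is trained to produce record.output."""
    return (
        f"### Instruction:\n{record.instruction}\n\n"
        f"### Input:\n{record.input}\n\n"
        f"### Response:\n"
    )


# Hypothetical record, not taken from the dataset.
# "CCO" is the SMILES string for ethanol.
example = InstructionRecord(
    component="molecule",
    instruction="Describe the molecule given as a SMILES string.",
    input="CCO",
    output="The molecule is ethanol, a two-carbon primary alcohol.",
)

prompt = to_prompt(example)
```

During instruction tuning, the rendered prompt would typically be concatenated with `record.output`, with the loss computed on the response tokens. That detail is a common convention and is not something the abstract specifies.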
Taxonomy
Topics: Machine Learning in Materials Science · Machine Learning in Bioinformatics · Protein Structure and Dynamics
