DARWIN Series: Domain Specific Large Language Models for Natural Science
Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram Hoex

TL;DR
The DARWIN series is a set of domain-specific large language models tailored for natural science. The models leverage scientific knowledge and multi-task training to enhance automation and discovery in physics, chemistry, and materials science.
Contribution
The paper presents a series of open-source LLMs fine-tuned on scientific data and introduces the Scientific Instruction Generation (SIG) model for automated instruction generation, advancing AI applications in scientific research.
Findings
Achieved state-of-the-art results on scientific tasks
Reduced reliance on closed-source AI models
Demonstrated effective knowledge injection via SIG
Abstract
Emerging tools bring forth fresh approaches to work, and the field of natural science is no different. In natural science, traditional manual, serial, and labour-intensive work is being augmented by automated, parallel, and iterative processes driven by artificial intelligence-based experimental automation and more. To add new capabilities in natural science, enabling the acceleration and enrichment of automation of the discovery process, we present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and materials science. This series relies on open-source LLMs, incorporating structured and unstructured scientific knowledge from public datasets and literature. We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness. During the fine-tuning, we introduce the Scientific Instruction Generation (SIG) model, automating…
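As a rough illustration of what an "instruction data point" for fine-tuning can look like, the sketch below formats one record into a training string. Everything in it is an assumption made for illustration: the abstract does not specify DARWIN's data schema or prompt template, so the `InstructionRecord` fields, the Alpaca-style section headers, and the example record are hypothetical and not taken from the DARWIN dataset.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """Hypothetical schema for one instruction data point (not DARWIN's actual format)."""
    instruction: str  # the task description given to the model
    input: str        # optional context, empty when the task needs none
    output: str       # the reference answer the model is trained to produce


def to_prompt(rec: InstructionRecord) -> str:
    """Build the prompt part of a training example using an assumed Alpaca-style template."""
    parts = [f"### Instruction:\n{rec.instruction}"]
    if rec.input:
        # Only include the input section when there is context to show.
        parts.append(f"### Input:\n{rec.input}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


def to_training_text(rec: InstructionRecord) -> str:
    """Concatenate prompt and reference answer into one supervised fine-tuning string."""
    return to_prompt(rec) + rec.output


# A made-up record for illustration only.
example = InstructionRecord(
    instruction="State the chemical formula of water.",
    input="",
    output="H2O",
)
training_text = to_training_text(example)
```

In a setup like this, an instruction generator such as SIG would be one possible source of the `instruction` and `output` fields, produced from scientific text rather than written by hand; how DARWIN actually wires this together is not described in the portion of the abstract shown here.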
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Scientific Computing and Data Management
