HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science
Yu Song, Santiago Miret, Huan Zhang, Bang Liu

TL;DR
HoneyBee is a specialized large language model for materials science, developed through an instruction-based data curation process that improves data quality and model performance in this domain.
Contribution
This work introduces MatSci-Instruct, a novel instruction-based data curation method, and applies it to finetune HoneyBee, the first billion-parameter language model dedicated to materials science.
Findings
HoneyBee outperforms existing models on materials science tasks.
Iterative instruction refinement improves model performance.
The dataset and code are publicly available for further research.
Abstract
We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we then apply to finetune a LLaMa-based language model targeted for materials science (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science textual data available in the open literature, and HoneyBee is the first billion-parameter language model specialized to materials science. In MatSci-Instruct we improve the trustworthiness of generated data by prompting multiple commercially available large language models for generation with an Instructor module (e.g. Chat-GPT) and verification from an independent Verifier module (e.g. Claude). Using MatSci-Instruct, we construct a dataset of multiple tasks and measure the quality of our dataset along multiple dimensions, including accuracy against known facts, relevance to materials…
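The generate-then-verify idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Sample`, `curate`, the single scalar score, and the `threshold` parameter are hypothetical names and simplifications introduced here. The two callables stand in for calls to an Instructor model and an independent Verifier model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    instruction: str
    response: str


def curate(
    topics: List[str],
    instructor: Callable[[str], Sample],   # hypothetical stand-in for the Instructor LLM
    verifier: Callable[[Sample], float],   # hypothetical stand-in for the Verifier LLM
    threshold: float = 0.5,                # assumed cutoff, not taken from the paper
) -> List[Sample]:
    """Keep only generated samples whose verifier score meets the threshold."""
    kept = []
    for topic in topics:
        sample = instructor(topic)   # generate an instruction/response pair
        score = verifier(sample)     # score it with an independent model
        if score >= threshold:
            kept.append(sample)
    return kept


# Toy stubs so the sketch runs without any external API.
def toy_instructor(topic: str) -> Sample:
    return Sample(f"Describe {topic}.", f"{topic} placeholder answer")


def toy_verifier(sample: Sample) -> float:
    return 1.0 if "perovskite" in sample.instruction else 0.0


curated = curate(["perovskite", "graphene"], toy_instructor, toy_verifier)
```

Passing the two models as plain callables keeps the filtering logic independent of any particular provider, so the generator and the verifier can be swapped without touching `curate`.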
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Natural Language Processing Techniques
