HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science

Yu Song; Santiago Miret; Huan Zhang; Bang Liu

arXiv:2310.08511 · cs.CL · October 13, 2023 · 5 cites

Open Access · 1 Repo

TL;DR

HoneyBee is a specialized large language model for materials science, developed through an instruction-based data curation process that improves data quality and model performance in this domain.

Contribution

This work introduces MatSci-Instruct, a novel instruction-based data curation method, and applies it to finetune HoneyBee, the first billion-parameter language model dedicated to materials science.

Findings

01. HoneyBee outperforms existing models on materials science tasks.

02. Iterative instruction refinement improves model performance.

03. The dataset and code are publicly available for further research.

Abstract

We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we then apply to finetune a LLaMa-based language model targeted for materials science (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science textual data available in the open literature, and HoneyBee is the first billion-parameter language model specialized to materials science. In MatSci-Instruct we improve the trustworthiness of generated data by prompting multiple commercially available large language models for generation with an Instructor module (e.g. Chat-GPT) and verification from an independent Verifier module (e.g. Claude). Using MatSci-Instruct, we construct a dataset of multiple tasks and measure the quality of our dataset along multiple dimensions, including accuracy against known facts, relevance to materials…
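The Instructor/Verifier split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Instruction`, `curate`, `toy_instructor`, and `toy_verifier`, the single numeric score, and the `threshold` parameter are all hypothetical, and the toy functions stand in for calls to commercial language models. The sketch only shows the general idea that one module generates candidate instruction data and an independent module decides what is kept.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instruction:
    """One candidate instruction/response pair (hypothetical structure)."""
    prompt: str
    response: str


def curate(
    topics: List[str],
    instructor: Callable[[str], Instruction],
    verifier: Callable[[Instruction], float],
    threshold: float = 0.5,
) -> List[Instruction]:
    """Keep only the generated items that the independent verifier accepts.

    Assumption: the verifier returns a single score in [0, 1]. The abstract
    mentions several quality dimensions, which this sketch collapses into one.
    """
    kept = []
    for topic in topics:
        candidate = instructor(topic)   # generation step (Instructor)
        score = verifier(candidate)     # independent check (Verifier)
        if score >= threshold:
            kept.append(candidate)
    return kept


# Toy stand-ins for the two language models (assumptions, not real APIs).
def toy_instructor(topic: str) -> Instruction:
    return Instruction(prompt=f"Define {topic}.", response=f"{topic} is ...")


def toy_verifier(item: Instruction) -> float:
    # Pretend only items about perovskites pass the relevance check.
    return 1.0 if "perovskite" in item.prompt else 0.0


curated = curate(["perovskite", "unrelated"], toy_instructor, toy_verifier)
print([item.prompt for item in curated])
```

In a real pipeline the two callables would wrap separate model providers, so that the model that writes the data is not the one that judges it.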

Code & Models

Repositories

BangLab-UdeM-Mila/NLP4MatSci-HoneyBee
PyTorch · Official

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Natural Language Processing Techniques