Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry

Anurag Acharya; Shivam Sharma; Robin Cosbey; Megha Subramanian; Scott Howland; Maria Glenski

arXiv:2411.03542·cs.CL·November 7, 2024

Open Access

TL;DR

This paper investigates how domain-specific pretraining of large language models affects chemistry tasks, showing that in-domain pretraining and instruction fine-tuning improve performance over off-the-shelf models in scientific NLP applications.

Contribution

It demonstrates the benefits of in-domain pretraining and instruction fine-tuning for large language models applied to chemistry, highlighting improved task performance.

Findings

01. In-domain models perform well in zero-shot chemistry tasks.

02. Instruction fine-tuning enhances performance on chemistry-specific tasks.

03. In-domain adaptation outperforms off-the-shelf models in scientific NLP.

Abstract

A proliferation of Large Language Models (the GPT series, BLOOM, LLaMA, and more) are driving forward novel development of multipurpose AI for a variety of tasks, particularly natural language processing (NLP) tasks. These models demonstrate strong performance on a range of tasks; however, there has been evidence of brittleness when applied to more niche or narrow domains where hallucinations or fluent but incorrect responses reduce performance. Given the complex nature of scientific domains, it is prudent to investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains. In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare these to open-source, off-the-shelf models with zero-shot and few-shot prompting. Our results show that not only do in-domain base models perform…

Taxonomy

Topics: Topic Modeling

Methods: Linear Layer · Cosine Annealing · Adam · Attention Is All You Need · Attention Dropout · Multi-Head Attention · Weight Decay · Byte Pair Encoding · Dropout