Large language models, physics-based modeling, experimental measurements: the trinity of data-scarce learning of polymer properties

Ning Liu; Siavash Jafarzadeh; Brian Y. Lattimer; Shuna Ni; Jim Lua; Yue Yu

arXiv:2407.02770·cs.LG·July 4, 2024

Open Access

TL;DR

This paper introduces a physics-based training pipeline for large language models that addresses data scarcity in polymer property modeling by combining supervised pretraining on physics-generated synthetic data with finetuning on limited experimental data.

Contribution

The novel framework integrates physics-based synthetic data with a two-phase training strategy to improve LLM accuracy in data-scarce material property prediction tasks.

Findings

1. Supervised pretraining with synthetic data enhances finetuning accuracy.

2. The framework effectively models polymer flammability metrics with limited experimental data.

3. Synthetic data generation aligns LLMs with physical principles.

Abstract

Large language models (LLMs) hold promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters necessitates a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and costly to obtain in sufficient quantities for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a multitude of synthetic data to align the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) utilizing the abundant but less accurate synthetic data for supervised pretraining, and (2) finetuning the phase-1 model with limited experimental data. We empirically demonstrate that…
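The two-phase strategy described in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: a small linear regressor trained by gradient descent stands in for the LLM, and the data generators (`true_w`, `biased_w`, the sample sizes, learning rates, and step counts) are hypothetical values chosen only to mimic "abundant but less accurate" synthetic data and "scarce but accurate" experimental data.

```python
import numpy as np

def train(w, X, y, lr, steps):
    """Gradient descent on mean squared error for a linear model y ~ X @ w."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(w, X, y):
    """Mean squared error of the linear model on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)

# Hypothetical ground truth and a slightly wrong "physics model" (assumptions).
true_w = np.array([1.5, -2.0, 0.5])
biased_w = true_w + np.array([0.3, 0.2, -0.1])

# Phase-1 data: many synthetic samples from the imperfect physics model.
X_syn = rng.normal(size=(2000, 3))
y_syn = X_syn @ biased_w + 0.1 * rng.normal(size=2000)

# Phase-2 data: a handful of accurate "experimental" samples.
X_exp = rng.normal(size=(8, 3))
y_exp = X_exp @ true_w + 0.05 * rng.normal(size=8)

# Phase 1: supervised pretraining on synthetic data from a zero initial state.
w_init = np.zeros(3)
w_pre = train(w_init, X_syn, y_syn, lr=0.05, steps=300)

# Phase 2: finetune the phase-1 model on the scarce experimental data,
# using a smaller learning rate and few steps to stay near the pretrained state.
w_fin = train(w_pre, X_exp, y_exp, lr=0.01, steps=50)

print("synthetic-data MSE before/after phase 1:",
      mse(w_init, X_syn, y_syn), mse(w_pre, X_syn, y_syn))
print("experimental-data MSE before/after phase 2:",
      mse(w_pre, X_exp, y_exp), mse(w_fin, X_exp, y_exp))
```

The design point the sketch tries to capture is that phase 2 starts from the phase-1 weights rather than from scratch, so the few experimental samples only need to correct the physics model's bias instead of determining every parameter.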

Taxonomy

Topics: Machine Learning in Materials Science

Methods: ALIGN