Large language models, physics-based modeling, experimental measurements: the trinity of data-scarce learning of polymer properties
Ning Liu, Siavash Jafarzadeh, Brian Y. Lattimer, Shuna Ni, Jim Lua, Yue Yu

TL;DR
This paper introduces a physics-based training pipeline for large language models that effectively addresses data scarcity in polymer property modeling by combining synthetic data generation with limited experimental data finetuning.
Contribution
The novel framework integrates physics-based synthetic data with a two-phase training strategy to improve LLM accuracy in data-scarce material property prediction tasks.
Findings
Supervised pretraining with synthetic data enhances finetuning accuracy.
Framework effectively models polymer flammability metrics with limited experimental data.
Synthetic data generation aligns LLMs with physical principles.
Abstract
Large language models (LLMs) hold promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters necessitates a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and costly to obtain in sufficient quantities for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a multitude of synthetic data to align the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) using the abundant but less accurate synthetic data for supervised pretraining, and (2) finetuning the phase-1 model with limited experimental data. We empirically demonstrate that…
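The two-phase strategy in the abstract can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' pipeline: it swaps the LLM for a two-parameter linear model, and every name and number in it (`physics_model`, `true_law`, the learning rate, the epoch counts, the noise level) is a hypothetical choice made only for illustration. It shows the general idea that pretraining on plentiful but biased synthetic data can give a better starting point for a short finetuning run on a handful of measurements.

```python
import random

def gd_fit(data, w, b, lr, epochs):
    """Full-batch gradient descent on mean squared error for y ~ w*x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(data, w, b):
    """Mean squared error of the linear model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def true_law(x):
    # Hypothetical "ground truth" that experiments would measure.
    return 2.0 * x + 1.0

def physics_model(x):
    # Hypothetical physics-based surrogate: cheap to evaluate but biased.
    return 1.8 * x + 0.7

rng = random.Random(0)

# Plentiful, less accurate synthetic data from the surrogate.
synthetic = [(x, physics_model(x)) for x in (rng.uniform(0, 1) for _ in range(500))]
# Scarce, noisy "experimental" data (5 points only).
experimental = [(x, true_law(x) + rng.gauss(0, 0.05)) for x in (0.1, 0.3, 0.5, 0.7, 0.9)]
# Noise-free held-out grid used only for evaluation.
held_out = [(i / 20, true_law(i / 20)) for i in range(21)]

# Phase 1: supervised pretraining on synthetic data.
w1, b1 = gd_fit(synthetic, 0.0, 0.0, lr=0.1, epochs=2000)
# Phase 2: short finetuning of the phase-1 model on experimental data.
w2, b2 = gd_fit(experimental, w1, b1, lr=0.1, epochs=50)
# Baseline: the same short finetuning budget, but starting from scratch.
ws, bs = gd_fit(experimental, 0.0, 0.0, lr=0.1, epochs=50)

mse_phase1_only = mse(held_out, w1, b1)
mse_two_phase = mse(held_out, w2, b2)
mse_scratch = mse(held_out, ws, bs)
print(f"phase 1 only: {mse_phase1_only:.4f}")
print(f"two-phase:    {mse_two_phase:.4f}")
print(f"from scratch: {mse_scratch:.4f}")
```

In this toy setting the finetuning budget is intentionally small, so the starting point matters: the pretrained model only has to correct the surrogate's bias, while the from-scratch model has to learn everything from five points. Whether and how strongly this carries over to LLMs and real polymer data is the empirical question the paper addresses.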
Taxonomy
Topics: Machine Learning in Materials Science
Methods: ALIGN
