Preference Learning from Physics-Based Feedback: Tuning Language Models to Design BCC/B2 Superalloys
Satanu Ghosh, Collin Holgate, Neal R. Brodnik, Doug Downey, Samantha Daly, Tresa M. Pollock, Samuel Carton

TL;DR
This paper demonstrates how language models can be optimized using physics-based preference learning to design novel BCC/B2 superalloys, a new approach that combines scientific calculations with AI for materials discovery.
Contribution
It introduces a physics-grounded preference tuning method for language models applied to structural alloy design, expanding AI's role in scientific material discovery.
Findings
Language models can be optimized for alloy design objectives.
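Physics-based reward signals offer a scientifically grounded alternative to heuristic or human-in-the-loop feedback.
To the authors' knowledge, first use of physics-grounded preference learning for language model-guided alloy design.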
Abstract
We apply preference learning to the task of language model-guided design of novel structural alloys. In contrast to prior work that focuses on generating stable inorganic crystals, our approach targets the synthesizability of a specific structural class: BCC/B2 superalloys, an underexplored family of materials with potential applications in extreme environments. Using three open-weight models (LLaMA-3.1, Gemma-2, and OLMo-2), we demonstrate that language models can be optimized for multiple design objectives using a single, unified reward signal through Direct Preference Optimization (DPO). Unlike prior approaches that rely on costly heuristic or human-in-the-loop feedback, our reward signal is derived from thermodynamic phase calculations, offering a scientifically grounded criterion for model tuning. To our knowledge, this is the first demonstration of preference-tuning a language…
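The abstract describes turning a single physics-derived reward into preference data for DPO. A minimal sketch of that idea follows, under stated assumptions: `physics_reward` is a hypothetical placeholder that reads a precomputed score (the actual thermodynamic scoring is not given in this summary), the candidate dictionaries and the `margin` and `beta` values are illustrative, and `dpo_loss` is the standard per-pair DPO objective rather than the authors' training code.

```python
import math
from itertools import combinations


def physics_reward(candidate):
    """Hypothetical stand-in for a thermodynamic score (higher is better).

    In the paper the reward comes from phase calculations; here we simply
    read a precomputed number so the sketch stays self-contained.
    """
    return candidate["score"]


def build_preference_pairs(prompt, candidates, margin=0.0):
    """Turn scored candidates for one prompt into (chosen, rejected) pairs.

    Pairs whose reward gap is <= margin are skipped, since they carry no
    clear preference signal.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        ra, rb = physics_reward(a), physics_reward(b)
        if abs(ra - rb) <= margin:
            continue
        chosen, rejected = (a, b) if ra > rb else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}
        )
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * (delta_chosen - delta_rejected)).

    Each delta is the policy log-probability minus the frozen reference
    log-probability of the same response.
    """
    x = beta * ((policy_chosen_logp - ref_chosen_logp)
                - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative candidates; the texts and scores are made up for the example.
candidates = [
    {"text": "candidate A", "score": 0.9},
    {"text": "candidate B", "score": 0.2},
    {"text": "candidate C", "score": 0.2},
]
pairs = build_preference_pairs("Design a BCC/B2 alloy.", candidates)
```

With these inputs the tie between B and C is dropped and A is preferred in both remaining pairs. The loss equals log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the chosen response.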
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- A major strength of this study lies in its focus on the important challenge of new materials discovery using generative AI, representing a true integration of scientific research and AI technology. In particular, by applying language models to materials design, the study demonstrates the potential for knowledge-integrated materials exploration that goes beyond conventional data-driven approaches.
- The approach of fine-tuning large language models (LLMs) with feedback from physical simulation
- This study presents a highly application-oriented approach to materials design that combines physical simulations with large language models (LLMs), but it offers limited novelty in terms of machine learning methodology itself. Therefore, it may be more suitable for publication in a specialized journal in the fields of Materials Informatics or Computational Materials Science, rather than at a top-tier machine learning conference such as ICLR, which prioritizes methodological innovation.
- Whi
1. Instead of materials generation in general, the paper targets “one-shot generation of a BCC composition, a B2 composition, and a B2 volume fraction” for a concrete, high-value family of alloys (BCC matrix + ordered B2 precipitates), which makes the problem easy to evaluate.
2. Using physics-generated preferences (from Thermo-Calc) to do DPO on top of an SFT model is a reasonable and novel adaptation of recent preference-learning/LLM-alignment techniques to materials design.
3. The paper act
1. The same Thermo-Calc setup is used to (1) synthesize SFT data, (2) generate preference pairs, and (3) evaluate success. This makes it hard to tell whether the model learned transferable “materials knowledge” or just learned to speak to this particular CALPHAD database. Cross-simulator or literature-based sanity checks are missing.
2. Only some base models (e.g., LLaMA, Gemma) improve stably after physics-DPO, while others (e.g., OLMo-2) degrade, which suggests the pipeline is sensitive to m
1. The paper is clearly written and well-motivated, convincingly arguing for a shift from optimizing simple stability to complex engineering utility.
2. It demonstrates that preference learning is a promising pathway to achieve this, successfully aligning language models with physics-grounded, multi-objective design goals.
1. The study's core contribution, preference learning via DPO, yielded only modest gains over the SFT baseline. This method proved inconsistent, as it failed on one of the three test models (OLMo), which showed significant performance degradation. This undermines the claim of a successfully applied and robust preference learning framework.
2. The paper only tested DPO and failed to explore advanced methods such as GRPO, which is widely used in training recent reasoning models. This is a signifi
DPO demonstrates a potentially useful approach to physics-based guidance to language models.
- The demonstrated use case, generating compositions and their volume fractions, is an oversimplified one in materials design, therefore not showing much value.
- The experiments and evaluations are not convincing (see Q1–2).
- The reward function relies heavily on heuristics, limiting general applicability (see Q3–4).
Taxonomy
Topics: Machine Learning in Materials Science · High Temperature Alloys and Creep · Catalysis and Oxidation Reactions
