Machine Learning Phase Classification of Thermoelectric Materials
Chung T. Ma, S. Joseph Poon

TL;DR
This paper uses machine learning to quickly classify phases of thermoelectric materials, helping speed up their development.
Contribution
A Support Vector Machine model is introduced for efficient phase classification of thermoelectric alloys.
Findings
The SVM model achieves prediction accuracies between 77% and 92% for thermoelectric alloy phases.
Cross-validation confirms the model's robustness in differentiating various TE phases.
Abstract
In this study, we employ a Support Vector Machine (SVM) model to efficiently classify the phases of thermoelectric (TE) alloys. While ab initio calculations and experiments have explored the phases of functional TE materials, the large variety of alloys makes these explorations time-consuming and expensive. Therefore, there is a critical need for time-efficient methods to accelerate the discovery and development of new TE materials. Recently, machine learning (ML) classification models have been applied to predict material phases, including those of multi-principal element alloys. Using an SVM to classify phases of TE alloys, our results demonstrate that the model achieves prediction accuracies ranging from 77% to 92%. Additionally, cross-validation across various TE phases is performed to demonstrate the model’s robustness in phase differentiation. This work offers a time-efficient…
1. Introduction
Conventional energy generation loses a considerable amount of energy to waste heat [1]. Transforming this waste heat into electricity will significantly advance the efficiency of energy production. Thermoelectric (TE) technology has the goal of turning this waste heat into electricity [2,3,4,5]. In addition, thermoelectric materials have been investigated for improving battery thermal management, which can optimize the efficiency and lifespan of these systems [6,7]. A wide range of materials has been investigated for thermoelectric applications, including Half-Heusler (HH) compounds [8,9,10,11], -based alloys [12,13,14], Ge, Pb, transition metal (TM) chalcogenides [15,16,17,18,19,20,21,22,23], oxides [24,25,26], and (Si or Sb)-based alloys [27,28,29]. The phase formations of these materials can greatly affect their thermoelectric properties. In this context, phase formation refers to the development of specific crystal structures. For instance, in Half-Heusler (HH) compounds, it corresponds to the formation of the XYZ face-centered cubic (FCC) structure [9]. Many ab initio calculations and experiments have investigated phase formations in thermoelectric alloys. However, with many types of thermoelectric materials to choose from, identifying and differentiating their phases can be time-consuming and costly.
To this end, recent studies have employed machine learning (ML) models as an efficient method to identify phase formation in various compositionally complex alloys (CCAs) and high-entropy alloys (HEAs). These phases include B2, face-centered cubic (FCC), body-centered cubic (BCC), hexagonal, and amorphous. In these studies, databases of materials have been constructed using both experimental data and first-principles calculations [30,31,32,33,34,35,36,37,38]. Both elemental parameters and alloy parameters have been used as descriptors in these models. Feature selection techniques [31,32,33,39], including correlation coefficients and wrapper methods, have been applied to select relevant raw features. Furthermore, feature engineering methods, such as mathematical transformations and one-hot encoding, have been employed to enhance model performance. Different kinds of classification models [30,31,32,33,34,35,36,37,38], including Support Vector Machine (SVM), random forest (RF), neural network (NN), and gradient boosting machine (GBM), have been developed. These models have been used to categorize various types of alloys, including multi-principal element alloys and high-entropy alloys. Some of these models have shown excellent predictive ability, achieving an accuracy of over . Furthermore, regression models have predicted numerous thermoelectric properties, including the figure of merit ZT, the Seebeck coefficient S, and the thermal conductivity [40,41,42,43,44,45,46]. Some of these models have achieved a coefficient of determination, R², of over 0.90. In addition, the application of machine learning to a complex and diverse material system parallels approaches in other fields, for example, the use of physical parameters to optimize traffic flow dynamics [47].
While several studies have employed classification models to investigate phase formation in complex alloys, and regression models to predict thermoelectric properties, there is a noticeable lack of research specifically focused on the phase classification of thermoelectric materials. As previously discussed, the phase of a thermoelectric material plays a critical role in determining its thermoelectric performance. Therefore, developing a time-efficient and cost-effective ML classification approach to distinguish between different phases of thermoelectric materials can provide essential insights. Such an approach would complement existing regression models by offering valuable guidance for the design and discovery of high-performance thermoelectric materials.
To address this problem, we focus on distinguishing the phases of various thermoelectric materials in this study. We construct databases to identify the phase formations of different groups of thermoelectric materials. These groups include Half-Heusler (HH) compounds, which form an FCC structure; (Si or Sb)-based alloys, which form a hexagonal structure; -based alloys, which form a rhombohedral structure; transition metal (TM) chalcogenides, which generally exhibit a hexagonal structure; and (Pb, Sn, Ge) chalcogenides, which may adopt hexagonal, rhombohedral, or cubic structures. For oxide-based thermoelectric materials, we classify them into four structural categories: hexagonal, perovskite, orthorhombic, and rhombohedral. Using a Support Vector Machine (SVM) with previously developed alloy parameters [33], we classify various phases of thermoelectric materials. To further enhance model accuracy, we create a new set of raw features by incorporating additional elemental parameters alongside the alloy parameters. Moreover, we evaluate the model’s ability to distinguish between different phases through cross-validation across multiple thermoelectric material databases. The results demonstrate the model’s effectiveness in classifying thermoelectric phases that exhibit specific thermoelectric properties. Thus, the model developed herein provides an important grounding for understanding the structure–property relationships essential in the development of future thermoelectric materials.
2. Methods
In this work, we adopt the ML phase classification models of Qi et al. with appropriate modifications [33] to classify thermoelectric crystal phases. The overall process of this method is illustrated in the flowchart shown in Figure 1. TE materials are grouped into databases based on their material classes and phases. These include Half-Heusler (HH), (Si or Sb)-based alloys, -based alloys, TM chalcogenides, (Pb, Sn, Ge) chalcogenides, and various oxides. Their respective crystal structures are as follows: HHs are FCC; (Si or Sb)-based alloys are hexagonal; -based alloys are rhombohedral, which consists of , , and doped derivatives of [12]; TM chalcogenides are hexagonal; (Pb, Sn, Ge) chalcogenides are either rhombohedral or cubic; and various oxides contain hexagonal, perovskite, or orthorhombic structures. Each model is trained independently on a specific phase-type database, such as the HH database. In this framework, an alloy known to form the HH phase is labeled as not forming the other phases. As a result, each model can only predict whether a given composition will form the specific phase it was trained on. For example, a model trained on the HH database will predict whether a composition can form the HH phase, regardless of whether the same composition is predicted to form (or not form) other phases by models trained on different databases. For oxides, the model can only distinguish whether a given composition forms a hexagonal phase, but not between different hexagonal phases. Random forest (RF) and Support Vector Machine (SVM) classification models are trained to categorize the phase formations of TE alloys. Only the results of SVM classification models are shown herein, because SVM models show mostly higher accuracy than RF models. The accuracy used to evaluate the models’ performance is the overall accuracy. For detailed comparisons between different ML classification models, refer to previous work by Qi et al. [33]. 
During each training, of the data is used for training and the other is for validations. This process is repeated ten times.
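The repeated split protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 80/20 split fraction, the synthetic data, and the names `repeated_holdout` and `fit_nearest_centroid` are assumptions introduced here, and a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library.

```python
import random
import statistics

def repeated_holdout(X, y, fit, n_repeats=10, train_fraction=0.8, base_seed=0):
    """Return one validation accuracy per random train/validation split."""
    accuracies = []
    n = len(X)
    n_train = int(round(train_fraction * n))  # train_fraction is an assumed value
    for r in range(n_repeats):
        rng = random.Random(base_seed + r)  # a different seed for every repeat
        idx = list(range(n))
        rng.shuffle(idx)
        train, val = idx[:n_train], idx[n_train:]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in val)
        accuracies.append(correct / len(val))
    return accuracies

def fit_nearest_centroid(X_train, y_train):
    """Stand-in classifier (NOT an SVM): predict the label of the closest class centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict

# Synthetic two-feature data: 1 = "forms the target phase", 0 = "does not".
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
X += [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

accs = repeated_holdout(X, y, fit_nearest_centroid)
print(f"mean accuracy = {statistics.mean(accs):.2f}, "
      f"range = {min(accs):.2f} to {max(accs):.2f}")
```

Reporting both the mean and the minimum-to-maximum range over the ten repeats mirrors how the accuracies are summarized in the results; any classifier exposing the same `fit` interface could be substituted for the stand-in.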
For feature selection, we start with raw features obtained from various thermodynamic and Hume-Rothery parameters [33]. These features are mixing entropy , mixing enthalpy (obtained from Miedema’s model) [48], , , , , radius mismatch , / , electronegativity mismatch , and mean valence electron concentration VEC. The definitions of these raw features are listed in Table 1. We also utilize elemental parameters, obtained from Matminer [49], as raw features. These elemental parameters include covalent radius, first ionization energy, and Mendeleev number. For a given alloy, the elemental features also include the minimum, maximum, weighted average, standard deviation of the mean, and range of each elemental parameter. In total, 10 alloy parameters and 15 elemental parameters serve as the raw features in this work.
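As a rough illustration of how such descriptors are computed from a composition, the sketch below evaluates the ideal mixing entropy and the five per-property statistics. Several things here are assumptions rather than the paper's definitions: the ideal-solution form of the mixing entropy, the composition-weighted standard deviation taken about the weighted average, the function names, and the placeholder property values (which are not real covalent radii). The definitions in Table 1 and the values supplied by Matminer take precedence.

```python
import math

R_GAS = 8.314  # gas constant in J/(mol K)

def mixing_entropy(fractions):
    """Ideal configurational mixing entropy: -R * sum(c_i * ln c_i)."""
    return -R_GAS * sum(c * math.log(c) for c in fractions if c > 0)

def elemental_stats(fractions, values):
    """Five statistics of one elemental property over a composition.

    `fractions` are atomic fractions assumed to sum to 1; `values` holds the
    property value of each element in the same order.
    """
    mean = sum(c * v for c, v in zip(fractions, values))
    variance = sum(c * (v - mean) ** 2 for c, v in zip(fractions, values))
    return {
        "min": min(values),
        "max": max(values),
        "weighted_avg": mean,
        "std": math.sqrt(variance),  # weighted spread about the weighted average
        "range": max(values) - min(values),
    }

# Equiatomic three-element composition with placeholder property values.
fractions = [1 / 3, 1 / 3, 1 / 3]
placeholder_values = [1.0, 2.0, 3.0]  # illustrative only, not measured data

print(mixing_entropy(fractions))  # equals R * ln(3) for an equiatomic ternary
print(elemental_stats(fractions, placeholder_values))
```

Applying `elemental_stats` to each of the three elemental parameters gives 3 × 5 = 15 elemental features, consistent with the count stated above.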
Feature engineering is employed to improve the performance of the machine learning model [33]. First, a new set of features is constructed from raw features X, using mathematical variations , , , ln(x), and . Then, the set of features is further expanded by combining pairs of mathematical variations, A and B, using the following arithmetic operations: A+B, A-B, A/B, and AB. To filter the expanded set of features, the Pearson Correlation Coefficient (PCC) is employed. For any feature pair with |PCC| > 0.9, only one feature is kept in the model. In this way, any pair of strongly correlated features, whether positively or negatively correlated, is reduced to a single feature. Then, a logistic regression with L1 (or Lasso) regularization is used to directly select important features and eliminate uninformative ones. This selection is achieved by minimizing the total prediction penalty, which is a trade-off between reducing prediction error and limiting the number of selected features. Finally, a sequential learning algorithm selects the best features by minimizing the average prediction error from thirty rounds of five-fold cross-validation. For each round, the top feature is selected for ML. After these steps, only the top features are used for phase classifications in this work. For alloy parameters, the top five features are , , , , and . With the addition of elemental parameters, the top five features are , , x Mendeleev number (weighted average), , and .
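The pairwise expansion and the correlation filter can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the toy columns are made up, the filter greedily keeps the first feature of each correlated pair (the text does not say which member of a pair is retained), and the later L1-regularized logistic regression and sequential selection steps are not shown.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / math.sqrt(va * vb)

def expand_pairs(features):
    """Add A+B, A-B, A/B, and A*B for every pair of columns.

    Assumes no column contains zeros, so that A/B is always defined.
    """
    out = dict(features)
    for (na, a), (nb, b) in combinations(features.items(), 2):
        out[f"({na})+({nb})"] = [x + y for x, y in zip(a, b)]
        out[f"({na})-({nb})"] = [x - y for x, y in zip(a, b)]
        out[f"({na})/({nb})"] = [x / y for x, y in zip(a, b)]
        out[f"({na})*({nb})"] = [x * y for x, y in zip(a, b)]
    return out

def pcc_filter(features, threshold=0.9):
    """Keep a feature only if |PCC| with every already-kept feature is <= threshold."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, k)) <= threshold for k in kept.values()):
            kept[name] = col
    return kept

# Toy columns: "neg_a" is perfectly anti-correlated with "a"; "c" is nearly uncorrelated.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
c = [2.0, 5.0, 1.0, 4.0, 3.0]
raw = {"a": a, "neg_a": [-v for v in a], "c": c}
print(list(pcc_filter(raw)))  # keeps 'a' and 'c'; 'neg_a' is dropped

expanded = expand_pairs({"a": a, "c": c})
print(list(pcc_filter(expanded)))
```

Because the filter uses the absolute value of the coefficient, a strongly negative correlation is treated the same way as a strongly positive one, as the text requires.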
3. Results and Discussion
First, we examine the accuracy of phase classifications using alloy parameters as raw features. As shown in Table 2, using SVM, the accuracy ranges from to across material groups. In comparison, RF yields accuracy ranging from to , which justifies the selection of SVM for this study. The listed accuracy and the range of accuracy represent the average prediction accuracy and the range from ten repeated calculations with different random seeds, respectively; the seed affects feature selection, feature engineering, and the ML classification algorithm. The highest accuracy is for predicting the HH phase, with an accuracy of . Then, there is a rather noticeable drop in accuracy to to for predicting (Si or Sb)-based alloys, -based alloys, and TM chalcogenides. A further drop in accuracy to is seen in predicting (Pb, Sn, Ge) chalcogenides. The lowest accuracy is for predicting the oxides, which has accuracy for perovskites and the orthorhombic phase, and accuracy for the hexagonal and rhombohedral phases. We also examine specific examples where the model performs well and where it is less successful. For example, for HH compounds, the model accurately predicts the formation of the HH phase for and various doped alloys of , such as and . However, the model fails to predict the HH phase formation for . From these results, the model can predict the intermetallic HH phase with the highest accuracy. However, when the alloy group contains more small-group (non-metallic) elements, such as Si and Te, the accuracy decreases to around . Further decreases are seen in the oxides, which contain non-metal O. A possible explanation for this decrease in accuracy is the incomplete alloy parameters for some of these semi-metal or non-metal alloys. While parameters, such as entropy S, are well-defined for any given alloy, other parameters, including mixing enthalpy H, are estimations, which can be inaccurate.
Furthermore, parameters, such as melting temperature , can vary greatly for different i-j element pairs. Another plausible reason is that HH compounds form a well-defined FCC structure. In contrast, other material groups, such as chalcogenides, can adopt multiple crystal structures, including hexagonal, rhombohedral, or cubic, making their phase identification more challenging. Despite these limitations, the model is still able to predict the phase formation with a reasonable prediction accuracy of or above.
Then, we include several elemental parameters, including covalent radius, first ionization energy, and Mendeleev number, to re-examine the model. By incorporating elemental parameters with alloy parameters, the prediction accuracy increases by to for all material groups, as shown in Table 3. Starting with the HH group, the prediction accuracy increases from to . For (Si or Sb)-based alloys, the prediction accuracy increases from to . For -based alloys, the prediction accuracy increases from to . For TM chalcogenide alloys, the prediction accuracy increases from to . For (Pb, Ge, or Sn) chalcogenide alloys, the prediction accuracy increases from to . For the oxides, the hexagonal phase increases from to , the perovskite phase increases from to , the orthorhombic phase increases from to , and the rhombohedral phase increases from to . These increases in prediction accuracy can be attributed to the inclusion of more well-defined elemental parameters in the model and the incorporation of some missing physical concepts in alloy parameters. These physical concepts, such as covalent radius and first ionization energy, can play a key role in phase formation. Thus, including these parameters can improve the model. For HH, since the original model with alloy parameters can already predict well with well-defined alloy parameters, the increase in prediction accuracy is marginal, from to . More notable improvements of to are found in the oxide group, which originally contained more estimated parameters. Thus, incorporating elemental parameters has a greater influence on the oxide group. Overall, by using alloy and elemental parameters with feature engineering, the model achieves prediction accuracies of or higher across the nine different material groups.
To examine the model’s ability to distinguish between material groups, we perform a cross-validation in which each target material group is tested using models trained on a different material dataset. Table 4 shows the results of this cross-validation. The diagonal terms are the accuracies for each targeted material group obtained with the model trained on its own dataset. These accuracies are the same as those in Table 2, because they come from the same models. The off-diagonal terms are the false positive rates for targeted materials evaluated with a model trained on a different dataset. In other words, each off-diagonal term represents the fraction of alloys from another material group that a given model falsely predicts as belonging to the trained material group. For example, in the HH column and (Si, Sb)-based row, the false positive rate is 0.05. This means that when a model trained on the HH dataset is applied to alloys from the (Si, Sb)-based dataset, it falsely predicts that 5% of those materials will form the HH phase. For the majority of these cross-validations, the false positive rate ranges from 0.01 to 0.10. The models have the most trouble distinguishing between the TM chalcogenides and the (Pb, Ge, or Sn) chalcogenides: the false positive rate reaches 0.21 when the model trained on (Pb, Ge, or Sn) chalcogenides is applied to TM chalcogenides, and 0.28 when the model trained on TM chalcogenides is applied to (Pb, Ge, or Sn) chalcogenides. This is likely because the two groups overlap in material phase space, as both contain chalcogen elements (S, Se, or Te). In addition, both chalcogenide groups share similar crystal structures; as mentioned earlier, both can adopt hexagonal, rhombohedral, or cubic structures, which may also contribute to the model’s confusion between these two groups.
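The off-diagonal entries described above can be computed with a few lines of code. The sketch below is illustrative only: the group names "A" and "B", the one-dimensional toy feature, the threshold "models", and the function names are all hypothetical, and the matrix follows the row/column convention stated in the text (row = target dataset, column = dataset the model was trained on).

```python
def false_positive_rate(predict, other_group):
    """Fraction of samples from another group that a model claims as its own phase."""
    flagged = sum(1 for x in other_group if predict(x))
    return flagged / len(other_group)

def cross_group_matrix(models, groups):
    """Off-diagonal false positive rates, indexed as matrix[target][trained]."""
    return {
        target: {
            trained: false_positive_rate(models[trained], samples)
            for trained in models
            if trained != target
        }
        for target, samples in groups.items()
    }

# Hypothetical one-feature "models": each returns True if it predicts its own phase.
models = {
    "A": lambda x: x < 0.5,
    "B": lambda x: x >= 0.4,
}
# Hypothetical samples known to belong to each group.
groups = {
    "A": [0.1, 0.2, 0.3, 0.45],
    "B": [0.42, 0.6, 0.7, 0.8, 0.9],
}

matrix = cross_group_matrix(models, groups)
print(matrix["A"]["B"])  # model trained on B, applied to group A
print(matrix["B"]["A"])  # model trained on A, applied to group B
```

Each entry answers one question: of the materials known to belong to the row's group, what fraction does the column's model wrongly claim for its own phase? The matrix is not symmetric in general, which is why the text reports two different rates (0.21 and 0.28) for the two chalcogenide groups.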
For oxides, when presented with an oxide, the model attempts to classify it into one of the four oxide phase types included in this study: hexagonal, perovskites, orthorhombic, or rhombohedral. Ideally, the model would be able to accurately differentiate among all four categories. However, as shown in Table 2 and Table 3, the overall prediction accuracy for oxides ranges from to . As a result, there are instances of false positives, where the model predicts the formation of multiple oxide phases for a single material. Addressing this limitation will require future experimental validation and more precise ab initio calculations to obtain detailed alloy-specific parameters, which could enhance the model’s predictive accuracy. From this cross-validation, the models show robustness in identifying and distinguishing different TE phases.
4. Conclusions
We have employed Support Vector Machine (SVM) to predict phases of thermoelectric (TE) alloys, with the goal of identifying and distinguishing different TE phases so that specific phases can be predicted correctly. Our initial model, using only alloy parameters, achieved accuracies ranging from to . With the incorporation of additional elemental parameters, the accuracies improved to between and . To further evaluate the model’s robustness, we performed cross-validation across various TE material groups. Notably, the model achieved a low false positive rate of 0.01 when predicting whether chalcogenide or oxide alloys would incorrectly form the HH phase, and vice versa. However, the model struggled to distinguish between transition metal (TM) chalcogenides and (Pb, Ge, or Sn)-based chalcogenides, with false positive rates reaching 0.21 and 0.28, respectively. With future experimental validation and more accurate ab initio calculations, the precision of alloy parameters can be significantly improved, which is expected to enhance the model’s performance. Overall, this study provides an important step toward the reliable identification of phase formations in TE alloys, which serve as a critical foundation for the design and discovery of high-performance TE materials for future energy applications.
References
- 1. He J., Tritt T.M. Advances in thermoelectric materials research: Looking back and moving forward. Science 2017, 357, eaak9997. doi:10.1126/science.aak9997. PMID: 28963228.
- 2. Wei J., Yang L., Ma Z., Song P., Zhang M., Ma J., Yang F., Wang X. Review of current high-ZT thermoelectric materials. J. Mater. Sci. 2020, 55, 12642–12704. doi:10.1007/s10853-020-04949-0.
- 3. Shi X.L., Zou J., Chen Z.G. Advanced Thermoelectric Design: From Materials and Structures to Devices. Chem. Rev. 2020, 120, 7399–7515. doi:10.1021/acs.chemrev.0c00026. PMID: 32614171.
- 4. Hasan M.N., Wahid H., Nayan N., Mohamed Ali M.S. Inorganic thermoelectric materials: A review. Int. J. Energy Res. 2020, 44, 6170–6222. doi:10.1002/er.5313.
- 5. Mukherjee M., Srivastava A., Singh A.K. Recent advances in designing thermoelectric materials. J. Mater. Chem. C 2022, 10, 12524–12555. doi:10.1039/D2TC02448A.
- 6. Qi W., Lan P., Yang J., Chen Y., Zhang Y., Wang G., Peng F., Hong J. Multi-U-Style micro-channel in liquid cooling plate for thermal management of power batteries. Appl. Therm. Eng. 2024, 256, 123984. doi:10.1016/j.applthermaleng.2024.123984.
- 7. Qi W., Yang J., Zhang Z., Wu J., Lan P., Xiang S. Investigation on thermal management of cylindrical lithium-ion batteries based on interwound cooling belt structure. Energy Convers. Manag. 2025, 340, 119962. doi:10.1016/j.enconman.2025.119962.
- 8. Zhu T., Fu C., Xie H., Liu Y., Zhao X. High Efficiency Half-Heusler Thermoelectric Materials for Energy Harvesting. Adv. Energy Mater. 2015, 5, 1500588. doi:10.1002/aenm.201500588.
