# Word embeddings as autonomous predictors in materials design—the effect of inherent variability on information transfer

**Authors:** Jana Radaković, Katarina Batalović, Nikola Novaković

PMC · DOI: 10.1186/s13321-025-01149-3 · Journal of Cheminformatics · 2026-02-09

## TL;DR

This paper examines whether word embeddings of atoms learned from scientific literature can serve as standalone predictors of material properties, and finds that the variability of those embeddings does not obstruct their use in a regression model for formation energy.

## Contribution

The study quantifies variability in word embeddings of chemical elements and shows it does not hinder predictive performance in materials design.

## Key findings

- Individual atomic embeddings vary substantially, and the variation depends strongly on the vocabulary terms selected for language modelling.
- This variability does not prevent a regression model built on joined atomic embeddings from estimating compound stability through predicted formation energy.
- The encoded information and the model's predictive performance remain stable after compound vectors are calibrated by dimensional reduction.
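
As an illustration of the first finding, the sketch below shows one common way to compare the embedding of the same element across two independently trained models: the overlap of its nearest-neighbour sets. This summary does not state which variability metric the paper uses, so the Jaccard overlap, the function names, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(model, word, k):
    """Return the set of the k words most similar to `word` within one model."""
    target = model[word]
    others = [w for w in model if w != word]
    ranked = sorted(others, key=lambda w: cosine(target, model[w]), reverse=True)
    return set(ranked[:k])


def neighbour_overlap(model_a, model_b, word, k=2):
    """Jaccard overlap of the k-nearest-neighbour sets of `word` in two models.

    1.0 means identical neighbourhoods; values near 0.0 mean the two models
    place the word in very different contexts. (Hypothetical metric, not
    necessarily the one used in the paper.)
    """
    neighbours_a = nearest_neighbours(model_a, word, k)
    neighbours_b = nearest_neighbours(model_b, word, k)
    return len(neighbours_a & neighbours_b) / len(neighbours_a | neighbours_b)


# Toy embeddings standing in for two models trained on different vocabularies.
model_a = {
    "Li": [1.0, 0.0],
    "Na": [0.9, 0.1],
    "K": [0.8, 0.2],
    "O": [0.0, 1.0],
    "S": [0.1, 0.9],
}
model_b = {
    "Li": [0.0, 1.0],
    "Na": [0.1, 0.9],
    "K": [0.9, 0.1],
    "O": [1.0, 0.0],
    "S": [0.2, 0.8],
}

overlap_li = neighbour_overlap(model_a, model_b, "Li", k=2)
print(f"Neighbour overlap for Li: {overlap_li:.3f}")
```

Comparing neighbourhoods rather than raw vectors avoids having to align the two embedding spaces, which are otherwise only defined up to the arbitrary orientation each training run produces.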

## Abstract

We propose that word embeddings of atoms derived from scientific literature be revisited as autonomous machine learning predictors in materials design. If static word embeddings encode comprehensive physicochemical information, joined embeddings of the chemical elements constituting a chemical compound represent a viable source of physicochemical knowledge. Nevertheless, static word embeddings are susceptible to variability due to information heterogeneity within the training material. We analysed whether variability occurs in embeddings affiliated with physicochemical entities, including explicit atoms, and whether it affects the domain-specialized information encoded therein or inhibits the information transfer. The results demonstrate substantial variability in individual atomic embeddings, which is highly dependent on the vocabulary terms selected for language modelling. Regardless, variability does not obstruct the mapping of materials' composite predictors into physicochemical properties when joined atomic embeddings are implemented within a regression model that estimates compound stability by predicting formation energy. Moreover, the encoded information and the model's predictive performance remained stable following compound vector calibration via dimensional reduction.

### Scientific contribution

The study observed and quantified the magnitude of variability in word embeddings of physicochemical entities, including chemical elements, that arises from information heterogeneity in complementary training material drawn from the materials science, chemistry, and physics literature. The research shows that notable variability in the vectorial representations of chemical elements does not obstruct the underlying statistical properties, nor does it inhibit the information transfer. Accordingly, regardless of their origin, conjoined atomic embeddings representing chemical compounds facilitate stable predictive performance when implemented within a regression model.
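
To make the idea of "conjoined atomic embeddings representing chemical compounds" concrete, here is a minimal sketch that builds a compound vector from element embeddings. The summary does not say how the authors join the atomic vectors, so the stoichiometry-weighted mean used here is an assumption, as are the helper names `parse_formula` and `compound_vector` and the toy two-dimensional embeddings.

```python
import re


def parse_formula(formula):
    """Parse a simple formula such as 'Li2O' into {'Li': 2.0, 'O': 1.0}.

    Only element symbols followed by an optional number are handled;
    brackets and hydrates are out of scope for this sketch.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def compound_vector(formula, embeddings):
    """Stoichiometry-weighted mean of the element embeddings in `formula`.

    Assumption: the weighted mean is one plausible way to join atomic
    embeddings; it is not confirmed as the paper's method.
    """
    counts = parse_formula(formula)
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    for symbol, amount in counts.items():
        weight = amount / total
        for i, value in enumerate(embeddings[symbol]):
            vector[i] += weight * value
    return vector


# Toy element embeddings (hypothetical values, not taken from any trained model).
element_embeddings = {
    "Li": [3.0, 0.0],
    "O": [0.0, 3.0],
}

li2o = compound_vector("Li2O", element_embeddings)
print(li2o)
```

A vector built this way has the same dimensionality for every compound, so it can be fed directly to a regression model that predicts formation energy.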

## Full-text entities

- **Genes:** INS (insulin) [NCBI Gene 3630] {aka IDDM, IDDM1, IDDM2, ILPR, IRDN, MODY10}
- **Diseases:** hallucinations (MESH:D006212)
- **Chemicals:** CO2 (MESH:D002245), H2O (MESH:D014867), americium (MESH:D000576), buckyball (MESH:D037741), lithium (MESH:D008094), Hydrogen (MESH:D006859), glucose (MESH:D005947), He (MESH:D006371), actinide (MESH:D008671), Graphite (MESH:D006108), C60 - Buckminsterfullerene (-), ozone (MESH:D010126), hydrocarbon (MESH:D006838), carbon (MESH:D002244), ethyne (MESH:D000114)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** C62N

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12888715/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12888715/full.md

## References

1 reference, listed in the complete paper: https://tomesphere.com/paper/PMC12888715/full.md

---
Source: https://tomesphere.com/paper/PMC12888715