From Tokens to Materials: Leveraging Language Models for Scientific Discovery

Yuwei Wan; Tong Xie; Nan Wu; Wenjie Zhang; Chunyu Kit; Bram Hoex

arXiv:2410.16165 · cs.CL · November 5, 2024 · 2 cites

TL;DR

This paper demonstrates that domain-specific language models like MatBERT, combined with specialized tokenization, significantly improve material property prediction from scientific literature, advancing AI-driven materials discovery.

Contribution

It introduces the use of domain-specific language models and optimized tokenization techniques for better material property prediction from scientific texts.

Findings

01. MatBERT outperforms general models in material-property tasks
02. Layer 3 embeddings with context averaging are most effective
03. Specialized tokenization preserves compound information
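To illustrate why the vocabulary matters for compound names, here is a minimal sketch of greedy longest-match subword splitting in the style of BERT's WordPiece. Both vocabularies are invented for this example and do not reflect MatBERT's actual vocabulary or the paper's tokenizer experiments; `wordpiece`, `GENERAL_VOCAB`, and `MATERIALS_VOCAB` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (assumptions made up for illustration only).
GENERAL_VOCAB = {"per", "##ov", "##sk", "##ite", "li", "##th", "##ium"}
MATERIALS_VOCAB = GENERAL_VOCAB | {"perovskite", "lithium"}

general = wordpiece("perovskite", GENERAL_VOCAB)
domain = wordpiece("perovskite", MATERIALS_VOCAB)
print(general)  # ['per', '##ov', '##sk', '##ite']
print(domain)   # ['perovskite']
```

With the toy general vocabulary the material name is fragmented into four pieces, while the toy domain vocabulary keeps it as a single token. The real effect depends on the tokenizer and corpus each model was trained with.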

Abstract

The predictive capabilities of language models in materials science are a subject of ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer…
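The context-averaging idea mentioned in the abstract can be sketched roughly as follows: for every sentence that mentions a compound, take the chosen layer's vectors for the compound's tokens, mean-pool them, then average across sentences. This is a minimal sketch under stated assumptions, not the authors' implementation. The data is synthetic, `compound_embedding` and `fake_hidden_states` are hypothetical helpers, and treating index 3 as "the third layer" is an assumption about indexing.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compound_embedding(contexts, layer=3):
    """Average one layer's compound-token vectors over many contexts.

    contexts: list of (hidden_states, (start, end)) pairs, where
    hidden_states[layer][token] is a vector and (start, end) is the
    half-open token span of the compound in that sentence.
    """
    per_context = []
    for hidden_states, (start, end) in contexts:
        token_vectors = hidden_states[layer][start:end]
        # Pool the compound's subword tokens within one sentence.
        per_context.append(mean_vectors(token_vectors))
    # Average the pooled vectors across all sentences.
    return mean_vectors(per_context)


def fake_hidden_states(num_layers, num_tokens, offset):
    """Synthetic stand-in for model output: vector = [layer + offset, token]."""
    return [
        [[layer + offset, float(token)] for token in range(num_tokens)]
        for layer in range(num_layers)
    ]


# Two synthetic "sentences" mentioning the same compound.
contexts = [
    (fake_hidden_states(4, 5, offset=0.0), (1, 3)),  # compound at tokens 1-2
    (fake_hidden_states(4, 4, offset=1.0), (0, 2)),  # compound at tokens 0-1
]
embedding = compound_embedding(contexts, layer=3)
print(embedding)  # [3.5, 1.0]
```

When using Hugging Face `transformers`, passing `output_hidden_states=True` yields a tuple whose index 0 is the embedding output, so index 3 would correspond to the output of the third encoder layer under that convention.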

Code & Models

Repositories

MasterAI-EAM/MatEmbedding (PyTorch, official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies