Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee

TL;DR
This paper introduces MATTER, a tokenization method for materials science language models that integrates domain knowledge to keep material concepts intact, improving performance over frequency-centric tokenization methods.
Contribution
We propose MATTER, a novel tokenization approach that incorporates material domain knowledge to maintain semantic and structural integrity during tokenization.
Findings
MATTER outperforms existing tokenization methods, with average gains of 4% and 2% on generation and classification tasks, respectively.
Incorporating domain knowledge reduces token fragmentation and semantic loss.
Experimental results validate the effectiveness of MATTER in materials science text processing.
Abstract
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average…
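The idea of prioritizing material concepts during token merging can be illustrated with a toy BPE-style trainer. This is a minimal sketch under stated assumptions, not the authors' implementation: the `material_terms` set is a hypothetical stand-in for the output of MatDetector, and the multiplicative `boost` weight is an assumed re-ranking scheme chosen only for illustration. The paper's actual scoring may differ.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_merges(corpus, material_terms, num_merges, boost=1.0):
    """Learn BPE-style merges from `corpus` (a dict of word -> frequency).

    Pair counts from words in `material_terms` (hypothetical detector output)
    are multiplied by `boost` (assumed hyperparameter), so merges that build
    material concepts are ranked ahead of merges driven by frequency alone.
    With boost=1.0 this reduces to plain frequency-based BPE.
    """
    words = [(list(w), f, w in material_terms) for w, f in corpus.items()]
    merges = []
    for _ in range(num_merges):
        scores = Counter()
        for symbols, freq, is_material in words:
            weight = freq * (boost if is_material else 1.0)
            for a, b in zip(symbols, symbols[1:]):
                scores[(a, b)] += weight
        if not scores:
            break  # every word is already a single symbol
        # Highest score wins; ties are broken by the pair itself for determinism.
        best = max(scores, key=lambda p: (scores[p], p))
        merges.append(best)
        words = [(merge_pair(s, best), f, m) for s, f, m in words]
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus: a rare material formula among frequent common words.
corpus = {"the": 50, "then": 30, "there": 20, "LiFePO4": 3}
material_terms = {"LiFePO4"}  # hypothetical detector output

baseline = train_merges(corpus, material_terms, num_merges=6, boost=1.0)
boosted = train_merges(corpus, material_terms, num_merges=6, boost=100.0)

print(tokenize("LiFePO4", baseline))
print(tokenize("LiFePO4", boosted))
```

In this toy setting, the frequency-only run spends its merge budget on the common words and leaves the formula split into several pieces, while the boosted run keeps it as a single token. The trade-off is that the common words receive fewer merges, which is why a real system would need to balance the two signals rather than use a large fixed boost.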
Taxonomy
Topics: Manufacturing Process and Optimization · Additive Manufacturing and 3D Printing Technologies · Injection Molding Process and Properties
