Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arróyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji

TL;DR
This paper introduces Mat2Seq, a novel method for converting 3D crystal structures into unique 1D sequences that are invariant under SE(3) and periodic transformations, improving crystal generation with language models.
Contribution
Mat2Seq converts a 3D crystal structure into a 1D sequence that is provably invariant under SE(3) and periodic transformations, so that different mathematical descriptions of the same crystal map to a single unique sequence for language model generation.
Findings
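Mat2Seq maps different mathematical descriptions of the same crystal to a single unique sequence, which gives SE(3) and periodic invariance.
CIF file streams used in prior work do not guarantee this invariance or uniqueness.
With language models, Mat2Seq shows promising crystal structure generation performance compared with prior methods.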
Abstract
We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
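To make the invariance problem concrete, the following is a minimal Python sketch, not the Mat2Seq algorithm itself. It assumes a hypothetical CsCl-like cubic cell, and the helper names (`lattice_params`, `rotate_z`, `naive_sequence`, `toy_canonical_sequence`) are illustrative inventions. It shows that lattice lengths and angles are unchanged by a rotation, that a naive serialization of fractional coordinates changes when the periodic origin is shifted, and that a toy canonicalization (try every atom as origin, wrap, sort, keep the lexicographically smallest string) removes that particular ambiguity.

```python
import math

def lattice_params(cell):
    """Return (a, b, c, alpha, beta, gamma) from three lattice vectors; angles in degrees."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def angle(u, v):
        cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    a, b, c = cell
    return (norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b))

def rotate_z(v, theta):
    """Rotate a 3D vector by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])

def naive_sequence(species, frac, ndigits=3):
    """Serialize atoms in the order given, the way a raw file stream would."""
    return " ".join(
        f"{sym} " + " ".join(f"{p:.{ndigits}f}" for p in pos)
        for sym, pos in zip(species, frac)
    )

def toy_canonical_sequence(species, frac, ndigits=3):
    """Toy canonical form (an assumption for illustration, not the paper's method)."""
    candidates = []
    for origin in frac:
        atoms = []
        for sym, pos in zip(species, frac):
            # Shift so `origin` sits at 0, wrap into [0, 1), round to absorb float noise.
            wrapped = tuple(round((p - o) % 1.0, ndigits) % 1.0 for p, o in zip(pos, origin))
            atoms.append((sym, wrapped))
        atoms.sort()  # fixed atom order, independent of the input order
        candidates.append(naive_sequence([s for s, _ in atoms], [w for _, w in atoms], ndigits))
    return min(candidates)  # lexicographically smallest candidate

# Hypothetical CsCl-like structure: cubic cell, two atoms.
cell = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
species = ["Cs", "Cl"]
frac = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]

# Two other descriptions of the same crystal: a rotated cell and a shifted origin.
rotated_cell = [rotate_z(v, 0.7) for v in cell]
shift = (0.25, 0.1, 0.7)
shifted_frac = [tuple((p + t) % 1.0 for p, t in zip(pos, shift)) for pos in frac]

print(lattice_params(cell))
print(lattice_params(rotated_cell))  # same lengths and angles up to float error
print(naive_sequence(species, frac) == naive_sequence(species, shifted_frac))  # False
print(toy_canonical_sequence(species, frac) == toy_canonical_sequence(species, shifted_frac))  # True
```

This toy only handles rotations and origin shifts for a fixed unit cell. It does not address different choices of unit cell for the same crystal, and its rounding step can fail for coordinates near a rounding boundary, so it should not be read as a uniqueness guarantee of the kind the abstract claims for Mat2Seq.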
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
