Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics
Jihun Ahn, Gabriella Pasya Irianti, Vikram Thapar, Su-Mi Hur

TL;DR
This paper introduces CI-LLM, a polymer-focused language-model framework that uses hierarchical molecular representations to improve property prediction and enable inverse design, overcoming the data scarcity that limits polymer informatics.
Contribution
The paper presents a new molecular representation and transformer-based framework that enhances polymer property prediction and inverse design, with faster inference and interpretable insights.
Findings
De$^3$BERTa achieves 3.5x faster inference with improved accuracy.
The model attains 100% scaffold retention in inverse design.
The model successfully performs multi-property optimization, even for negatively correlated objectives.
Abstract
Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. While data scarcity is often cited as the primary bottleneck, we demonstrate that strategic molecular representations can overcome this limitation. We introduce CI-LLM (Chemically Informed Language Model), a framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$^3$BERTa, our descriptor-enriched encoder, achieves 3.5x faster inference than SMILES-based models with improved accuracy ($R^2$ score gains of 0.9-4.1 percent across four properties), while providing interpretable structure-property insights at the subgroup level. For inverse design, our GPT-based generator produces…
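The idea of encoding chemical substructures as tokens, rather than individual SMILES characters, can be illustrated with a small sketch. This is not the HAPPY scheme: the vocabulary `TOY_SUBSTRUCTURES`, the bracketed token names, and the greedy string matching are all assumptions made for illustration. A real implementation would need chemistry-aware substructure detection, not plain text matching. The sketch only shows that substructure-level tokens give shorter sequences, which is one plausible contributor to faster inference.

```python
# Toy illustration only. This vocabulary is hypothetical and is NOT the
# HAPPY vocabulary from the paper; fragments are matched as plain text.
TOY_SUBSTRUCTURES = {
    "c1ccccc1": "[Benzene]",
    "C(=O)O": "[Ester]",
    "C(=O)N": "[Amide]",
}


def char_tokens(smiles):
    """Character-level tokenization: one token per SMILES character."""
    return list(smiles)


def substructure_tokens(smiles, vocab=TOY_SUBSTRUCTURES):
    """Greedy longest-match tokenization over a fragment vocabulary.

    Known fragments collapse into one abstract token; any character that
    starts no known fragment is emitted as its own token.
    """
    fragments = sorted(vocab, key=len, reverse=True)  # try longest first
    tokens = []
    i = 0
    while i < len(smiles):
        for frag in fragments:
            if smiles.startswith(frag, i):
                tokens.append(vocab[frag])
                i += len(frag)
                break
        else:
            # No fragment matched at position i: fall back to one character.
            tokens.append(smiles[i])
            i += 1
    return tokens


# Polystyrene repeat unit, with '*' marking the attachment points.
repeat_unit = "*CC(*)c1ccccc1"
print(len(char_tokens(repeat_unit)), char_tokens(repeat_unit))
print(len(substructure_tokens(repeat_unit)), substructure_tokens(repeat_unit))
```

In this toy case the aromatic ring collapses from eight character tokens into one, halving the sequence length. Text matching of this kind breaks on equivalent SMILES spellings of the same ring, which is why it should be read as a sketch of the concept only.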
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Graph Neural Networks · Computational Drug Discovery Methods
