AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure-Property Associations
Mingxu Zhang, Dazhong Shen, Ying Sun

TL;DR
AtomDisc introduces a fine-grained atom-level tokenizer for molecular LLMs, capturing local environments to improve property prediction, molecular generation, and interpretability in chemical reasoning.
Contribution
The paper presents a novel atom-level tokenization framework that encodes local atomic environments, enhancing the ability of molecular LLMs to capture structure-property relationships.
Findings
Achieves state-of-the-art performance in property prediction
Improves molecular generation quality
Reveals meaningful structure-property associations
Abstract
Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, there is still no effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity. To address these issues, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically…
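To make the idea of quantizing atom-level environments into tokens concrete, here is a minimal sketch in Python. It is not AtomDisc's implementation: it assumes each atom already has a fixed-length environment vector (for example, from a graph encoder) and that a codebook of prototype vectors is available, and it simply assigns each atom to its nearest codebook entry. The function names, the toy 2-D vectors, and the `<atom_k>` token format are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def quantize_atoms(atom_vectors, codebook):
    """Return, for each atom vector, the index of the nearest codebook entry.

    Assumption: nearest-neighbor lookup under Euclidean distance stands in
    for whatever quantization scheme the paper actually uses.
    """
    token_ids = []
    for vec in atom_vectors:
        nearest = min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))
        token_ids.append(nearest)
    return token_ids


def to_tokens(token_ids, prefix="atom"):
    """Render codebook indices as hypothetical vocabulary tokens like <atom_0>."""
    return [f"<{prefix}_{i}>" for i in token_ids]


# Toy data (hypothetical): three atoms with 2-D environment vectors
# and a two-entry codebook.
atom_vectors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1)]
codebook = [(0.0, 0.0), (1.0, 1.0)]

token_ids = quantize_atoms(atom_vectors, codebook)
tokens = to_tokens(token_ids)
print(tokens)
```

In this toy setting, atoms with similar environment vectors share a token, which is the property that would let an LLM treat recurring local environments as vocabulary items.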
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Advanced Graph Neural Networks
