UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, Quanming Yao

TL;DR
UniMoT is a unified molecule-text language model built on a tokenizer-based architecture: it converts molecules into discrete tokens so that the LLM can interpret and generate them like a foreign language, and it reports state-of-the-art results on molecule comprehension tasks.
Contribution
It introduces a Vector Quantization-driven tokenizer and a shared token representation for molecules and text, unifying modalities under a single model architecture.
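To make the tokenizer idea concrete, here is a minimal sketch of how vector quantization can turn continuous features into discrete ids that share one id space with text tokens. It is not UniMoT's implementation: the codebook, the feature vectors standing in for Q-Former outputs, the text vocabulary size, and the function names are all hypothetical, and codebook training is omitted.

```python
# Illustrative sketch only. All names, sizes, and values are hypothetical
# and are not taken from the UniMoT paper or code.

def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    ids = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def to_llm_token_ids(code_ids, text_vocab_size):
    """Place molecule codes after the text vocabulary so that text and
    molecule tokens live in a single id space."""
    return [text_vocab_size + i for i in code_ids]


TEXT_VOCAB_SIZE = 32000  # assumed size of the original text vocabulary
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy codebook

# Stand-in for the continuous features a molecule encoder would produce.
query_outputs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

code_ids = quantize(query_outputs, CODEBOOK)
molecule_token_ids = to_llm_token_ids(code_ids, TEXT_VOCAB_SIZE)
print(code_ids, molecule_token_ids)
```

In this toy setup each feature vector snaps to its closest codebook entry, and the resulting ids are shifted past the text vocabulary, so a single embedding table and a single next-token objective could in principle cover both modalities.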
Findings
Achieves state-of-the-art performance on molecule comprehension tasks.
Successfully performs molecule-to-text and text-to-molecule generation.
Demonstrates effective modality unification with a new tokenizer design.
Abstract
The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a…
Taxonomy
Topics: Machine Learning in Materials Science · Chemical Synthesis and Analysis
