MolTextNet: A Two-Million Molecule-Text Dataset for Multimodal Molecular Learning
Yihan Zhu, Gang Liu, Eric Inae, Meng Jiang

TL;DR
MolTextNet is a large-scale, high-quality molecule-text dataset designed to enhance multimodal molecular learning, enabling better property prediction and structure retrieval through pretraining models on diverse molecular descriptions.
Contribution
We introduce MolTextNet, a 2.5 million molecule-text pair dataset with synthetic, structured descriptions, facilitating advanced multimodal models in molecular science.
Findings
Pretraining models on MolTextNet improves molecular property prediction.
The dataset enables effective structure retrieval tasks.
Synthetic descriptions enhance model understanding of molecular features.
Abstract
Small molecules are essential to drug discovery, and graph-language models hold promise for learning molecular properties and functions from text. However, existing molecule-text datasets are limited in scale and informativeness, restricting the training of generalizable multimodal models. We present MolTextNet, a dataset of 2.5 million high-quality molecule-text pairs designed to overcome these limitations. To construct it, we propose a synthetic text generation pipeline that integrates structural features, computed properties, bioactivity data, and synthetic complexity. Using GPT-4o-mini, we create structured descriptions for 2.5 million molecules from ChEMBL35, with text over 10 times longer than prior datasets. MolTextNet supports diverse downstream tasks, including property prediction and structure retrieval. Pretraining CLIP-style models with Graph Neural Networks and ModernBERT…
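"CLIP-style" pretraining usually means a symmetric contrastive objective over paired embeddings. The abstract does not state the exact loss, so the following is only a minimal sketch under that assumption. The function name `clip_loss`, the temperature of 0.07, and the use of plain Python lists in place of GNN and ModernBERT encoder outputs are all illustrative choices, not details from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(graph_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is a matched pair.

    graph_embs / text_embs stand in for encoder outputs (assumption);
    temperature=0.07 is an illustrative default, not a value from the paper.
    """
    if not graph_embs or len(graph_embs) != len(text_embs):
        raise ValueError("need equally sized, non-empty batches")
    g = [_normalize(v) for v in graph_embs]
    t = [_normalize(v) for v in text_embs]
    n = len(g)
    # Cosine similarity between every graph and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t] for gi in g]
    # Graph-to-text direction: the correct text for graph i is at column i.
    g2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-graph direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2g = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (g2t + t2g)
```

With matched pairs on the diagonal, the loss approaches zero as paired embeddings align and unpaired ones diverge; swapping the texts in a batch drives it up.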
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. MolTextNet is impressive in size (~2 million molecule-caption pairs) and could serve as a valuable resource for pre-training large multimodal models in chemistry and biomedicine.
2. The data sourcing and pipeline design are clearly explained.
3. The design considerations are reasonable: the dataset integrates a wide range of structural and chemical properties with functional annotations.
1. Reliability and hallucination concerns: A significant portion of captions are generated using ChatGPT, yet the paper provides no systematic validation or human evaluation of correctness. Prior work (e.g., MolTextQA) has reported high hallucination rates, which raises serious concerns about factual accuracy. Larger synthetic datasets without validation may introduce noise rather than provide meaningful supervision.
2. The dataset is not available for access, and the supplementary materials do no
- The paper addresses a clear and widely acknowledged bottleneck in multimodal molecular ML: the lack of large-scale, high-quality, and informative molecule-text data. A dataset of 2.5 million pairs with rich descriptions is a substantial contribution that could unlock new modeling capabilities.
- Good empirical results
- The most significant concern is that the entire text corpus is generated by an LLM. The paper trains models to align molecular graphs with a model's interpretation of chemical data, not with human-generated scientific text.
- The quality control section (3.3) is procedurally robust but semantically weak. The authors check for valid SMILES, deduplicate entries, and filter based on length or missing fields. However, there is no human expert validation of the generated text.
- The generator LLM
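The procedural checks this review lists (valid SMILES, deduplication, length and missing-field filters) can be sketched as below. This is a hedged illustration, not the authors' pipeline: the field names `smiles` and `description`, the word-count thresholds, and the helper names are hypothetical. `crude_smiles_check` only tests bracket balance and is a placeholder; a real pipeline would presumably parse and canonicalize SMILES with a cheminformatics toolkit, which can be supplied through the `is_valid_smiles` argument.

```python
REQUIRED_FIELDS = ("smiles", "description")  # hypothetical field names


def crude_smiles_check(smiles):
    """Placeholder validity test: no whitespace, balanced () and [].

    This is NOT real SMILES validation; it only stands in for a toolkit parser.
    """
    if not smiles or any(ch.isspace() for ch in smiles):
        return False
    paren = bracket = 0
    for ch in smiles:
        if ch == "(":
            paren += 1
        elif ch == ")":
            paren -= 1
        elif ch == "[":
            bracket += 1
        elif ch == "]":
            bracket -= 1
        if paren < 0 or bracket < 0:
            return False
    return paren == 0 and bracket == 0


def filter_pairs(records, is_valid_smiles=crude_smiles_check,
                 min_words=20, max_words=2000):
    """Keep records that pass all checks; thresholds are illustrative only."""
    seen = set()
    kept = []
    for rec in records:
        # Missing-field filter: every required field must be present and non-empty.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue
        smiles = rec["smiles"]
        if not is_valid_smiles(smiles):
            continue
        # Deduplicate on the raw string (a real pipeline would canonicalize first).
        if smiles in seen:
            continue
        n_words = len(rec["description"].split())
        if not (min_words <= n_words <= max_words):
            continue
        seen.add(smiles)
        kept.append(rec)
    return kept
```

Checks of this kind confirm that a record is well formed, but none of them inspects whether the description is chemically correct, which is the gap the review points to.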
1. Presents the largest molecule–text dataset so far, combining structure, property, and synthesis aspects.
2. The data pipeline is transparent and well-documented, facilitating reproducibility.
3. Empirical results confirm consistent improvements over prior datasets in multiple downstream benchmarks.
1. The novelty is primarily engineering-driven (data expansion) rather than conceptual.
2. The paper mainly focuses on CLIP-style retrieval; broader evaluations (e.g., text-to-molecule generation or reasoning tasks) are missing.
3. No systematic human evaluation or statistical verification of data correctness is reported.
4. Limited exploration of biases across molecular domains.
- MolTextNet reports about 2.5M pairs with far longer descriptions than PubChem-300K and ChEBI-20, with explicit attempts to include structure, property, and synthesis dimensions. This improves lexical and conceptual grounding for multimodal pretraining.
- The paper details how SMILES and compound names are validated, how properties and assays are collected and normalized, how numeric tokens are preserved, and how chunking is handled for ultra-long entries. This level of detail supports reuse.
- The paper relies on post-generation rules (length, token consistency) but does not quantify error rates for numeric values, assay units, or misattributed functional groups. A small-scale human or programmatic audit (e.g., SMILES-parsed substructure counts vs. described counts, unit conversions) would strengthen trust in the synthetic text. (major)
- The zero-shot retrieval task is built from simple functional-group templates; it does not test richer assay semantics, property trends, or synthe
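The programmatic audit suggested in this review (structure-derived counts vs. described counts) could look roughly like the sketch below. It is an assumption-laden illustration, not something from the paper: the function names are hypothetical, the counts in `parsed_counts` are taken as given (in practice they would presumably come from substructure matching on the parsed SMILES with a cheminformatics toolkit), and the regular expression only recognizes simple phrasings such as "two hydroxyl groups" or "3 amide bonds".

```python
import re

# Number words the sketch understands; anything else is ignored.
_NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def described_count(text, group):
    """Return the count the text claims for `group`, or None if not stated."""
    pattern = re.compile(
        r"\b(\d+|" + "|".join(_NUMBER_WORDS) + r")\s+" + re.escape(group) + r"s?\b",
        flags=re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else _NUMBER_WORDS[token]


def audit_description(text, parsed_counts):
    """Map each group to (claimed, structure-derived) where the two disagree.

    Groups the text does not quantify are skipped rather than flagged.
    """
    mismatches = {}
    for group, true_count in parsed_counts.items():
        claimed = described_count(text, group)
        if claimed is not None and claimed != true_count:
            mismatches[group] = (claimed, true_count)
    return mismatches
```

Running such a check over a random sample of pairs would give a rough error rate for stated functional-group counts, though it would miss errors phrased in ways the pattern does not cover.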
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Chemical Synthesis and Analysis
