A Large Language Model for Chemistry and Retrosynthesis Predictions
Yueqing Zhang, Wentao Liu, Yan Zhang, Danyang Xiong, Jihang Zhai, Hao Hao, YuCheng Gu, HaiBo Yang, Shuanhu Gao, Lianrui Hu, Aimin Zhou, Xiao He

TL;DR
This paper introduces ECNU-ChemGPT, a specialized large language model for chemistry and retrosynthesis, utilizing domain-specific data, prompt engineering, and multi-model scheduling to outperform existing models in chemical reasoning tasks.
Contribution
The paper presents a novel chemistry-specific LLM, ECNU-ChemGPT, with advanced training strategies and dynamic multi-model integration for improved chemical knowledge understanding and retrosynthesis prediction.
Findings
Achieves 68.3% Top-1 accuracy on the USPTO_50K retrosynthesis benchmark.
Reconstructs 13 complete experimental pathways for drug molecules.
Outperforms GPT-4 and other general models in chemistry tasks.
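For readers unfamiliar with the Top-1 metric quoted above, here is a minimal sketch of how Top-k accuracy is commonly computed for retrosynthesis: a test reaction counts as a hit if the ground-truth reactant set appears among the model's first k ranked predictions. This is not the authors' evaluation code. The function names, the toy SMILES strings, and the simplified `normalize` step (which only sorts dot-separated components and stands in for proper SMILES canonicalization with a cheminformatics toolkit) are all assumptions made for illustration.

```python
from typing import Sequence


def normalize(smiles: str) -> str:
    """Return an order-insensitive form of a dot-separated reactant set.

    Assumption: this is only a stand-in for real SMILES canonicalization.
    It strips whitespace and sorts components so "A.B" equals "B.A".
    """
    parts = [p.strip() for p in smiles.split(".") if p.strip()]
    return ".".join(sorted(parts))


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of targets found among the first k ranked predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, targets):
        truth_norm = normalize(truth)
        # A hit requires an exact match after normalization within the top k.
        if any(normalize(p) == truth_norm for p in ranked[:k]):
            hits += 1
    return hits / len(targets)


# Hypothetical toy data: ranked reactant predictions for three products.
predictions = [
    ["CC(=O)O.OCC", "CC(=O)Cl.OCC"],  # first guess matches (component order differs)
    ["c1ccccc1Br", "c1ccccc1I"],      # correct answer is ranked second
    ["CCO"],                          # no match
]
targets = ["OCC.CC(=O)O", "c1ccccc1I", "CCN"]

print(f"Top-1: {top_k_accuracy(predictions, targets, k=1):.3f}")
print(f"Top-2: {top_k_accuracy(predictions, targets, k=2):.3f}")
```

On this toy set only the first example is a Top-1 hit, while the second becomes a hit once k is raised to 2. Reported benchmark numbers typically depend on the canonicalization and matching rules used, so figures from different evaluations are only comparable when those rules agree.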
Abstract
Large language models (LLMs) have achieved impressive progress across a broad range of general-purpose tasks, but their effectiveness in chemistry remains limited due to scarce domain-specific datasets and the demand for precise symbolic and structural reasoning. Here we introduce ECNU-ChemGPT (named after East China Normal University), a chemistry-specialized LLM engineered for deep chemical knowledge understanding and accurate retrosynthetic route planning. Our approach is distinguished by four key strategies: structured prompt-based knowledge distillation from authoritative chemistry textbooks to construct a high-quality question-answering dataset; domain-specific prompt engineering using curated chemical keywords, combined with LLM APIs for data derivation and knowledge distillation; large-scale fine-tuning on a meticulously cleaned and enriched Pistachio reaction dataset to enhance…
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Advanced Graph Neural Networks
