From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
Yuhan Chen, Nuwa Xi, Yanrui Du, Haochun Wang, Jianyu Chen, Sendong Zhao, Bing Qin

TL;DR
This paper demonstrates that using pseudo data generated by Large Language Models significantly improves low-resource molecule discovery, outperforming existing methods with less data and lower costs.
Contribution
It introduces a retrieval-based prompting strategy to generate high-quality pseudo data and shows how to effectively leverage it for domain adaptation in low-resource molecule discovery.
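The summary does not spell out how retrieval and prompting are combined, so the following is only a minimal sketch of the general idea: pick the annotated molecules most similar to a query and use them as few-shot examples in the prompt sent to an LLM. Everything here is an assumption made for illustration, not the authors' method: the helper names (`char_ngrams`, `jaccard`, `retrieve_examples`, `build_prompt`), the character-bigram Jaccard similarity over SMILES strings, the prompt wording, and the toy `POOL` data.

```python
from typing import List, Set, Tuple

# Hypothetical toy pool of (SMILES, description) pairs, for illustration only.
POOL: List[Tuple[str, str]] = [
    ("CCO", "Ethanol, a primary alcohol."),
    ("CC(=O)O", "Acetic acid, a simple carboxylic acid."),
    ("c1ccccc1", "Benzene, an aromatic hydrocarbon."),
]


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a string."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings over character bigrams.

    Assumption: a crude stand-in for a chemistry-aware similarity measure.
    """
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def retrieve_examples(
    query: str, pool: List[Tuple[str, str]], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k pool entries whose SMILES are most similar to the query."""
    ranked = sorted(pool, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from retrieved examples plus the query."""
    lines = ["Describe each molecule given its SMILES string.", ""]
    for smiles, description in examples:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Description: {description}")
        lines.append("")
    lines.append(f"SMILES: {query}")
    lines.append("Description:")
    return "\n".join(lines)


# Retrieve the two nearest annotated molecules for 1-propanol and build a prompt.
examples = retrieve_examples("CCCO", POOL, k=2)
prompt = build_prompt("CCCO", examples)
print(prompt)
```

The resulting `prompt` string would be passed to an LLM, and the completion paired with the query molecule to form one pseudo data point. The string-based similarity is chosen here only to keep the sketch dependency-free; a practical pipeline would more plausibly rank candidates with a structure-aware molecular similarity.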
Findings
Pseudo data enhances domain adaptation performance
Method reduces model size and training cost
Performance improves with increased pseudo data volume
Abstract
Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter data scarcity, which hampers their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model…
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Machine Learning in Bioinformatics
