From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

Yuhan Chen; Nuwa Xi; Yanrui Du; Haochun Wang; Jianyu Chen; Sendong Zhao; Bing Qin

arXiv:2309.05203 · cs.CL · March 6, 2024 · 2 cites

TL;DR

This paper shows that pseudo data generated by Large Language Models improves low-resource molecule discovery: using it for domain adaptation outperforms existing methods while requiring a smaller model, less data, and lower training cost.

Contribution

It introduces a retrieval-based prompting strategy to generate high-quality pseudo data and shows how to effectively leverage it for domain adaptation in low-resource molecule discovery.
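
The retrieval-based prompting idea can be illustrated with a minimal sketch: for a query molecule, retrieve the most similar annotated molecules from a small pool and place them in the prompt as few-shot examples before asking an LLM for a description. Everything specific here is an assumption for illustration and is not taken from the paper: the character-trigram Jaccard similarity over SMILES strings (a simple stand-in for whatever retrieval metric the authors use), the prompt wording, the hypothetical helper names `retrieve_examples` and `build_prompt`, and the toy pool. The LLM call itself is left out.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (SMILES string, text description)


def char_ngrams(smiles: str, n: int = 3) -> Set[str]:
    """Character n-grams of a SMILES string (assumed stand-in for a fingerprint)."""
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(query: str, pool: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pool entries whose SMILES are most similar to the query."""
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda p: jaccard(q, char_ngrams(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Pair]) -> str:
    """Assemble a few-shot prompt; the wording is hypothetical."""
    lines = ["Describe the molecule given its SMILES string.", ""]
    for smiles, description in examples:
        lines += [f"SMILES: {smiles}", f"Description: {description}", ""]
    lines += [f"SMILES: {query}", "Description:"]
    return "\n".join(lines)


# Toy annotated pool, for illustration only.
POOL: List[Pair] = [
    ("CCO", "Ethanol is a primary alcohol."),
    ("CCCO", "Propan-1-ol is a primary alcohol."),
    ("c1ccccc1", "Benzene is an aromatic hydrocarbon."),
]

if __name__ == "__main__":
    shots = retrieve_examples("CCCCO", POOL, k=2)
    print(build_prompt("CCCCO", shots))
```

The resulting prompt would be sent to an LLM, and the returned description paired with the query molecule as one pseudo data point. A chemistry-aware similarity, such as a fingerprint-based measure, would likely retrieve better examples than raw character n-grams.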

Findings

01 Pseudo data enhances domain adaptation performance

02 Method reduces model size and training cost

03 Performance improves with increased pseudo data volume

Abstract

Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model…

Code & Models

Repositories

SCIR-HI/ArtificiallyR2R
PyTorch · Official

Datasets

SCIR-HI/PseudoMD-1M
dataset · 6 dl

Videos

From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Machine Learning in Bioinformatics