Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language
Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, and Davide Mottin

TL;DR
This paper presents LA$^3$, a framework that uses large language models to augment molecular annotations, significantly improving AI models for molecule translation and generation tasks in biological research.
Contribution
The paper introduces LA$^3$, a novel annotation augmentation method that enhances datasets and improves molecular translation models' performance.
Findings
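LA$^3$ improves dataset quality by rewriting annotations with more varied sentence structures and vocabulary while preserving essential molecular information.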
LaMolT5 trained on augmented data outperforms state-of-the-art models.
Up to 301% performance improvement with LA$^3$ augmentation.
Abstract
Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on…
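The augmentation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(molecule, annotation)` pair format, and the `toy_rewrite` stand-in for a large language model call are all assumptions made for illustration. The sketch only shows the data flow, in which each molecule keeps its original annotation and gains additional rewritten ones, so a translation model sees several phrasings of the same molecular information.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical pair format: (molecule string such as SMILES, annotation text).
Pair = Tuple[str, str]


def augment_annotations(
    pairs: Iterable[Pair],
    rewrite: Callable[[str], List[str]],
    keep_original: bool = True,
) -> List[Pair]:
    """Expand each (molecule, annotation) pair with rewritten annotations.

    `rewrite` is an assumed callable that returns alternative phrasings of
    an annotation; in practice it would wrap a large language model.
    """
    augmented: List[Pair] = []
    for molecule, annotation in pairs:
        seen = set()
        candidates = [annotation] if keep_original else []
        candidates += list(rewrite(annotation))
        for text in candidates:
            text = text.strip()
            # Skip empty outputs and exact duplicates for this molecule.
            if text and text not in seen:
                seen.add(text)
                augmented.append((molecule, text))
    return augmented


def toy_rewrite(annotation: str) -> List[str]:
    """Deterministic placeholder for an LLM rewriter (illustration only)."""
    return [annotation.replace("is a", "is classified as a"), annotation]


pairs = [("CCO", "Ethanol is a primary alcohol.")]
augmented = augment_annotations(pairs, toy_rewrite)
for molecule, text in augmented:
    print(molecule, "|", text)
```

Passing the rewriter in as a callable keeps the sketch independent of any particular model or API. A real rewriter would also need some check that the rewritten text still preserves the essential molecular information, which this toy version does not attempt.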
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Computational Drug Discovery Methods · Machine Learning in Materials Science
