Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, and Davide Mottin

arXiv:2502.06634·cs.LG·February 11, 2025

TL;DR

This paper presents LA$^3$, a framework that uses large language models to augment molecular annotations, significantly improving AI models for molecule translation and generation tasks in biological research.

Contribution

The paper introduces LA$^3$, a novel annotation augmentation method that enhances datasets and improves molecular translation models' performance.

Findings

01. LA$^3$ improves dataset quality by producing more varied annotations.

02. LaMolT5, trained on the augmented data, outperforms state-of-the-art models.

03. LA$^3$ augmentation yields performance improvements of up to 301%.

Abstract

Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on…
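The augmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Example` class, the `augment_annotations` function, and the `toy_rewrite` stand-in for an LLM call are all hypothetical names, and keeping the original annotation alongside its rewrites is an assumption rather than something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """One molecule paired with a natural-language annotation (hypothetical type)."""
    smiles: str
    annotation: str


def augment_annotations(
    dataset: Iterable[Example],
    rewrite: Callable[[str, str], str],
    n_rewrites: int = 1,
) -> List[Example]:
    """Return the original examples plus rewritten variants of each annotation.

    `rewrite` takes (smiles, annotation) and returns a rephrased annotation.
    In practice it would wrap an LLM call; here it is just any callable.
    """
    augmented: List[Example] = []
    for ex in dataset:
        augmented.append(ex)  # assumption: the original annotation is kept
        seen = {ex.annotation}
        for _ in range(n_rewrites):
            new_text = rewrite(ex.smiles, ex.annotation).strip()
            # Skip empty or duplicate rewrites so the dataset gains variety only.
            if new_text and new_text not in seen:
                seen.add(new_text)
                augmented.append(Example(ex.smiles, new_text))
    return augmented


def toy_rewrite(smiles: str, annotation: str) -> str:
    """Deterministic placeholder for an LLM rewrite (illustration only)."""
    return "This molecule can be described as follows: " + annotation


# Toy data for illustration, not taken from the paper's dataset.
toy_dataset = [
    Example("O", "Water is an oxygen hydride."),
    Example("CCO", "Ethanol is a primary alcohol."),
]
augmented = augment_annotations(toy_dataset, toy_rewrite, n_rewrites=2)
for ex in augmented:
    print(ex.smiles, "|", ex.annotation)
```

Because `toy_rewrite` is deterministic, the second rewrite of each annotation is a duplicate and is dropped; a real LLM-backed `rewrite` would typically return different phrasings on each call. The resulting molecule and annotation pairs could then be used to train a sequence-to-sequence model in either direction.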

Code & Models

Repositories

zhiqiangzhongddu/la3

Videos

Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Computational Drug Discovery Methods · Machine Learning in Materials Science
