Learning to Group Auxiliary Datasets for Molecule

Tinglin Huang; Ziniu Hu; Rex Ying

arXiv:2307.04052 · q-bio.BM · November 10, 2023 · 2 cites

TL;DR

This paper introduces MolGroup, a method that predicts which auxiliary datasets will most benefit a target small-molecule dataset when trained jointly. It makes this prediction by combining graph structure similarity with task similarity, with the goal of improving performance on the target dataset.

Contribution

MolGroup separates dataset affinity into task affinity and structure affinity, and predicts how useful each auxiliary dataset is through a routing mechanism trained with bi-level optimization.
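To make the idea of a bi-level optimized gate concrete, here is a minimal sketch on a scalar toy problem. It is not MolGroup's routing mechanism, whose details are not given on this page. It assumes a generic one-step-unrolled scheme: an inner step updates a shared parameter on the target loss plus gated auxiliary losses, and an outer step updates the gate logits using the target loss after that inner step. All constants, losses, and names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Toy quadratic losses (hypothetical): target loss is (w - 1)^2,
# auxiliary loss i is (w - AUX_OPTS[i])^2.
TARGET_OPT = 1.0
AUX_OPTS = [1.0, -3.0]   # aux 0 agrees with the target, aux 1 conflicts with it
INNER_LR, OUTER_LR = 0.1, 1.0

w = 0.0                  # shared model parameter
alphas = [0.0, 0.0]      # gate logits; gate_i = sigmoid(alpha_i)

for _ in range(200):
    gates = [sigmoid(a) for a in alphas]

    # Inner step: gradient descent on target loss + gated auxiliary losses.
    grad_w = 2 * (w - TARGET_OPT) + sum(
        g * 2 * (w - opt) for g, opt in zip(gates, AUX_OPTS)
    )
    w_next = w - INNER_LR * grad_w

    # Outer step: differentiate the target loss at w_next with respect to
    # each gate logit via the chain rule through the inner update.
    dloss_dw_next = 2 * (w_next - TARGET_OPT)
    for i, (g, opt) in enumerate(zip(gates, AUX_OPTS)):
        dw_next_dalpha = -INNER_LR * g * (1 - g) * 2 * (w - opt)
        alphas[i] -= OUTER_LR * dloss_dw_next * dw_next_dalpha

    w = w_next

final_gates = [sigmoid(a) for a in alphas]
print(final_gates)
```

In this toy setup the gate for the agreeing auxiliary task opens and the gate for the conflicting one closes, because only the former moves the shared parameter toward the target optimum. A real system would measure the outer loss on held-out target data rather than on the same quadratic.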

Findings

01 Achieves an average improvement of 4.41% with GIN and 3.47% with Graphormer.

02 Effectively predicts optimal auxiliary dataset combinations for target datasets.

03 Demonstrates efficiency and effectiveness across 11 molecule datasets.

Abstract

The limited availability of annotations in small molecule datasets presents a challenge to machine learning models. To address this, one common strategy is to collaborate with additional auxiliary datasets. However, having more data does not always guarantee improvements. Negative transfer can occur when the knowledge in the target dataset differs or contradicts that of the auxiliary molecule datasets. In light of this, identifying the auxiliary molecule datasets that can benefit the target dataset when jointly trained remains a critical and unresolved problem. Through an empirical analysis, we observe that combining graph structure similarity and task similarity can serve as a more reliable indicator for identifying high-affinity auxiliary datasets. Motivated by this insight, we propose MolGroup, which separates the dataset affinity into task and structure affinity to predict the…
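The observation that structure similarity and task similarity together are a more reliable indicator than either alone can be illustrated with a small ranking sketch. The combination rule (a simple product), the function name, the dataset names, and every similarity value below are placeholders chosen for illustration. They are not taken from the paper and do not reproduce how MolGroup computes affinity.

```python
from typing import Dict, List, Tuple


def rank_auxiliary(
    structure_sim: Dict[str, float],
    task_sim: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank auxiliary datasets by the product of two similarity scores.

    Both inputs map an auxiliary dataset name to its similarity with the
    target dataset, assumed to lie in [0, 1]. The product is an assumed
    combination rule, not the paper's.
    """
    shared = structure_sim.keys() & task_sim.keys()
    scores = {name: structure_sim[name] * task_sim[name] for name in shared}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores: aux_a looks best on structure alone, aux_c on task alone.
structure_sim = {"aux_a": 0.9, "aux_b": 0.8, "aux_c": 0.2}
task_sim = {"aux_a": 0.1, "aux_b": 0.7, "aux_c": 0.9}

ranking = rank_auxiliary(structure_sim, task_sim)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these placeholder values, the dataset that is reasonably similar on both axes ranks first, while the datasets that lead on only one axis fall behind it.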

Code & Models

Repositories

graph-and-geometric-learning/molgroup
pytorch

Videos

Learning to Group Auxiliary Datasets for Molecule · SlidesLive

Taxonomy

Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Advanced biosensing and bioanalysis techniques