Learning to Group Auxiliary Datasets for Molecule
Tinglin Huang, Ziniu Hu, Rex Ying

TL;DR
This paper introduces MolGroup, a method that predicts the most beneficial auxiliary datasets for small molecule learning by combining graph structure and task similarity, thereby improving target dataset performance.
Contribution
MolGroup separates dataset affinity into task affinity and structure affinity, and predicts how useful each auxiliary dataset is through a routing mechanism trained with bi-level optimization.
Findings
Achieves an average improvement of 4.41% with GIN and 3.47% with Graphormer.
Effectively predicts optimal auxiliary dataset combinations for target datasets.
Demonstrates efficiency and effectiveness across 11 molecule datasets.
Abstract
The limited availability of annotations in small molecule datasets presents a challenge to machine learning models. To address this, one common strategy is to collaborate with additional auxiliary datasets. However, having more data does not always guarantee improvements. Negative transfer can occur when the knowledge in the target dataset differs from or contradicts that of the auxiliary molecule datasets. In light of this, identifying the auxiliary molecule datasets that can benefit the target dataset when jointly trained remains a critical and unresolved problem. Through an empirical analysis, we observe that combining graph structure similarity and task similarity can serve as a more reliable indicator for identifying high-affinity auxiliary datasets. Motivated by this insight, we propose MolGroup, which separates the dataset affinity into task and structure affinity to predict the…
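The abstract's observation that structure similarity and task similarity together indicate affinity better than either alone can be illustrated with a small ranking sketch. This is not MolGroup's routing mechanism, which is learned through bi-level optimization; it uses a fixed weighted average as a stand-in. The `Candidate` class, the `alpha` weight, the dataset names, and the similarity scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical auxiliary dataset with two precomputed similarity scores."""
    name: str
    structure_sim: float  # assumed similarity of graph structure to the target, in [0, 1]
    task_sim: float       # assumed similarity of the prediction task to the target, in [0, 1]


def combined_affinity(c: Candidate, alpha: float = 0.5) -> float:
    """Weighted average of the two scores (an assumption, not the paper's formula)."""
    return alpha * c.structure_sim + (1.0 - alpha) * c.task_sim


def rank_candidates(cands, alpha: float = 0.5):
    """Sort candidates from highest to lowest combined affinity."""
    return sorted(cands, key=lambda c: combined_affinity(c, alpha), reverse=True)


# Made-up scores, for illustration only.
candidates = [
    Candidate("aux_a", structure_sim=0.9, task_sim=0.2),
    Candidate("aux_b", structure_sim=0.6, task_sim=0.7),
    Candidate("aux_c", structure_sim=0.3, task_sim=0.4),
]

ranking = [c.name for c in rank_candidates(candidates)]
structure_only = [c.name for c in rank_candidates(candidates, alpha=1.0)]
```

With these made-up scores, ranking by structure alone puts `aux_a` first, while the combined score prefers `aux_b`, which is moderately similar on both axes. A learned routing mechanism would replace the fixed `alpha` with parameters tuned against target-dataset performance.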
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Advanced biosensing and bioanalysis techniques
