Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions
Rui Sun, Lirong Wu, Haitao Lin, Yufei Huang, Stan Z. Li

TL;DR
This paper benchmarks existing protein data augmentation techniques, introduces two novel semantic-aware methods, and proposes an adaptive framework that improves performance across multiple protein-related tasks.
Contribution
It extends image and text augmentation methods to proteins, introduces new semantic augmentation techniques, and develops an adaptive framework for optimal augmentation selection.
Findings
The proposed adaptive framework, APA, improves performance by an average of 10.55% across tasks.
Semantic-aware augmentations enhance biological relevance.
Benchmark results demonstrate the effectiveness of the proposed methods.
Abstract
Augmentation is an effective alternative to utilize the small amount of labeled protein data. However, most of the existing work focuses on designing new architectures or pre-training tasks, and relatively little work has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution, which enable protein semantic-aware augmentation through saliency detection and biological knowledge. Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely…
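To make the idea of sequence-level protein augmentation concrete, here is a minimal sketch in Python. It is not the authors' implementation. `random_substitution` illustrates a text-style token replacement applied to residues, and `saliency_guided_substitution` is one plausible reading of a saliency-aware substitution. It assumes that per-residue saliency scores have already been computed elsewhere (for example with Integrated Gradients) and that low-saliency positions are the ones replaced. The function names, parameters, and replacement policy are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_substitution(seq, rate, rng):
    """Replace each residue with a different random residue with probability `rate`."""
    out = []
    for res in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != res]))
        else:
            out.append(res)
    return "".join(out)


def saliency_guided_substitution(seq, saliency, k, rng):
    """Substitute the k positions with the lowest saliency scores.

    Assumption: `saliency` holds one precomputed score per residue; which
    positions to replace and what to replace them with is a hypothetical
    policy chosen for illustration, not the paper's.
    """
    if len(saliency) != len(seq):
        raise ValueError("saliency must have one score per residue")
    order = sorted(range(len(seq)), key=lambda i: saliency[i])
    targets = set(order[:k])
    return "".join(
        rng.choice([a for a in AMINO_ACIDS if a != res]) if i in targets else res
        for i, res in enumerate(seq)
    )
```

Both functions take an explicit `random.Random` instance so that augmented sequences are reproducible, and both preserve sequence length, which keeps per-residue labels aligned with the augmented input.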
Taxonomy
Topics: Gene expression and cancer classification · Machine Learning in Bioinformatics · Genetics, Bioinformatics, and Biomedical Research
Methods: Adaptive Pseudo Augmentation
