Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome
Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham

TL;DR
This paper introduces data augmentation strategies tailored to compositional microbiome data. The augmentations improve predictive performance in supervised learning, underpin a new contrastive learning model, and set a new state of the art on disease prediction tasks.
Contribution
The paper defines novel augmentation methods for compositional (simplex-valued) data, grounded in the Aitchison geometry of the simplex and in subcompositions, and applies them to microbiome prediction and representation learning.
Findings
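Achieved state-of-the-art results on key disease prediction tasks, including colorectal cancer and type 2 diabetes.
Delivered consistent performance gains when added to standard supervised learning pipelines, across a wide range of standard benchmark datasets.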
Developed a contrastive learning approach for microbiome data.
Abstract
Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes,…
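The two principles named in the abstract, the Aitchison geometry of the simplex and subcompositions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight `lam`, the keep probability `keep_prob`, and the choice to zero out dropped parts rather than delete them are all assumptions made for illustration. Inputs are assumed to be strictly positive (for example, after adding a pseudocount).

```python
import numpy as np


def closure(x):
    """Rescale a non-negative vector so it sums to 1 (a point on the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def aitchison_combination(x1, x2, lam):
    """Hypothetical helper: interpolate two strictly positive compositions.

    A straight line in Aitchison geometry is a weighted geometric mean
    of the two compositions, followed by closure.
    """
    log_mix = lam * np.log(x1) + (1.0 - lam) * np.log(x2)
    return closure(np.exp(log_mix))


def random_subcomposition(x, keep_prob, rng):
    """Hypothetical helper: keep a random subset of parts and re-close.

    Dropped parts are set to zero so the vector keeps its length
    (an assumption, convenient for fixed-size model inputs).
    """
    mask = rng.random(x.shape[-1]) < keep_prob
    if not mask.any():
        # Always keep at least one part so the closure is well defined.
        mask[rng.integers(x.shape[-1])] = True
    return closure(np.where(mask, x, 0.0))


# Two toy compositions with four parts each.
x1 = closure([1.0, 2.0, 3.0, 4.0])
x2 = closure([4.0, 3.0, 2.0, 1.0])

rng = np.random.default_rng(0)
mixed = aitchison_combination(x1, x2, lam=0.5)
sub = random_subcomposition(x1, keep_prob=0.5, rng=rng)
```

Both outputs remain valid compositions (non-negative, summing to 1), and the subcomposition preserves the ratios between the parts it keeps, which is the property that makes subcompositions meaningful in compositional data analysis.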
Taxonomy
Topics: Oral microbiology and periodontitis research · Dental Radiography and Imaging · HIV/AIDS oral health manifestations
Methods: Contrastive Learning
