Application of data engineering approaches to address challenges in microbiome data for optimal medical decision-making
Isha Thombre, Pavan Kumar Perepu, Shyam Kumar Sudhakar

TL;DR
This study applies data engineering techniques like SMOTE and PCA to microbiome data, improving classification accuracy and efficiency, thereby aiding personalized medicine through better handling of high-dimensional, imbalanced microbiome datasets.
Contribution
It introduces a data engineering pipeline combining SMOTE and PCA to enhance machine learning classification of microbiome data, addressing class imbalance and high dimensionality.
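The oversampling half of that pipeline can be illustrated with a minimal, standard-library-only sketch of the SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration and not the authors' implementation. The function name `smote`, its parameters, and the toy abundance values are hypothetical; a real analysis would more likely use an established library and would follow the oversampling step with PCA.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours.

    Simplified sketch: the base sample is drawn at random for every new row.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new row lies on the segment between the base sample and its neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy relative-abundance rows (hypothetical values, not taken from the paper).
majority = [[0.70, 0.20, 0.10], [0.65, 0.25, 0.10], [0.72, 0.18, 0.10],
            [0.68, 0.22, 0.10], [0.66, 0.24, 0.10], [0.71, 0.19, 0.10]]
minority = [[0.20, 0.50, 0.30], [0.25, 0.45, 0.30], [0.22, 0.48, 0.30]]

# Generate just enough synthetic rows to match the majority-class count.
new_rows = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_rows
```

Because every synthetic row is a convex combination of two real minority rows, it stays inside the range spanned by the observed minority samples. Oversampling is normally applied to the training split only, so that synthetic rows never leak into the test set.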
Findings
The ensemble classifiers, random forests (RF) and extreme gradient boosting (XGB), outperform logistic regression and support vector machines in accuracy.
PCA reduces testing time significantly.
Highest accuracy achieved at species level.
Abstract
The human gut microbiota is known to contribute to numerous physiological functions of the body and is also implicated in a myriad of pathological conditions. Prolific research work in the past few decades has yielded valuable information regarding the relative taxonomic distribution of gut microbiota. Unfortunately, microbiome data suffer from class imbalance and high dimensionality, both of which must be addressed. In this study, we have implemented data engineering algorithms to address these issues, which are inherent to microbiome data. Four standard machine learning classifiers (logistic regression (LR), support vector machines (SVM), random forests (RF), and extreme gradient boosting (XGB) decision trees) were implemented on a previously published dataset. The issues of class imbalance and high dimensionality of the data were addressed through synthetic minority oversampling…
Taxonomy
Topics: Gene expression and cancer classification · Machine Learning in Healthcare · AI in cancer detection
Methods: Principal Components Analysis
