# Predicting oil contamination in water using machine learning on microbial compositions

**Authors:** Tong Gao, Isaac Bigcraft, Stephen Techtmann, Issei Nakamura

PMC · DOI: 10.1371/journal.pone.0344571 · PLOS One · 2026-03-19

## TL;DR

This paper introduces a machine learning framework that predicts oil contamination in water from microbial community composition. It reaches an R² of up to 0.99 in training and stress testing but generalizes poorly to held-out bottles (mean test R² = −0.150).

## Contribution

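A compact machine learning framework that combines dimensionality reduction, noise-injection data augmentation with an augmented data neural network (ADNN), and generative modeling with a variational autoencoder (VAE) to predict oil contamination from microbial community compositions.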

## Key findings

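- Random-forest feature importance outperformed PCA and t-SNE for reducing the 503-dimensional bacterial composition dataset, and it identified the species most strongly correlated with oil contamination.
- Using the top 3–10 bacterial features, the model achieved an R² of up to 0.99 in both training and VAE-based stress testing.
- Generalization was limited: across 22 bottle-level hold-out splits (80/20 bottle ratio), performance on held-out bottles was lower and variable (mean test R² = −0.150).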
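
For readers wondering how a test R² can be negative: assuming the paper reports the standard coefficient of determination (the summary does not state the exact definition), R² compares the model's squared error with that of a baseline that always predicts the mean of the observed values. The symbols below are illustrative and not taken from the paper.

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

Here $y_i$ is the measured contamination, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the measured values in the evaluated set. R² falls below zero whenever the numerator exceeds the denominator, so under this definition a mean test R² of −0.150 would mean that, averaged over splits, predictions on held-out bottles were somewhat worse than the constant mean baseline.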

## Abstract

We present a compact and generative machine-learning framework that predicts oil contamination based on microbial community compositions from experimental samples. Our method combines dimensionality reduction with data augmentation and generative modeling to address high-dimensional, non-linear, and sparse microbial data. To reduce the 503-dimensional bacterial composition dataset, we compared three dimensionality reduction techniques: feature importance from random forest, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). Feature importance outperformed PCA and t-SNE, improving predictive performance and identifying microbial species most strongly correlated with oil contamination. To mitigate data scarcity, we augmented the training data using an augmented data neural network (ADNN) with noise injection. Samples generated by a variational autoencoder (VAE) were used as controlled perturbations to probe model robustness during stress testing. Using the top 3–10 bacterial features, our model achieved an R² value of up to 0.99 in both training and stress testing for predicting oil contamination from microbial data. In a bottle-level hold-out evaluation (22 splits at an 80/20 bottle ratio), performance on held-out bottles was lower and variable (mean test R² = −0.150), indicating limited generalization within this cohort. These results should be interpreted as a feasibility demonstration requiring validation on larger independent datasets.
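
The bottle-level hold-out evaluation described in the abstract can be illustrated with a minimal sketch. It assumes each sample carries a bottle identifier and that all samples from one bottle must land on the same side of the split. The function names (`bottle_level_split`, `r2_score`), the toy bottle labels, and the rounding rule for the number of test bottles are hypothetical and are not taken from the paper, whose exact splitting procedure is not given in this summary.

```python
import random
from collections import defaultdict


def bottle_level_split(bottle_ids, test_fraction=0.2, seed=0):
    """Split sample indices so each bottle is entirely in train or in test.

    Hypothetical helper: the paper's own split procedure may differ.
    """
    by_bottle = defaultdict(list)
    for idx, bottle in enumerate(bottle_ids):
        by_bottle[bottle].append(idx)

    bottles = sorted(by_bottle)               # deterministic starting order
    random.Random(seed).shuffle(bottles)      # seeded shuffle of bottles, not samples
    n_test = max(1, round(test_fraction * len(bottles)))

    test_idx = sorted(i for b in bottles[:n_test] for i in by_bottle[b])
    train_idx = sorted(i for b in bottles[n_test:] for i in by_bottle[b])
    return train_idx, test_idx


def r2_score(y_true, y_pred):
    """Standard coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: five bottles with two samples each (labels are made up).
bottle_ids = ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4", "b5", "b5"]
train_idx, test_idx = bottle_level_split(bottle_ids, test_fraction=0.2, seed=0)

train_bottles = {bottle_ids[i] for i in train_idx}
test_bottles = {bottle_ids[i] for i in test_idx}

# Predictions that are worse than the mean baseline give a negative R².
r2_bad = r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
r2_mean = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Repeating the split with different seeds (the abstract reports 22 splits) and averaging the test R² gives a bottle-level estimate of generalization. Splitting by bottle rather than by sample keeps samples from the same bottle from appearing in both training and test sets, which would otherwise tend to inflate the test score.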

## Full-text entities

- **Chemicals:** water (MESH:D014867), hydrocarbon (MESH:D006838), Oil (MESH:D009821), BN (-)
- **Species:** Alcanivorax (genus) [taxon 59753], Homo sapiens (human, species) [taxon 9606], Bacteria (domain) [taxon 2]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13001938/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13001938/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC13001938/full.md

---
Source: https://tomesphere.com/paper/PMC13001938