# Optimizing Machine Learning-Based Prediction of Terrestrial Dissolved Organic Matter in the Ocean Using Fluorescence and LC-FTMS Data

**Authors:** Marlo Bareth, Boris P. Koch, Gabriel Zachmann, Xianyu Kong, Oliver J. Lechtenfeld, Sebastian Maneth

PMC · DOI: 10.1021/acsomega.5c02849 · ACS Omega · 2025-07-03

## TL;DR

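This study compares random forest, support vector machine, and generalized linear models for predicting a fluorescence-derived proxy for terrestrial dissolved organic matter in the ocean from LC-FTMS molecular formula data, and finds that a generalized linear model with sum normalization is the most accurate and computationally efficient.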

## Contribution

The study introduces a scalable ML approach for analyzing DOM chemistry using LC-FTMS data and identifies optimal preprocessing and modeling strategies.

## Key findings

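- A generalized linear model (GLM) with sum normalization achieved a normalized root-mean-square error (NRMSE) of 5.7%, close to the precision of the fluorescence measurement.
- Feature selection significantly improved the performance of all models; robust predictions were obtained using only ca. 2000 of the ca. 70,000 molecular features per sample.
- Random forest (RF) regression was slightly less accurate and required significantly more computation time than the GLM, but it was the most robust against data preprocessing and independent of linear correlations.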
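
To make the two quantities named in the findings concrete, here is a minimal Python sketch of sum normalization and NRMSE. It is an illustration under stated assumptions, not the authors' code: this summary does not say how the NRMSE is normalized, so the sketch assumes division by the observed range of the target (dividing by the mean is another common convention), and the function names and toy numbers are hypothetical.

```python
import math


def sum_normalize(intensities):
    """Scale one sample's peak intensities so that they sum to 1."""
    total = sum(intensities)
    if total == 0:
        raise ValueError("sample has zero total intensity")
    return [value / total for value in intensities]


def nrmse(y_true, y_pred):
    """RMSE divided by the range of y_true (assumed convention, see lead-in)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("y_true has zero range")
    return math.sqrt(mse) / value_range


# Toy data (hypothetical): three peak intensities and three proxy values.
sample = [200.0, 300.0, 500.0]
normalized = sum_normalize(sample)

observed = [0.0, 5.0, 10.0]
predicted = [1.0, 5.0, 9.0]
error = nrmse(observed, predicted)
print(normalized, error)
```

Sum normalization makes samples with different total signal comparable, because each feature becomes a fraction of the sample's summed intensity rather than an absolute count.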

## Abstract

Marine dissolved organic matter (DOM) is an extremely complex mixture of organic compounds that plays a crucial role in the global carbon cycle. In the Arctic, climate change accelerates the release of terrestrial organic carbon. Since chemical information is the only way to track DOM sources and fate, it is essential to improve analytical and data science approaches to assess the DOM composition. Here, we compare random forest (RF), support vector machines, and generalized linear models (GLM) to predict a fluorescence-derived proxy for terrestrial DOM based on molecular formula data from liquid chromatography coupled with Fourier transform mass spectrometry (LC-FTMS). We systematically evaluate different data preprocessing, normalization, and ML techniques to optimize prediction accuracy and computational efficiency. Our results show that a GLM with sum normalization provides the most accurate and efficient predictions, achieving a normalized root-mean-square error (NRMSE) of 5.7%, close to the precision of the fluorescence measurement. The prediction based on RF regression was slightly less accurate and required significantly more computation time compared to GLM, but it was most robust against data preprocessing and independent of linear correlations. Feature selection significantly improved the performance of all models, with robust predictions obtained using only ca. 2000 of the ca. 70,000 molecular features per sample. Additionally, we assessed the impact of chromatographic retention time on prediction accuracy and explored the key molecular features contributing to terrestrial DOM signatures using Shapley values and permutation importance (for RFs). Our study is a blueprint for the application of ML to enhance the analysis of high-resolution mass spectrometry data, offering a scalable approach for predicting information important for the understanding of marine DOM chemistry.
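
The abstract names permutation importance as one way to identify influential molecular features. The sketch below shows the general idea in plain Python: shuffle one feature column at a time and record how much the prediction error grows. It is a generic, model-agnostic illustration, not the authors' implementation. The toy predictor, the data, and all function names are hypothetical, and a simple linear function stands in for the random forest used in the paper.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy model (hypothetical): the prediction depends on feature 0 only.
def toy_predict(row):
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(8)]
y = [toy_predict(row) for row in X]
scores = permutation_importance(toy_predict, X, y)
print(scores)
```

In this toy case, shuffling feature 1 leaves the error unchanged because the predictor ignores it, while shuffling feature 0 raises the error. With tens of thousands of features per sample, repeating the shuffle for every feature is costly, which is one practical reason to apply feature selection first.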

## Full-text entities

- **Chemicals:** DOM (MESH:D000090422), organic carbon (-), carbon (MESH:D002244)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12268369/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12268369/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC12268369/full.md

---
Source: https://tomesphere.com/paper/PMC12268369