# Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds

**Authors:** Mengdie Fan, Chenhui Sang, Hua Li, Yue Wei, Bin Zhang, Yang Xing, Jing Zhang, Jie Yin, Wei An, Bing Shao

PMC · DOI: 10.34133/research.0607 · Research · 2025-02-07

## TL;DR

This paper introduces MTSCAM, a machine learning model that predicts liquid chromatography retention times of organic compounds, reaching an R² of 0.98 and an average prediction error of 23 s, which the authors report as better than currently published models.

## Contribution

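The study presents a 3-tier, 141-class classification system based on functional group weighting, together with a data augmentation method (SMILES enumeration combined with structural similarity expansion), for training a high-throughput retention time prediction model.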

## Key findings

- The model achieved an R² of 0.98 and an average prediction error of 23 seconds.
- The model outperforms existing published retention time prediction methods.
- A dataset of 10,905 compounds was constructed and classified into a 3-tier hierarchy.
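
For readers who want to reproduce the two headline metrics on their own predictions, here is a minimal sketch. It assumes that "average prediction error" means mean absolute error in seconds, which this summary does not confirm; the function names and the retention times are hypothetical and are not taken from the paper.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average absolute deviation, in the same unit as the inputs (assumed: seconds)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty, equal-length sequences")
    return mean(abs(o - p) for o, p in zip(observed, predicted))


# Hypothetical retention times in seconds (illustrative only, not from the paper).
observed = [120.0, 300.0, 450.0, 600.0]
predicted = [130.0, 290.0, 470.0, 580.0]

r2 = r_squared(observed, predicted)
mae = mean_absolute_error(observed, predicted)
```

A high R² and a low mean absolute error measure different things: R² depends on the spread of the observed retention times, so it is worth reporting both, as the paper does.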

## Abstract

Accurate prediction of liquid chromatographic retention times is becoming increasingly important in nontargeted screening applications. Traditional approaches to determining retention times rely heavily on standard compounds, so they are limited by the speed at which standards can be synthesized and manufactured, and they are time-consuming and labor-intensive. Recently, machine learning and artificial intelligence algorithms have been applied to retention time prediction and show clear advantages over traditional experimental methods. However, existing retention time prediction methods usually suffer from the scarcity of comprehensive training datasets, sparsity of valid data, and lack of classification in datasets, resulting in poor generalization capability and accuracy. In this study, a dataset of 10,905 compounds and their retention times was constructed. Next, an innovative classification system was implemented, classifying the 10,905 compounds into a 3-tier hierarchy across 141 classes, based on functional group weighting. Then, data augmentation was performed within each category using simplified molecular input line entry system (SMILES) enumeration combined with structural similarity expansion. Finally, by training the optimal quantitative structure–retention relationship (QSRR) model for each category of compounds and selecting the best-fitting model via discriminant analysis at prediction time, a novel and universal high-throughput retention time prediction model was established. The results demonstrate that this model achieves an R² of 0.98 and an average prediction error of 23 s, outperforming currently published models. This study provides a scientific basis for high-throughput, rapid prediction of unknown pollutants, data mining, and nontargeted screening.
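
The per-category modeling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract says the category is chosen by discriminant analysis and that an optimal QSRR model is trained per category, but it gives no details of either. The sketch below assumes instead that the class label is already known and uses a single-descriptor least-squares line as a stand-in for the QSRR model. All names (`fit_line`, `train_per_class`, `predict_rt`), the `min_samples` threshold, the global fallback, the class labels, and the data are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # degenerate case: predict the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def train_per_class(records, min_samples=2):
    """records: (class_label, descriptor, retention_time_s) tuples.

    Returns (per_class_models, global_model). Classes with fewer than
    min_samples rows get no model of their own (an assumption of this sketch).
    """
    records = list(records)
    grouped = defaultdict(list)
    for label, x, rt in records:
        grouped[label].append((x, rt))
    models = {}
    for label, rows in grouped.items():
        if len(rows) >= min_samples:
            xs, ys = zip(*rows)
            models[label] = fit_line(xs, ys)
    global_model = fit_line([x for _, x, _ in records], [rt for _, _, rt in records])
    return models, global_model


def predict_rt(models, global_model, label, x):
    """Route to the class model; fall back to the global model otherwise."""
    a, b = models.get(label, global_model)
    return a * x + b


# Hypothetical training data (labels and values are illustrative only).
data = [
    ("alkaloids", 1.0, 100.0), ("alkaloids", 2.0, 200.0), ("alkaloids", 3.0, 300.0),
    ("lipids", 1.0, 400.0), ("lipids", 2.0, 450.0),
    ("lignans", 2.0, 250.0),  # too few samples: served by the global model
]
models, global_model = train_per_class(data)
rt_alkaloid = predict_rt(models, global_model, "alkaloids", 4.0)
rt_lipid = predict_rt(models, global_model, "lipids", 3.0)
rt_lignan = predict_rt(models, global_model, "lignans", 2.0)
```

The point of the design is that each class gets its own structure-to-retention mapping, so a slope that fits one chemical family is not forced onto another. In a fuller version, the `label` argument would come from a classifier rather than being supplied by hand.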

## Full-text entities

- **Genes:** FLI1 (Fli-1 proto-oncogene, ETS transcription factor) [NCBI Gene 2313] {aka BDPLT21, EWSR2, FLI-1, SIC-1}, NCOR2 (nuclear receptor corepressor 2) [NCBI Gene 9612] {aka CTG26, N-CoR2, SMAP270, SMRT, SMRTE, SMRTE-tau}, SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}
- **Diseases:** SE (MESH:C566332)
- **Chemicals:** formic acid (MESH:C030544), organoheterocyclic compounds (MESH:D006571), acetonitrile (MESH:C032159), alkaloids (MESH:D000470), Benzene (MESH:D001554), lignans (MESH:D017705), polyketides (MESH:D061065), lipid (MESH:D008055), Topo-AL (-), water (MESH:D014867), nitrogen (MESH:D009584)
- **Cell lines:** S2 — Drosophila melanogaster (Fruit fly), Spontaneously immortalized cell line (CVCL_Z232)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11803058/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11803058/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC11803058/full.md

---
Source: https://tomesphere.com/paper/PMC11803058