# CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences

**Authors:** Tao Zhang, Paul Auer, Stephen R. Spellman, Jing Dong, Wael Saber, Yung-Tsi Bolon

PMC · DOI: 10.3390/life15060929 · 2025-06-09

## TL;DR

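CYTO-SV-ML is a machine learning pipeline that identifies large (≥ 1 Mb) somatic structural variants in whole genome sequencing data. In validation against clinical cytogenetic records it identified more somatic cytogenetic SVs than a conventional SV calling pipeline (207 vs. 143), and it uncovered novel variants in 49 of 55 patients (89%) who lacked successful clinical cytogenetic results.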

## Contribution

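A Snakemake-automated machine learning pipeline with a user interface that identifies large somatic (cytogenetic) structural variants in WGS data and that outperformed a conventional SV calling pipeline in an independent validation against clinical cytogenetic records.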

## Key findings

- CYTO-SV-ML achieved an AUCROC of 0.94 for translocations and 0.92 for non-translocations in classifying somatic SVs.
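- In an independent validation against clinical cytogenetic records, the tool identified 207 somatic cytogenetic SVs, compared with 143 for a conventional SV calling pipeline.
- CYTO-SV-ML uncovered novel somatic cytogenetic SVs in 49 of 55 patients (89%) who had no successful clinical cytogenetic results.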
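
The counts behind these findings can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract. The variable names are illustrative, and the relative gain is a derived figure that the paper does not itself report.

```python
# Counts as stated in the abstract; variable names are illustrative.
patients_with_novel_svs = 49
patients_without_cytogenetic_results = 55
cyto_sv_ml_svs = 207
conventional_pipeline_svs = 143

# Share of patients without successful cytogenetic results in whom novel SVs were found.
novel_fraction = patients_with_novel_svs / patients_without_cytogenetic_results

# Derived comparison (not a figure reported in the paper).
additional_svs = cyto_sv_ml_svs - conventional_pipeline_svs
relative_gain = additional_svs / conventional_pipeline_svs

print(f"{novel_fraction:.1%}")  # 89.1%
print(additional_svs)           # 64
print(f"{relative_gain:.1%}")   # 44.8%
```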

## Abstract

(1) Background: Although whole genome sequencing (WGS) has enabled comprehensive analysis of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb), traditionally detected through cytogenetic testing, from germline SVs. (2) Methods: A customized machine learning pipeline (CYTO-SV-ML), automated as a Snakemake workflow and equipped with a user interface, was developed to identify somatic cytogenetic SVs in WGS data. The tool was applied to characterize structural variation profiles in the whole blood of patients with myelodysplastic syndromes (MDSs). Known SVs mapped from well-established open databases were split into training and validation subsets for the AUTO-ML model in the CYTO-SV-ML pipeline. (3) Results: In benchmarking, the CYTO-SV-ML pipeline classified somatic cytogenetic SVs with an area under the receiver operating characteristic curve (AUCROC) of 0.94 for translocations and 0.92 for non-translocations, a sensitivity of 0.83 for translocations and 0.85 for non-translocations, and a specificity of 0.96 for translocations and 0.82 for non-translocations. In an independent validation against clinical cytogenetic records, our method (207 somatic cytogenetic SVs) outperformed a conventional SV calling pipeline (143 somatic cytogenetic SVs). In addition, the CYTO-SV-ML pipeline uncovered novel somatic cytogenetic SVs in 49 (89%) of 55 patients without successful clinical cytogenetic results. (4) Conclusions: Our study demonstrates the high benchmarking performance of the CYTO-SV-ML machine learning approach for SV classification from genomic sequencing data; further validation of novel anomalies by orthogonal methods will be essential to unlock its full clinical potential in cytogenetic diagnostics.
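
For readers less familiar with the metrics reported above, the following is a minimal sketch of the standard definitions of sensitivity, specificity, and AUCROC for a binary somatic-versus-germline classifier. It is not the paper's evaluation code: the function names, the 0.5 threshold, and the toy labels and scores are all hypothetical and serve only to illustrate the definitions.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for scores thresholded at `threshold`.

    Label 1 stands for a somatic SV, label 0 for a germline SV (assumed encoding).
    """
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted_somatic = score >= threshold
        if label == 1:
            tp += predicted_somatic
            fn += not predicted_somatic
        else:
            fp += predicted_somatic
            tn += not predicted_somatic
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUCROC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of the ROC area).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the paper.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

sens, spec = sensitivity_specificity(toy_labels, toy_scores)
auc = auc_roc(toy_labels, toy_scores)
print(sens, spec, auc)  # 0.75 0.75 0.8125
```

Sensitivity and specificity depend on the chosen score threshold, whereas AUCROC summarizes ranking quality across all thresholds, which is why the abstract reports all three.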

## Linked entities

- **Diseases:** myelodysplastic syndromes (MONDO:0018881)

## Full-text entities

- **Diseases:** MDSs (MESH:D009190)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12194788/full.md

---
Source: https://tomesphere.com/paper/PMC12194788