# A scalable distributed pipeline for reference-free variants calling

**Authors:** Lorenzo Di Rocco, Umberto Ferraro Petrillo

PMC · DOI: 10.1186/s12864-025-11722-7 · BMC Genomics · 2025-06-03

## TL;DR

This paper introduces a distributed pipeline that detects isolated SNPs without a reference genome, using a De Bruijn graph partitioned across multiple machines.

## Contribution

The novelty is a scalable, distributed pipeline for reference-free SNP calling using a cluster-driven De Bruijn graph partitioning algorithm.

## Key findings

- The pipeline handles large datasets by distributing the De Bruijn graph representation across multiple machines.
- A cluster-driven partitioning algorithm, which follows the inner structure of the sequences under analysis, yields a speed-up over standard partitioning techniques.
- Experiments on real-world datasets show good efficiency, output quality, and scalability.

## Abstract

Precision medicine pipelines typically begin with variant calling to identify disease-related mutations for optimal treatment selection. Reference-free approaches assess variations in the genetic profiles of distinct individuals using a De Bruijn graph rather than a reference genome. However, the timely analysis of large-scale sequencing data may be beyond the capabilities of a single workstation, requiring alternative computational approaches.
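To make the reference-free idea concrete, the sketch below builds a small De Bruijn graph whose nodes are k-mers and looks for "bubbles": a node with two successors whose branches each run for exactly k nodes and then rejoin, which is the signature an isolated single-base difference leaves in the graph. This is an illustrative single-machine toy, not the authors' pipeline. The function names (`build_graph`, `find_isolated_snp_bubbles`) are hypothetical, and the sketch assumes error-free reads and ignores reverse complements and k-mer abundance filtering, which a real tool would need.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Map each k-mer to the sorted list of k-mers that overlap it by k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    return {
        n: sorted(n[1:] + b for b in "ACGT" if n[1:] + b in nodes)
        for n in nodes
    }


def walk_simple_path(graph, start, length):
    """Follow unique successors from start; return `length` nodes, or None if the path branches or ends."""
    path = [start]
    node = start
    for _ in range(length - 1):
        successors = graph[node]
        if len(successors) != 1:
            return None
        node = successors[0]
        path.append(node)
    return path


def path_to_seq(path):
    """Spell the sequence covered by a path of overlapping k-mers."""
    return path[0] + "".join(n[-1] for n in path[1:])


def find_isolated_snp_bubbles(graph, k):
    """Return (allele_a, allele_b) pairs; each allele has length 2k-1 with the variant base in the middle."""
    bubbles = []
    for node, successors in graph.items():
        if len(successors) != 2:
            continue  # only a two-way fork can open a SNP bubble
        a = walk_simple_path(graph, successors[0], k)
        b = walk_simple_path(graph, successors[1], k)
        if a is None or b is None:
            continue
        end_a, end_b = graph[a[-1]], graph[b[-1]]
        if len(end_a) == 1 and end_a == end_b:  # both branches rejoin at the same node
            bubbles.append((path_to_seq(a), path_to_seq(b)))
    return bubbles


if __name__ == "__main__":
    # Two made-up sequences that differ at a single position (C vs G).
    reads = ["ACGTACTGGAT", "ACGTAGTGGAT"]
    graph = build_graph(reads, k=4)
    print(find_isolated_snp_bubbles(graph, k=4))
```

With the two example reads, the only fork is at `CGTA`, and the two branches spell `GTACTGG` and `GTAGTGG`, which differ only at their middle base.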

We introduce the first known distributed pipeline for detecting isolated SNPs (Single Nucleotide Polymorphisms) by leveraging the computational resources of multiple machines in parallel. Our pipeline efficiently analyzes large datasets thanks to a distributed De Bruijn graph representation. Furthermore, we introduce a cluster-driven algorithm that partitions the De Bruijn graph across multiple independent machines according to the inner structure of the sequences under analysis, thus further improving the scalability of our pipeline.

The results of our experiments, conducted on real-world datasets, show that our pipeline performs well in terms of efficiency, output quality, and scalability. They also confirm that adopting a specialized partitioning algorithm for the distributed representation of the De Bruijn graph leads to a notable speed-up compared to standard partitioning techniques.
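The difference between standard and structure-aware partitioning can be illustrated with a small sketch. The summary does not describe the authors' cluster-driven algorithm, so the code below does not implement it. As a stand-in, it compares plain hash partitioning of k-mers with minimizer-based partitioning, a well-known technique that tends to place overlapping k-mers on the same partition. All names (`hash_partition`, `minimizer_partition`, `cut_edges`) and the parameters `n_parts` and `m` are assumptions made for illustration.

```python
import zlib


def kmers(seq, k):
    """Return all k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def stable_hash(s):
    """Deterministic hash; Python's built-in str hash is randomized per process."""
    return zlib.crc32(s.encode("ascii"))


def hash_partition(kmer, n_parts):
    """Baseline: assign a k-mer to a partition from its own hash."""
    return stable_hash(kmer) % n_parts


def minimizer(kmer, m):
    """Lexicographically smallest m-mer contained in the k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizer_partition(kmer, n_parts, m):
    """Stand-in structure-aware scheme: k-mers sharing a minimizer share a partition."""
    return stable_hash(minimizer(kmer, m)) % n_parts


def cut_edges(seq, k, assign):
    """Count consecutive k-mer pairs of seq that land on different partitions."""
    ks = kmers(seq, k)
    return sum(assign(a) != assign(b) for a, b in zip(ks, ks[1:]))


if __name__ == "__main__":
    seq = "ACGTACTGGATTCAGGCTA"  # made-up sequence, for illustration only
    k, m, n_parts = 7, 3, 4
    by_hash = cut_edges(seq, k, lambda x: hash_partition(x, n_parts))
    by_minimizer = cut_edges(seq, k, lambda x: minimizer_partition(x, n_parts, m))
    print("edges cut by hash partitioning:     ", by_hash)
    print("edges cut by minimizer partitioning:", by_minimizer)
```

Every cut edge corresponds to a graph traversal step that would cross machines, which is why a partitioning that follows sequence structure can reduce communication. On a sequence this short the two counts may not differ much; the effect is expected to show on realistic data.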

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12131334/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12131334/full.md

## References

5 references — full list in the complete paper: https://tomesphere.com/paper/PMC12131334/full.md

---
Source: https://tomesphere.com/paper/PMC12131334