# Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing

**Authors:** Pengyao Ping, Jinyan Li

PMC · DOI: 10.1093/bioadv/vbaf081 · Bioinformatics Advances · 2025-04-10

## TL;DR

This paper introduces reads2graph, a fast and efficient method to find pairs of short DNA reads with small differences, improving error correction in large sequencing datasets.

## Contribution

The novel reads2graph method combines minimizer-bucketing, improved Order-Min-Hash, and graph multi-hop traversal for high completeness and speed.

## Key findings

- reads2graph achieves 97%–100% completeness in detecting small edit-distance pairs in most cases.
- The method outperforms brute-force identification in speed while maintaining high accuracy.
- Combining bucketing techniques gives a better speed-completeness balance than a single bucketing method such as Miniception or Order-Min-Hash.

## Abstract

Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads due to its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, allowing edit-distance calculations only within each bucket. Open challenges remain, including minimizing missed pairs, optimizing bucketing parameters, and combining bucketing methods to improve pair detection.
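To make the bucketing idea concrete, here is a minimal Python sketch, not the reads2graph implementation. It assumes the simplest possible minimizer (the lexicographically smallest k-mer of the whole read) and a plain Levenshtein distance; the function names `minimizer`, `edit_distance`, and `bucketed_pairs`, and the toy reads, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def minimizer(read: str, k: int) -> str:
    """Smallest k-mer of the read under lexicographic order (an assumed ordering)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def bucketed_pairs(reads, k, max_dist):
    """Compare reads only when they share a minimizer; return {(i, j): distance}."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        buckets[minimizer(read, k)].append(idx)
    pairs = {}
    for members in buckets.values():
        for i, j in combinations(members, 2):
            d = edit_distance(reads[i], reads[j])
            if d <= max_dist:
                pairs[(i, j)] = d
    return pairs


# Toy input: reads 0, 1 and 2 are all within edit distance 2 of each other.
reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC", "TTTTGGGGCC"]
print(bucketed_pairs(reads, k=4, max_dist=2))  # {(0, 1): 1}
```

In this toy input, read 2 differs from read 0 by a single substitution, but that substitution changes its minimizer from `ACGT` to `AACG`, so the two reads land in different buckets and the pair is missed. This is the kind of missing pair that motivates combining several bucketing passes.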

We define an edit-distance graph for a set of short reads, where nodes represent reads, and edges connect reads with small edit distances, and present a heuristic method, reads2graph, for high completeness of edge detection. Reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that reads2graph can achieve 97%–100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash.
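The graph definition and the completeness measure can be illustrated with a short Python sketch. This summary does not describe how the multi-hop traversal works, so the version below is only one plausible reading: starting from edges already found, it tests each read against the reads reachable within a fixed number of hops. All names (`brute_force_edges`, `multi_hop_edges`, `completeness`), the `hops` parameter, and the toy reads are assumptions made for illustration, not the authors' code.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def brute_force_edges(reads, max_dist):
    """Ground-truth edge set: every pair (i < j) within max_dist, O(n^2) pairs."""
    return {(i, j) for i, j in combinations(range(len(reads)), 2)
            if edit_distance(reads[i], reads[j]) <= max_dist}


def multi_hop_edges(reads, edges, max_dist, hops=2):
    """Assumed traversal: test each node against nodes within `hops` steps."""
    adj = {i: set() for i in range(len(reads))}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    found = set(edges)
    for start in adj:
        frontier, seen = {start}, {start}
        for _ in range(hops):
            # Expand one hop over the ORIGINAL edges, skipping visited nodes.
            frontier = {n for f in frontier for n in adj[f]} - seen
            seen |= frontier
        for other in seen - {start}:
            pair = (min(start, other), max(start, other))
            if pair not in found and \
                    edit_distance(reads[start], reads[other]) <= max_dist:
                found.add(pair)
    return found


def completeness(found, truth):
    """Fraction of ground-truth edges that were detected."""
    return len(found & truth) / len(truth) if truth else 1.0


reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC", "TTTTGGGGCC"]
truth = brute_force_edges(reads, max_dist=2)
seed = {(0, 1), (0, 2)}  # pretend bucketing found these two edges only
extended = multi_hop_edges(reads, seed, max_dist=2)
print(completeness(seed, truth), completeness(extended, truth))
```

Here the seed edges cover two of the three true edges; reads 1 and 2 are two hops apart through read 0, so the traversal tests that pair and recovers the remaining edge without comparing every read against every other read.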

reads2graph is publicly available at https://github.com/JappyPing/reads2graph.

## Full-text entities

- **Chemicals:** Thymine (MESH:D013941), SRR1543964 (-), Guanine (MESH:D006147), Cytosine (MESH:D003596), Adenine (MESH:D000225), Uracil (MESH:D014498)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12040381/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12040381/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12040381/full.md

---
Source: https://tomesphere.com/paper/PMC12040381