FastDup: a scalable duplicate marking tool using speculation-and-test mechanism
Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, and Guangming Tan

TL;DR
FastDup is a scalable, high-performance duplicate marking tool for gene sequence analysis that achieves up to 20x higher throughput than Picard MarkDuplicates while producing identical output.
Contribution
FastDup introduces a speculation-and-test mechanism to improve duplicate marking efficiency, achieving up to 20x speedup over Picard MarkDuplicates.
Findings
Achieves up to 20x throughput speedup
Guarantees 100% identical output to Picard MarkDuplicates
Scalable solution suitable for large datasets
Abstract
Duplicate marking is a critical preprocessing step in gene sequence analysis to flag redundant reads arising from polymerase chain reaction (PCR) amplification and sequencing artifacts. Although Picard MarkDuplicates is widely recognized as the gold-standard tool, its single-threaded implementation and reliance on global sorting result in significant computational and resource overhead, limiting its efficiency on large-scale datasets. Here, we introduce FastDup: a high-performance, scalable solution that follows a speculation-and-test mechanism. FastDup achieves up to 20x throughput speedup and guarantees output 100% identical to that of Picard MarkDuplicates. FastDup is a C++ program available from GitHub (https://github.com/zzhofict/FastDup.git) under the MIT license.
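To make the duplicate marking task itself concrete, here is a minimal sketch in Python. It illustrates only the general idea of flagging redundant reads. It is not FastDup's speculation-and-test mechanism, which the abstract does not describe, and it does not reproduce Picard MarkDuplicates' actual rules. The `Read` fields, the grouping key (chromosome, 5' position, strand), and the "keep the read with the highest quality sum" rule are simplifying assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    # All fields are hypothetical simplifications of an aligned read.
    name: str
    chrom: str
    five_prime_pos: int   # assumed 5' alignment coordinate
    reverse: bool         # True if aligned to the reverse strand
    quality_sum: int      # assumed summary score of base qualities
    is_duplicate: bool = False


def mark_duplicates(reads):
    """Flag all but the best-scoring read in each (chrom, pos, strand) group."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.reverse)].append(read)

    for group in groups.values():
        # Keep the read with the highest quality sum; ties keep the first seen.
        best = max(group, key=lambda r: r.quality_sum)
        for read in group:
            read.is_duplicate = read is not best
    return reads


reads = [
    Read("r1", "chr1", 100, False, 900),
    Read("r2", "chr1", 100, False, 950),  # same key as r1, higher score
    Read("r3", "chr1", 100, True, 800),   # opposite strand, separate group
    Read("r4", "chr2", 100, False, 700),
]
mark_duplicates(reads)
duplicate_names = [r.name for r in reads if r.is_duplicate]
print(duplicate_names)
```

In this toy input only `r1` shares a key with a better-scoring read, so it is the only read flagged. A production tool must additionally handle paired-end reads, clipping, and deterministic tie-breaking, all of which this sketch omits.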
Taxonomy
Topics: Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications · Software Testing and Debugging Techniques
Methods: Parsing Incrementally for Constrained Auto-Regressive Decoding
