OptimOTU: Taxonomically aware OTU clustering with optimized thresholds and a bioinformatics workflow for metabarcoding data
Brendan Furneaux, Sten Anslan, Panu Somervuo, Jenni Hultman, Nerea Abrego, Tomas Roslin, Otso Ovaskainen

TL;DR
OptimOTU is a novel, taxonomically aware OTU clustering method that optimizes thresholds based on reference taxonomy, improving accuracy in metabarcoding data analysis, and is implemented in a scalable bioinformatics pipeline.
Contribution
It introduces a taxonomically informed clustering algorithm with optimized thresholds and a comprehensive workflow for large-scale metabarcoding data analysis.
Findings
Accurately clusters sequences into named taxonomic groups and placeholder "pseudotaxa".
Scales efficiently to datasets with millions of reads and thousands of samples.
Provides an open-source R package with C++ speed enhancements.
Abstract
To turn environmentally derived metabarcoding data into community matrices for ecological analysis, sequences must first be clustered into operational taxonomic units (OTUs). This task is particularly complex for data including large numbers of taxa with incomplete reference libraries. OptimOTU offers a taxonomically aware approach to OTU clustering. It uses a set of taxonomically identified reference sequences to choose optimal genetic distance thresholds for grouping each ancestor taxon into clusters which most closely match its descendant taxa. Then, query sequences are clustered according to preliminary taxonomic identifications and the optimized thresholds for their ancestor taxon. The process follows the taxonomic hierarchy, resulting in a full taxonomic classification of all the query sequences into named taxonomic groups as well as placeholder "pseudotaxa" which accommodate the…
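The threshold-optimization step described in the abstract can be illustrated with a minimal sketch. This is not the OptimOTU implementation: the function names (`single_linkage`, `pairwise_f1`, `optimize_threshold`), the use of single-linkage clustering, the pairwise F-measure as the agreement score, and the toy distance matrix are all assumptions made for illustration. The idea shown is only the general one: for the reference sequences of one ancestor taxon, try several candidate distance thresholds and keep the one whose clusters best match the known descendant taxa.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Join any two items whose distance is <= threshold (union-find).

    Returns one cluster id per item. Single linkage is an assumption here,
    chosen because it is simple to write down.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def pairwise_f1(clusters, labels):
    """Agreement between clusters and reference taxa, scored over item pairs.

    This score is a hypothetical stand-in for whatever measure the paper uses.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def optimize_threshold(dist, labels, candidates):
    """Return the candidate threshold whose clusters best match the labels."""
    return max(
        candidates,
        key=lambda t: pairwise_f1(single_linkage(dist, t), labels),
    )


# Toy reference set for one hypothetical genus: two species, two sequences each.
# Distances are invented: 0.01 within a species, 0.10 between species.
labels = ["sp_A", "sp_A", "sp_B", "sp_B"]
dist = [
    [0.00, 0.01, 0.10, 0.10],
    [0.01, 0.00, 0.10, 0.10],
    [0.10, 0.10, 0.00, 0.01],
    [0.10, 0.10, 0.01, 0.00],
]
best = optimize_threshold(dist, labels, candidates=[0.005, 0.03, 0.15])
print(best)
```

In this toy case the smallest candidate splits every sequence into its own cluster and the largest merges both species, so the middle candidate is selected. In the approach the abstract describes, a threshold chosen this way for an ancestor taxon would then be applied when clustering query sequences assigned to that taxon.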
Taxonomy
Topics: Environmental DNA in Biodiversity Studies · Gene expression and cancer classification · Microbial Community Ecology and Physiology
Methods: Attentive Walk-Aggregating Graph Neural Network · Sparse Evolutionary Training
