# PhyloRef: A Semi‐Automated Workflow for eDNA Reference Database Curation via Phylogenetic Anomaly Detection

**Authors:** Yan Mai, Chenhong Li

PMC · DOI: 10.1002/ece3.73159 · Ecology and Evolution · 2026-02-26

## TL;DR

PhyloRef is a Snakemake-based, semi-automated workflow that curates eDNA reference databases by using phylogenetic anomaly detection to remove or flag problematic sequences.

## Contribution

PhyloRef introduces a semi-automated workflow using phylogenetic anomaly detection to curate eDNA reference databases.

## Key findings

- PhyloRef identified and removed 410 anomalous sequences (9 chondrichthyan, 401 actinopterygian) from NCBI mitochondrial genome datasets of cartilaginous and ray-finned fishes.
- The workflow flagged 606 sequences (9 chondrichthyan, 597 actinopterygian) with "similar_to=" labels to highlight potential misidentification risks.
- The curated databases contained 380 sequences (266 species) for chondrichthyan fishes and 7258 sequences (4887 species) for actinopterygian fishes.
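The totals above can be reconciled with the per-group counts reported in the abstract. The short check below is only an arithmetic sketch: it assumes that each initial dataset size equals the curated size plus the number of removed sequences (an assumption not stated explicitly in this summary), and the variable names are illustrative.

```python
# Per-group counts as reported in the abstract.
removed = {"Chondrichthyes": 9, "Actinopterygii": 401}
curated = {"Chondrichthyes": 380, "Actinopterygii": 7258}
flagged = {"Chondrichthyes": 9, "Actinopterygii": 597}

# Assumption: initial size = curated size + removed sequences.
initial = {group: curated[group] + removed[group] for group in curated}

# Share of each initial dataset that was removed, in percent.
pct_removed = {
    group: round(100 * removed[group] / initial[group], 1) for group in curated
}

total_removed = sum(removed.values())
total_flagged = sum(flagged.values())
total_curated = sum(curated.values())

for group in curated:
    print(group, initial[group], f"{pct_removed[group]}% removed")
print(total_removed, total_flagged, total_curated)
```

Under that assumption, the computed percentages agree with the approximate values (~2.3% and ~5.2%) given in the abstract, and the sums match the 410 removed and 606 flagged sequences listed above.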

## Abstract

Environmental DNA (eDNA) analysis depends critically on high‐quality reference databases. However, widely used public repositories (e.g., NCBI) frequently suffer from annotation errors, species misidentification, and sequence contamination, leading to unreliable biodiversity assessments. To address these issues, we introduce PhyloRef, a Snakemake‐based, semi‐automated, phylogeny‐guided workflow for reference library curation. PhyloRef improves scalability via taxonomic grouping, detects problematic records using clustering‐based anomaly detection rather than rigid monophyly requirements, and conservatively flags ambiguous cases using a “similar_to=” annotation. PhyloRef leverages complete mitochondrial genomes while flexibly incorporating single‐gene sequences to maximize taxonomic coverage when complete genomes are scarce. The workflow categorizes anomalies into three types: (1) single‐sequence outliers, (2) inconsistent sequence pairs, and (3) minority deviations within multi‐sequence clusters. These are either flagged for manual review via convenient visualizations or, optionally, deleted automatically. Importantly, sequences with ambiguous phylogenetic placement are annotated with a “similar_to=” label to alert users to potential uncertainty. We validated PhyloRef using mitochondrial genome datasets for Chondrichthyes (cartilaginous fishes) and Actinopterygii (ray‐finned fishes) extracted from NCBI. The tool identified and removed nine anomalous chondrichthyan sequences and 401 anomalous actinopterygian sequences (~2.3% and ~5.2% of the initial datasets, respectively), yielding curated databases of 380 sequences (266 species) and 7258 sequences (4887 species), respectively. In addition, nine chondrichthyan sequences and 597 actinopterygian sequences were flagged with a “similar_to=” label to reduce the risk of misidentification in downstream eDNA analyses. This resource enhances the reliability of eDNA‐based biodiversity and ecological studies. Future directions include integrating machine learning for anomaly detection, incorporating nuclear markers for improved taxonomic resolution, and developing automated updating modules.
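The three anomaly categories named in the abstract can be illustrated with a toy classifier. This is a minimal sketch of one plausible reading of those categories, not PhyloRef's actual algorithm: the `Record` type, the integer cluster ids (standing in for the output of some tree-based clustering step), the genus-based rule for single sequences, and the strict-majority rule are all assumptions introduced here, and the species names are placeholders rather than real data.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    seq_id: str
    species: str  # "Genus epithet" (placeholder names below)
    cluster: int  # hypothetical cluster id from a tree-based clustering step


def genus(species: str) -> str:
    return species.split()[0]


def classify(records):
    """Map seq_id -> anomaly category for sequences that look anomalous."""
    by_species = defaultdict(list)
    by_cluster = defaultdict(list)
    for r in records:
        by_species[r.species].append(r)
        by_cluster[r.cluster].append(r)

    flags = {}
    for species, recs in by_species.items():
        if len(recs) == 1:
            # Type 1 (assumed rule): a lone sequence clustered only with other genera.
            r = recs[0]
            others = [n for n in by_cluster[r.cluster] if n.seq_id != r.seq_id]
            if others and all(genus(n.species) != genus(species) for n in others):
                flags[r.seq_id] = "single_sequence_outlier"
        elif len(recs) == 2:
            # Type 2: two conspecific sequences that fall in different clusters.
            if recs[0].cluster != recs[1].cluster:
                for r in recs:
                    flags[r.seq_id] = "inconsistent_pair"
        else:
            # Type 3: sequences outside a strict-majority cluster of their species.
            counts = Counter(r.cluster for r in recs)
            top_cluster, top_n = counts.most_common(1)[0]
            if 2 * top_n > len(recs):
                for r in recs:
                    if r.cluster != top_cluster:
                        flags[r.seq_id] = "minority_deviation"
    return flags


# Toy input with placeholder taxa; not taken from the paper.
records = [
    Record("A1", "Aus primus", 1),
    Record("A2", "Aus primus", 1),
    Record("A3", "Aus primus", 2),
    Record("B1", "Aus secundus", 1),
    Record("B2", "Aus secundus", 3),
    Record("C1", "Bus tertius", 2),
    Record("D1", "Cus quartus", 4),
]
flags = classify(records)
```

In this toy input, `A3` is a minority deviation, `B1` and `B2` form an inconsistent pair, `C1` is a single-sequence outlier, and `D1` is left unflagged because it has no cluster neighbours to compare against. In a pair, neither member can be singled out from topology alone, which is one reason manual review is offered as an alternative to automatic deletion.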

PhyloRef processes complete mitochondrial genomes with optional multi‐gene concatenation, flags three categories of phylogenetic anomalies based on tree topology, and annotates ambiguous sequences with “similar_to=” labels. Applied to chondrichthyan and actinopterygian sequences from NCBI, the workflow identified and removed 410 anomalous entries, yielding curated databases totalling 7638 sequences.
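Downstream eDNA pipelines can treat “similar_to=” records differently from unflagged ones, for example by reporting hits to them at a coarser taxonomic level. The sketch below shows how such records might be separated when reading a FASTA file. The header layout (`>ID Genus_species similar_to=Other_species`) and the sequence identifiers are hypothetical; this summary does not specify the exact format PhyloRef writes.

```python
import re

# Hypothetical header layout: ">ID Genus_species similar_to=Other_species".
# The real PhyloRef output format may differ.
SIMILAR_RE = re.compile(r"similar_to=(\S+)")


def similar_to(header: str):
    """Return the taxon named in a similar_to= tag, or None if the tag is absent."""
    match = SIMILAR_RE.search(header)
    return match.group(1) if match else None


def split_by_flag(fasta_text: str):
    """Split FASTA headers into (unflagged, flagged) lists, dropping the '>'."""
    unflagged, flagged = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            target = flagged if similar_to(line) else unflagged
            target.append(line[1:])
    return unflagged, flagged


# Toy FASTA text with placeholder identifiers and taxa.
example = """>SEQ0001 Aus_primus
ACGTACGT
>SEQ0002 Aus_secundus similar_to=Aus_primus
ACGTACGA
"""
unflagged, flagged = split_by_flag(example)
```

Here `flagged` holds the single header carrying a `similar_to=` tag, and `similar_to(flagged[0])` returns the taxon it names.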

## Linked entities

- **Species:** Chondrichthyes (taxon 7777), Actinopterygii (taxon 7898)

## Full-text entities

- **Genes:** COX1 [NCBI Gene 808417], ATP8 [NCBI Gene 808420], CYTB [NCBI Gene 808423], ND4L [NCBI Gene 808426], ATP6 [NCBI Gene 808428]
- **Species:** Somniosus pacificus (Pacific sleeper shark, species) [taxon 305516], Microphysogobio tungtingensis (long-nosed gudgeon, species) [taxon 328543], Carassius auratus (goldfish, species) [taxon 7957], Chondrichthyes (cartilaginous fishes, class) [taxon 7777], Maccullochella macquariensis (trout cod, species) [taxon 135760], Pempheridae (sweepers, family) [taxon 30859], Coccophora langsdorfii (species) [taxon 74099], Carassius gibelio (gibel carp, species) [taxon 101364], Epinephelus bruneus (longtooth grouper, species) [taxon 323802], Carassius cuvieri (Japanese crucian carp, species) [taxon 52617], Actinopterygii (fishes, superclass) [taxon 7898], Engraulis encrasicolus (European anchovy, species) [taxon 184585], Carassius (genus) [taxon 7956], Carassius carassius (crucian carp, species) [taxon 217509], Atlantic sailfish [taxon 215398], Pempheris schwenkii (black-stripe sweeper, species) [taxon 463600], Somniosus microcephalus (Greenland shark, species) [taxon 191813], Centropyge interrupta (Japanese pygmy angelfish, species) [taxon 1474813], Epinephelus moara (kelp grouper, species) [taxon 300413]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12946455/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12946455/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12946455/full.md

---
Source: https://tomesphere.com/paper/PMC12946455