clustur: an R package for clustering features using sparse distance matrices
Gregory Johnson, Sarah L. Westcott, Patrick D. Schloss

TL;DR
The clustur R package provides tools for clustering 16S rRNA gene sequences into OTUs using algorithms from mothur, making them available in R for broader use.
Contribution
clustur introduces mothur's clustering algorithms into the R ecosystem, enabling easier integration and development within R.
Findings
clustur implements de novo clustering algorithms from mothur for OTU assignment.
The package enhances accessibility and integration of these algorithms within R.
It supports broader application and further development in the R ecosystem.
Abstract
The clustur R package implements the de novo clustering algorithms found in the mothur software package for assigning 16S rRNA gene sequences to operational taxonomic units (OTUs). Making these algorithms accessible through the R ecosystem will foster their further development, broader application, and integration within other R packages.
Funding: HHS | National Institutes of Health (NIH)
Taxonomy
Topics: Genomics and Phylogenetic Studies · Gut microbiota and health · Genetic diversity and population structure
ANNOUNCEMENT
Taxonomic classification of 16S rRNA gene sequences has been a persistent challenge in microbial ecology studies because reference databases are incomplete (1). As an alternative, operational taxonomic units (OTUs) have been widely used for describing and comparing microbial communities. Although their biological interpretation is controversial, an OTU is typically defined as a group of sequences that are more than 97% similar or less than 3% dissimilar to each other (2). Methods for applying that definition have generated a sizable literature. Three general approaches have emerged for assigning sequences to OTUs: de novo clustering, closed reference clustering or phylotyping, and open reference clustering (3–9). These methods are available through popular packages, including mothur and QIIME2 (10, 11).
The clustur R package implements the de novo clustering algorithms found in mothur. The package name references its focus on clustering and the names of its predecessors, DOTUR and mothur (10, 12). This package was developed to help address two issues. First, users can more easily integrate the type of analysis that mothur specializes in with popular analysis and visualization packages within the R package ecosystem. Second, by making the code behind mothur’s clustering functions accessible through the R language, we hope to encourage further development of the algorithms behind these functions and of analyses based on their output. The clustur package implements hierarchical clustering algorithms, including the furthest, nearest, unweighted (i.e., average), and weighted neighbor clustering algorithms, as well as the OptiClust algorithm. Functions implementing the hierarchical algorithms already exist within R; however, the implementations within clustur take a sparse distance matrix as input and output data for a single distance threshold. Censoring distances larger than the threshold and outputting data for only a single threshold reduces memory requirements and execution times (4). To preserve the speed of the functions, clustur uses the Rcpp R package to wrap the C++ code originally written for the mothur software package.
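To make the idea of clustering from a sparse distance list at a single threshold concrete, here is a minimal Python sketch of nearest neighbor (single-linkage) clustering. It is an illustration only, not clustur's or mothur's implementation: the function name `nearest_neighbor_otus`, the tuple layout of the input, the OTU labels, and the inclusive `<=` comparison are all assumptions made for this example. At a fixed threshold, single-linkage clusters are the connected components of the graph whose edges are the pairs within the threshold, so a union-find structure suffices.

```python
from collections import defaultdict


def nearest_neighbor_otus(sparse_distances, sequence_ids, threshold=0.03):
    """Group sequences into OTUs by single linkage at one distance threshold.

    sparse_distances: iterable of (seq_i, seq_j, distance) tuples; pairs that
    are missing are treated as being farther apart than the threshold.
    (Hypothetical helper for illustration, not part of clustur.)
    """
    parent = {seq: seq for seq in sequence_ids}

    def find(seq):
        # Follow parent links to the root, halving the path as we go.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for seq_i, seq_j, distance in sparse_distances:
        if distance <= threshold:  # inclusive cutoff is an assumption
            root_i, root_j = find(seq_i), find(seq_j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = defaultdict(list)
    for seq in sequence_ids:
        groups[find(seq)].append(seq)

    # Label OTUs from largest to smallest; ties broken by member names.
    ordered = sorted(groups.values(), key=lambda members: (-len(members), members))
    return {f"otu{i + 1}": sorted(members) for i, members in enumerate(ordered)}


# Only pairs within the threshold are listed; seqA-seqC is absent on purpose.
sparse_distances = [
    ("seqA", "seqB", 0.010),
    ("seqB", "seqC", 0.020),
    ("seqD", "seqE", 0.025),
]
sequence_ids = ["seqA", "seqB", "seqC", "seqD", "seqE", "seqF"]
otus = nearest_neighbor_otus(sparse_distances, sequence_ids)
```

In this toy input, seqA and seqC end up in the same OTU even though their own distance was censored, because nearest neighbor clustering only needs one link within the threshold. A furthest neighbor rule would instead require every pair in an OTU to be within the threshold, which is why the choice of algorithm matters even when the input is identical.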
Users can install the clustur package via CRAN or through the devtools package’s install_github function. The primary inputs to clustur’s functions are a sparse distance matrix and a count file. The sparse distance matrix is a data.table package object with two columns indicating the identifiers of the sequences being compared and a third column with the distance between those sequences; comparisons with a distance larger than the desired threshold (e.g., 0.03) do not need to be included. The count file is a data.table package object indicating the number of times a sequence is found in each sample. The cluster functions output two data.table objects. The first has two columns indicating the sequences and OTU identifiers. The second displays the abundance of each sequence in each OTU. Together, these outputs provide the same functionality as the cluster and make.shared functions from mothur. Detailed vignettes are available within the package to teach users how to install the package, use its functions, and perform downstream analyses, including analysis within the vegan and ggplot2 R packages.
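The relationship between the count table, the sequence-to-OTU assignments, and a per-sample abundance table can be sketched in a few lines of Python. This is a hedged illustration with plain dictionaries standing in for the data.table objects; the function name `otu_abundance` and the sample and sequence names are hypothetical, and the sketch assumes the abundance output is comparable to mothur's make.shared table, that is, the number of reads from each sample that fall in each OTU.

```python
from collections import defaultdict


def otu_abundance(seq_to_otu, counts):
    """Sum per-sample sequence counts within each OTU.

    seq_to_otu: dict mapping a sequence identifier to its OTU label.
    counts: dict mapping a sequence identifier to {sample: count}.
    (Hypothetical helper for illustration, not part of clustur.)
    """
    table = defaultdict(lambda: defaultdict(int))
    for seq, per_sample in counts.items():
        otu = seq_to_otu[seq]
        for sample, n in per_sample.items():
            table[otu][sample] += n
    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {otu: dict(samples) for otu, samples in table.items()}


# Toy stand-ins for the two tables described in the text.
seq_to_otu = {"seqA": "otu1", "seqB": "otu1", "seqC": "otu2"}
counts = {
    "seqA": {"sample1": 5, "sample2": 0},
    "seqB": {"sample1": 2, "sample2": 3},
    "seqC": {"sample1": 0, "sample2": 4},
}
shared = otu_abundance(seq_to_otu, counts)
```

Here otu1 collects the reads of seqA and seqB, so its abundance in each sample is the sum of those two sequences' counts. A table in this shape is what community ecology tools typically consume for downstream diversity analyses.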
REFERENCES
- 1. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73:5261–5267. doi:10.1128/AEM.00062-07. PMID 17586664. PMCID PMC1950982.
- 2. Stackebrandt E, Goebel BM. 1994. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Evol Microbiol 44:846–849. doi:10.1099/00207713-44-4-846.
- 3. Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, Ursell LK, Lauber C, Zhou H, Song SJ, Huntley J, Ackermann GL, Berg-Lyons D, Holmes S, Caporaso JG, Knight R. 2013. Advancing our understanding of the human microbiome using QIIME, p 371–444. In Microbial metagenomics, metatranscriptomics, and metaproteomics. Elsevier. doi:10.1016/B978-0-12-407863-5.00019-8. PMID 24060131. PMCID PMC4517945.
- 4. Schloss PD, Westcott SL. 2011. Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol 77:3219–3226. doi:10.1128/AEM.02810-10. PMID 21421784. PMCID PMC3126452.
- 5. Schloss PD. 2016. Application of a database-independent approach to assess the quality of operational taxonomic unit picking methods. mSystems 1. doi:10.1128/mSystems.00027-16. PMID 27832214. PMCID PMC5069744.
- 6. Westcott SL, Schloss PD. 2015. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ 3:e1487. doi:10.7717/peerj.1487. PMID 26664811. PMCID PMC4675110.
- 7. Westcott SL, Schloss PD. 2017. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2:e00073-17. doi:10.1128/mSphereDirect.00073-17. PMID 28289728. PMCID PMC5343174.
- 8. Kopylova E, Navas-Molina JA, Mercier C, Xu ZZ, Mahé F, He Y, Zhou H-W, Rognes T, Caporaso JG, Knight R. 2016. Open-source sequence clustering methods improve the state of the art. mSystems 1. doi:10.1128/mSystems.00003-15. PMID 27822515. PMCID PMC5069751.
