Hierarchical clustering of DNA k-mer counts in RNA-seq fastq files reveals batch effects
Wolfgang Kaisers, Holger Schwender, Heiner Schaal

TL;DR
This study demonstrates that hierarchical clustering of DNA k-mer counts in RNA-seq fastq files can effectively detect batch effects, serving as a simple diagnostic tool for quality control in high-throughput sequencing data.
Contribution
The paper introduces a novel application of hierarchical clustering on DNA k-mer counts to identify batch effects in RNA-seq data, implemented in an accessible R package.
Findings
Batch effects detected in 60.7% of Flowcell comparisons
Hierarchical clustering reveals strong separation by batch
Filtering for high-quality reads does not eliminate batch effects
Abstract
Batch effects, artificial sources of variation due to experimental design, are a widespread phenomenon in high-throughput data. Detecting them requires comparing multiple samples, so suitable detection methods are needed. We apply hierarchical clustering (HC) to DNA k-mer counts from multiple RNA-seq derived Fastq files. Ideally, HC-generated trees reflect experimental treatment groups and thus indicate experimental effects; clustering by preparation group instead indicates the presence of batch effects. To provide a simple, readily applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. DNA k-mer counts were analysed on 61 Fastq files containing RNA-seq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced on 8 different Illumina Flowcells. Results: Pairwise comparison of…
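The idea in the abstract can be illustrated with a small, self-contained Python sketch. This is not the seqTools API (seqTools is an R package); it is a minimal illustration under stated assumptions. The sample names and toy reads are hypothetical, k = 2 is chosen only to keep the example small, and Manhattan distance on relative k-mer frequencies with average linkage is an assumed choice, not necessarily the distance or linkage used in the paper.

```python
from collections import Counter
from itertools import combinations, product


def kmer_counts(reads, k):
    """Count DNA k-mers over an iterable of read sequences, skipping k-mers with N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def profile(counts, k):
    """Relative k-mer frequencies in a fixed lexicographic order (4**k entries)."""
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def manhattan(u, v):
    """Sum of absolute differences (assumed distance for this sketch)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def average_linkage(profiles, dist=manhattan):
    """Naive agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            a, b = pair
            total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
            return total / (len(a) * len(b))

        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy data: two cell types, each sequenced on two flowcells.
# The flowcell (batch) is made to dominate base composition on purpose.
K = 2
samples = {
    "fib_flowcell1": ["ATATATTAAT", "TTATAATATA"],
    "jurkat_flowcell1": ["ATTATATAAT", "TATATTATAA"],
    "fib_flowcell2": ["GCGCGGCCGC", "CGCGCCGGCG"],
    "jurkat_flowcell2": ["GGCGCGCCGC", "CCGCGGCGCG"],
}
profiles = {name: profile(kmer_counts(reads, K), K) for name, reads in samples.items()}
merges = average_linkage(profiles)

for left, right, distance in merges:
    print(f"{left} + {right}  (distance {distance:.3f})")
```

In this toy setup the final merge joins the two flowcell groups rather than the two cell types, which is the tree shape the abstract describes as a sign of batch effects. With real Fastq files the reads would be streamed from disk and k would typically be larger.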
Taxonomy
Topics: Molecular Biology Techniques and Applications · Gene expression and cancer classification · RNA and protein synthesis mechanisms
