An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing
Philipp Röchner, Clarissa Krämer, Johannes U Mayer, Franz Rothlauf, Steffen Albrecht, Maximilian Sprang

TL;DR
This paper introduces a new dataset with diverse feature representations for NGS quality control, enabling better automated detection of quality issues across different experimental settings.
Contribution
The paper provides a novel dataset with two types of features derived from NGS samples, facilitating research on feature effectiveness in quality problem detection.
Findings
Supervised machine learning accurately predicted quality labels from the features.
The dataset enables comparison of different feature types and granularities for quality assessment.
Only 3.2% of samples are of low quality, so the dataset is strongly class-imbalanced.
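To illustrate why the 3.2% low-quality rate matters when evaluating a classifier, here is a minimal sketch in plain Python. It uses synthetic labels (1,000 samples, 32 of them low quality), not the actual dataset, and the helper names `confusion_counts`, `accuracy`, `recall`, and `balanced_accuracy` are hypothetical. It shows that a trivial model that always predicts "high quality" reaches high accuracy while detecting no low-quality sample at all.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary labeling; `positive` marks low quality."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of low-quality samples that were detected."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the low-quality class and on the high-quality class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (recall_pos + recall_neg)


# Synthetic labels mirroring the reported imbalance: 1 = low quality, 0 = high quality.
n_total = 1000
n_low = 32  # 3.2% of 1,000
y_true = [1] * n_low + [0] * (n_total - n_low)

# Trivial baseline: always predict "high quality".
y_pred_majority = [0] * n_total

acc = accuracy(y_true, y_pred_majority)
rec = recall(y_true, y_pred_majority)
bal = balanced_accuracy(y_true, y_pred_majority)

print(f"accuracy={acc:.3f} recall={rec:.3f} balanced_accuracy={bal:.3f}")
```

For this reason, metrics that account for the minority class, such as recall or balanced accuracy, are usually more informative than plain accuracy on data with this kind of imbalance.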
Abstract
Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same…
