# RNA-seq-derived sequence variations are excellent features for cell line identification

**Authors:** Lisa Müller, Simon Müller, Khursheed Ul Islam Mir, Jana Lange, Sven Hagemann, Alice Wedler, Frank Hause, Claudia Misiak, Danny Misiak, Tony Gutschner, Stefan Hüttelmaier, Markus Glaß

PMC · DOI: 10.1016/j.csbj.2025.10.039 · Computational and Structural Biotechnology Journal · 2025-10-31

## TL;DR

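This paper shows that sequence variations derived from RNA-seq data can be used to identify human cell lines and detect cross-contamination, so authentication can run on the same data that underlie the reported results instead of relying only on short tandem repeat profiling of a separately tested sample.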

## Contribution

The study introduces methods that combine RNA-seq-derived sequence variations with supervised machine learning for cell line identification and cross-contamination detection, and it releases the topFracCCLE algorithm as an R script.

## Key findings

- RNA-seq-derived sequence variations enable unambiguous clustering of cell lines.
- A supervised machine learning approach reliably identifies cell lines and detects cross-contamination.
- The proposed method is robust to different data pre-processing steps and quality measures.

## Abstract

Cell lines are indispensable models for analyzing molecular mechanisms underlying human diseases. However, incorrect annotation and cross-contamination can introduce severe bias into the studies that rely on them. Accordingly, various publishers request authentication of cell lines before publication. Short tandem repeat profiling is commonly used to verify cell line identity and purity but does not guarantee that published results are based on the samples tested by this method. In this study, we demonstrate that RNA-seq-derived sequence variation information is suitable for unambiguous cell line-specific clustering. Based on this finding, we propose methods for reliable cell line identification from RNA-seq data using supervised machine learning. In addition, we demonstrate the ability to detect cross-contamination of human cell lines. The presented methods are insensitive to different data pre-processing steps and quality measures. The proposed topFracCCLE algorithm for cell line identification and detection of cross-contamination is available as an R script at https://github.com/HuettelmaierLab/topFracCCLE.
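
To make the idea in the abstract concrete, the following is a minimal Python sketch of matching a sample to reference cell lines by shared sequence variants. It is not the topFracCCLE algorithm and not the supervised learning approach of the paper: it swaps in a plain Jaccard similarity over variant sets as a stand-in. The function names, the `min_extra` threshold, and the variant keys are all hypothetical, and the toy profiles attached to the cell line names are invented for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Variant = str  # hypothetical key format: "chrom:pos:ref>alt"


def jaccard(a: Set[Variant], b: Set[Variant]) -> float:
    """Jaccard similarity of two variant sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_cell_lines(
    sample: Set[Variant], references: Dict[str, Set[Variant]]
) -> List[Tuple[str, float]]:
    """Rank reference cell lines by similarity to the sample, best first."""
    scores = [(name, jaccard(sample, ref)) for name, ref in references.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def flag_contamination(
    sample: Set[Variant],
    references: Dict[str, Set[Variant]],
    min_extra: float = 0.2,  # assumed cutoff, not taken from the paper
) -> Tuple[str, Optional[str]]:
    """Return (best match, suspected contaminant or None).

    A second line is flagged when it explains at least `min_extra` of the
    sample's variants that the best match does not explain.
    """
    if not sample or not references:
        raise ValueError("sample and references must be non-empty")
    best = rank_cell_lines(sample, references)[0][0]
    leftover = sample - references[best]
    second, extra = None, 0.0
    for name, ref in references.items():
        if name == best:
            continue
        frac = len(leftover & ref) / len(sample)
        if frac > extra:
            second, extra = name, frac
    return best, (second if extra >= min_extra else None)


# Toy reference profiles; the variants are made up for this example.
references = {
    "HeLa": {"chr1:100:A>G", "chr2:200:C>T", "chr3:300:G>A", "chr4:400:T>C"},
    "A549": {"chr1:150:G>C", "chr2:250:T>A", "chr5:500:A>T", "chr6:600:C>G"},
    "HEK293": {"chr7:700:G>T", "chr8:800:A>C", "chr9:900:T>G", "chr1:100:A>G"},
}
pure_sample = {"chr1:100:A>G", "chr2:200:C>T", "chr3:300:G>A"}
mixed_sample = references["HeLa"] | {"chr1:150:G>C", "chr2:250:T>A", "chr5:500:A>T"}

print(rank_cell_lines(pure_sample, references)[0])
print(flag_contamination(pure_sample, references))
print(flag_contamination(mixed_sample, references))
```

With these toy inputs, the pure sample matches the "HeLa" profile with no contaminant flagged, while the mixed sample is still assigned to "HeLa" but "A549" is reported as a suspected contaminant. A realistic pipeline would start from variant calls on aligned RNA-seq reads and would need to handle expression-dependent coverage, which this sketch ignores.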

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12800365/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12800365/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12800365/full.md

---
Source: https://tomesphere.com/paper/PMC12800365