RNA-seq-derived sequence variations are excellent features for cell line identification

Lisa Müller; Simon Müller; Khursheed Ul Islam Mir; Jana Lange; Sven Hagemann; Alice Wedler; Frank Hause; Claudia Misiak; Danny Misiak; Tony Gutschner; Stefan Hüttelmaier; Markus Glaß

PMC · DOI:10.1016/j.csbj.2025.10.039·October 31, 2025

RNA-seq-derived sequence variations are excellent features for cell line identification

Lisa Müller, Simon Müller, Khursheed Ul Islam Mir, Jana Lange, Sven Hagemann, Alice Wedler, Frank Hause, Claudia Misiak, Danny Misiak, Tony Gutschner, Stefan Hüttelmaier, Markus Glaß

PDF

Open Access

TL;DR

This paper shows that RNA sequencing data can be used to accurately identify cell lines and detect contamination, offering a reliable alternative to traditional methods.

Contribution

The study introduces a novel method using RNA-seq-derived sequence variations and machine learning for cell line identification and contamination detection.

Findings

01

RNA-seq-derived sequence variations enable unambiguous clustering of cell lines.

02

A supervised machine learning approach reliably identifies cell lines and detects cross-contamination.

03

The proposed method is robust to different data pre-processing steps and quality measures.

Abstract

Cell lines are indispensable models for analyzing molecular mechanisms underlying human diseases. However, incorrect annotation and cross-contamination can introduce severe bias in respective studies. Accordingly, various publishers request authentication of cell lines before publication. Short tandem repeat profiling is commonly used to verify cell line identity and purity but does not guarantee that published results are based on the samples tested by this method. In this study, we demonstrate that RNA-seq-derived sequence variation information is eligible for unambiguous cell line-specific clustering. Based on this finding, we propose methods for reliable cell line identification from RNA-seq data using supervised machine learning methods. In addition, we demonstrate the ability to detect cross-contamination of human cell lines. The presented methods are insensitive to different data…

Figures6

Click any figure to enlarge with its caption.

Clustering of cell lines by poly-A enriched RNA-seq-derived sequence variations. A Relation between detected sequence variations and mapped reads. B Histogram showing the number of variations shared by a certain amount of samples. C Hierarchical clustering of samples based on sequence variations.Fig. 1

Clustering of cell lines by small RNA-seq-derived sequence variations. A Numbers of sequence variations found in the whole sample and miRNA loci. B Hierarchical clustering using all detected, non sample-exclusive, variations. C Hierarchical clustering using only non sample-exclusive variations detected in miRNA loci.Fig. 2

kNN classification results. A Prediction probabilities of samples from set ENCODE_polyA_, containing poly A-enriched RNA-seq samples. B Prediction probabilities of samples from set ENCODE_total_, containing total RNA-seq samples.Fig. 3

Scheme for predicting cell line identities using CCLE-derived SNP data. Image created with BioRender.Fig. 4

topFracCCLE predictions on *in silico* and *in vitro* contamination simulations. A Proportion of detected SNPs associated with H-1975 or KELLY cells in dependence of the proportion of sequencing reads originating from samples of the respective cell line. B First, second and third hit cell line reported by topFracCCLE for the respective mixture samples. Numbers in the sample labels refer to the percentage of reads from pure H-1975 samples. C Proportion of detected SNPs associated with SW1990 or Panc-03.27 cells in dependence of the proportion of cells from the respective line in the samples. D

Keywords

Cell line identificationCross-contaminationMachine learningK-nearest neighborCCLE

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSingle-cell and spatial transcriptomics · Microbial infections and disease research · Cell Image Analysis Techniques

Full text

Introduction

1

Cell lines have been used for decades as models for human diseases as well as for virus production, vaccine development and for synthesizing recombinant proteins [1]. However, mis-labeling and cross-contamination are frequently observed in research labs, requiring the occasional re-authentication of cell models, especially before publishing. Current estimates suggest that up to 36 % of cell lines are cross-contaminated or misidentified [2], indicating the necessity for rapid, cost-effective and easy-accessible methods for cell line authentication.

Cell line models are derived from individual patients or animals and thus carry the unique genomic code of the respective donor. Service providers offer authentication, usually based on short tandem repeat (STR) profiling. This method has been recommended by the American Type Culture Collection (ATCC) [3] and is frequently requested to prove the identity of cell lines used in a specific experiment. However, STR profiling does not guarantee that published results have been generated with the cells being submitted for testing, since interchanges or contamination may occur after performing the respective tests. Next generation sequencing (NGS) data provide excellent options for cell line authentication, as they offer DNA/RNA sequence information at high resolution, allowing the rapid identification of sequence variations. Although high-throughput RNA sequencing (RNA-seq) provides less variant information than DNA-based whole genome sequencing (WGS), it has been shown that RNA-seq-derived sequence variations can be used for cell line identification [4], [5].

Here, we consolidate and extend this finding by exploring requirements and limitations for robust cell line identifications based on RNA-seq-derived sequence variation information. We used unsupervised clustering to demonstrate that RNA-seq samples from distinct cell lines form separate clusters, even at comparatively moderate sequence coverage. We have trained a k-Nearest-Neighbor (kNN) classifier with sequence variation data from 22 distinct cell line models. Classification yielded perfect classification results for our in-house as well as publicly available datasets. Exploiting RNA-seq information from more than 1700 cell lines, provided by the Cancer Cell Line Encyclopedia (CCLE, [6]) project, we finally implemented a method for human cell line identification that, in addition, can be used for detecting cross-contamination of human cell lines.

Materials and methods

2

Cell culture

2.1

Cells were cultured in the appropriate medium supplemented with 1 % GlutaMAX (Thermo Fisher Scientific, Waltham, MA, USA) and fetal bovine serum (FBS) at the appropriate concentration at 37 $[eqn]$ C and 5 % CO2. Detailed information is provided in Table S1.

Library preparation and RNA sequencing

2.2

RNA integrity was assessed using a Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA). RNA sequencing (RNA-seq) libraries were prepared according to the manufacturer’s instructions. For mRNA-seq, total RNA was used as input for poly (A) enrichment with oligo (dT) beads. After fragmentation, the first strand cDNA was synthesized using random hexamer primers. Then, the second strand cDNA was synthesized using dUTP, instead of dTTP. The directional library was ready after end repair, A-tailing, adapter ligation, size selection, USER enzyme digestion, amplification, and purification. For small RNA-seq, 3’ and 5’ adaptors were ligated to 3’ and 5’ end of small RNA, respectively. Then, the first strand cDNA was synthesized after hybridization with reverse transcription primer. The double-stranded cDNA library was generated through PCR enrichment. After purification and size selection, libraries with insertions between 18 and 40 bp were ready for sequencing. Library preparation and sequencing were performed either by Novogene (Munich, Germany) or in-house (Details in Table S2.)

Sample preparation of SW1990 and Panc-03.27 for cell line authentication and RNA sequencing was performed in parallel. Cells were trypsinized and counted to ensure 1.5x10^6^ cells in each sample. For both cell lines, non-mix-up samples contained 100 % of SW1990 or Panc-03.27 cells, respectively. For mix-up samples, three conditions were performed: 50 % of SW1990 + 50 % of Panc-03.27 cells, 25 % of SW1990 + 75 % of Panc-03.27 cells, and 10 % of SW1990 + 90 % of Panc 03.27 cells. All samples were run in triplicates. For RNA sequencing, cells were homogenized in Trizol and total RNA was isolated by phenol/chloroform extraction and isopropanol precipitation. Library preparation (Poly(A) tail RNA selection) and RNA sequencing (total RNA amount $[eqn]$ 0.2 $[eqn]$ g) were performed by Novogene (Munich, Germany) on a NovaSeq X Plus Series system (Illumina, San Diego, USA).

STR profiling

2.3

Cell lines were authenticated by Eurofins Genomics (Ebersberg, Germany) STR profiling service, which carried out DNA isolation from cell pellets and determined genetic characteristics by PCR-single-locus-technology using the AmpFlSTR Identifiler Plus PCR Amplification Kit (Thermo Fisher). This kit comprises 15 core loci: D8S1179, D21S11, D7S820, CSF1PO, D3S1358, tyrosine hydroxylase (TH01), D13S317, D16S539, D2S1338, D19S433, von Willebrand factor (vWA), human thyroid peroxidase gene (TPOX), D18S51, D5S818, and $[eqn]$ -Fibrinogen (FGA) plus the gender-determining marker amelogenin (AMEL). Based on electropherograms, the analytical report lists all 16 loci and calls (cf. tables S10, S11). These alphanumerically scored alleles were entered into the STR profile database for human cell lines of the German Collection of Microorganisms and Cell Cultures (DSMZ) ([7], [8]) to compare with reference databases by checking for percent matches between cell lines or mix-ups.

Pre-processing of RNA-seq data

2.4

Adapter sequences as well as low quality read ends were clipped off using Cutadapt. The processed sequencing reads were aligned to the human reference genome (UCSC hg38) using HiSat2 [9] (polyA-RNA) or Bowtie2 [10] (smallRNA). Samtools [11] was used to extract primary alignments and to index the resulting BAM files. FeatureCounts [12] was used for summarizing gene-mapped reads, using ENSEMBL v100 [13] as annotation basis. FPKM values were obtained from raw count data by applying trimmed mean of M-values (TMM) normalization using the R-package edgeR [14], [15]. Down-sampling of BAM files was performed using the sample command from the bedtools suite [16].

Generation of simulated cell line contamination data

2.5

To simulate contamination of cells from two distinct human cell lines, defined numbers of sequencing reads were randomly sampled from FASTQ files of pure samples using the seqtk sample function (https://github.com/lh3/seqtk). Subsequently, reads from the different cell lines were merged together into new FASTQ files and processed as described above.

Variant calling

2.6

Variant calling from alignment files in BAM format was performed using mpileup and call commands from the bcftools suite [17]. The human UCSC hg38 genome sequence was used as reference. Base calling quality values of at least 20 and sequencing depths of at least 10 reads were used as filter criteria.

Clustering and k-nearest-neighbor classification

2.7

Variants found in each sample were collected in a binary matrix assigning the presence of all found variants in each sample. Variants found exclusively in one sample were removed from the matrix and pair-wise Jaccard distances of all samples were calculated using the remaining variants. Clustering was performed using the hclust function of R [18] based on the distance calculations and application of the ward.D2 algorithm. Down-sampling was performed by random sampling of aligned reads from BAM files using the sample function of the bedtools suite [16].

For kNN training, the knn3Train function of the caret R-package [19] was applied using the variant-presence matrix.

CCLE data acquisition and TopFrac classification

2.8

CCLE cell line single nucleotide polymorphism (SNP) data were gathered from depmap via the ExperimentHub R-package, table mutationCalls_22Q2 [20]. These data were used to construct a binary matrix specifying the presence of all identified SNPs in each cell line. Since the depmap-provided data annotations were based on the human genome build GRCh37, genomic positions were updated to GRCh38 using the UCSC liftover website (https://genome.ucsc.edu/cgi-bin/hgLiftOver).

For cell line identification, the number of common SNPs from the query sample and each cell line was determined.

Generation of validation data sets from public data

2.9

For validation purposes, we assembled distinct validation data sets. ENCODE_polyA and ENCODE_total were obtained by downloading hg38-mapped alignment files of the respective samples from the ENCODE data portal [21]. Subsequently, variant calling was performed as described in Section 2.6. Composition and ENCODE accession IDs of all samples can be found in Table S4 and S5.

Data set U87 was generated by obtaining the compound vcf file associated with NCBI GEO series GSE129142. The content of the compound vcf file was split to obtain single-sample vcf files and chromosome names were prefixed with “chr” to obtain vcf files compatible with topFracCCLE. To obtain a data set appropriate for use with CCLHunter, we downloaded the respective sra files and extracted FASTQ files from these using the fasterq-dump command from the sra-toolkit. Extracted paired-end FASTQ files were processed further as described in Section 2.4.

Cell line identification using CCLHunter

2.10

For comparison, we identified cell lines using the stand-alone version of CCLHunter v4.0 [22]. Prediction was performed using alignment files (bam) of the respective samples and default parameter values.

Figures

2.11

All figures were created using the R-packages ggplot2 [23] and patchwork, unless explicitly stated otherwise.

Results

3

RNA-seq-derived variant calling leads to cell line-specific clustering

3.1

For assessing whether RNA-seq samples from multiple distinct cell lines form homogeneous clusters, we assembled a dataset comprising 90 poly-A enriched samples from 22 distinct cell lines containing at least three samples per line. These samples represent control experiments from various experiments performed in our lab and comprise either wild type cells or cells transfected with control siRNAs and have been sequenced between 2017 and 2024. Cell culture, RNA isolation and library preparation have been performed by distinct experimenters and sequencing has been conducted by different facilities (cf. Table S2). Thus, the samples displayed a high degree of heterogeneity with respect to processing and technological settings. Variant calling, using the human genome (UCSC hg38) as a reference, revealed a high variability in the number of detected variants, as expected. These numbers not only drastically differ between distinct cell lines but also show substantial inter-experimental batch variation, in part even between experimental replicates (Fig. S1A).

One important factor influencing the number of variants detected is sequencing depth. This is reflected by a strong positive correlation between the number of sequenced reads and the number of variants detected for the respective sample (Fig. 1A, S1). For all samples, the vast majority of detected sequence variations was unique to the respective sample (Fig. 1B). Since cell line clustering would not benefit from including variations only present in one particular sample, sample-specific variants were ignored while constructing a binary matrix assigning all detected variants shared by at least two samples to these respective samples. Furthermore, the removal of sample-exclusive variants reduced storage requirements and runtime to create the assignment matrix considerably. Hierarchical clustering using Jaccard distances between the samples yielded sub-trees consisting entirely of samples from the same cell lines. After sub-dividing the obtained dendrogram into the 22 top sub-clusters, each sub-cluster contained samples from exactly one cell line and each cell line was represented by exactly one sub-cluster (Fig. 1C, Table S3). Thus, RNA-seq derived variations from a reference genome sequence are suitable for the discrimination of samples belonging to a certain cell line model. Notably, experiment-specific samples from the same parental cell lines formed sub-clusters as well, indicating that even passage- or experimentally motivated sequence variations are identified (Fig. 1C, Table S3).Fig. 1. Clustering of cell lines by poly-A enriched RNA-seq-derived sequence variations. A Relation between detected sequence variations and mapped reads. B Histogram showing the number of variations shared by a certain amount of samples. C Hierarchical clustering of samples based on sequence variations.Fig. 1

For comparison, we also performed hierarchical clustering using normalized expression values (FPKM) of all protein-coding genes as input. However, although samples derived from the same experiments clustered together, accurate clustering of cells of the same origin was not achieved with this approach. For example, samples from the pancreatic cancer-derived cell line PANC-1 (samples X76 - X81) were distributed in very distinct sub-clusters. Subdividing the dendrogram into the 22 top clusters revealed further discrepancies in the cluster identities (Fig. S2). This indicated that, without further feature-selection procedures, the binary representation of sequence variants is a more robust measure for clustering samples according to cell line identity.

Since the number of detected variants depends on sequencing depth, we investigated how many sequencing reads are sufficient to retain optimal clustering results. Thus, we randomly sub-sampled the reads to obtain pre-defined numbers, performed hierarchical clustering as described above and divided the resulting dendrograms into 22 clusters and examined (a) whether these clusters contained exclusively samples from one cell line and (b) whether all samples belonging to the same cell line were included in exactly one cluster. As expected, the number of detected variants decreased with reduced sequencing depth (Fig. S3A). However, these investigations revealed that as few as 500,000 reads per sample were sufficient to optimally cluster the samples according to the cell line model (Fig. S3B). Thus, since typical RNA-seq samples, e.g., created for assessing differential gene expression, usually comprise more than 10 million reads or read-pairs, sequencing depth seems not to be a limitation for reliable cell type-specific clustering in a common setup.

To test next whether the described sequence variation-based clustering approach is capable of discriminating cell line identities from microRNA (miRNA) expression profiles as well, we assembled another data set consisting of 18 small RNA-seq samples representing six distinct cell lines. Sequencing read numbers mapped to the human reference genome ranged from 11.4 million reads to 30.1 million reads per sample.

Sequence variations found in these samples were significantly fewer in number compared to the poly-A enriched samples, ranging between 151 (MB-1) and 5194 (A-549). Yet, these low numbers were sufficient to reliably cluster the samples according to their cell line identities (Fig. 2A,B).Fig. 2. Clustering of cell lines by small RNA-seq-derived sequence variations. A Numbers of sequence variations found in the whole sample and miRNA loci. B Hierarchical clustering using all detected, non sample-exclusive, variations. C Hierarchical clustering using only non sample-exclusive variations detected in miRNA loci.Fig. 2

Small RNA-seq libraries are usually generated by selecting short PCR products and, thus, contain cDNAs of various short RNA types, including miRNAs. Accordingly, the majority of variants detected in our small RNA-seq samples were found in loci not covering miRNA transcripts. However, restricting detected variants to those located in known miRNA loci, according to miRBase v22 [24], reduced the number of variants per sample to fewer than 20, leading to sub-optimal clustering results (Fig. 2C). Thus, although our results demonstrate the feasibility of differentiating cell lines by utilizing sequence variations obtained from small RNA-seq samples, variation amounts found inside miRNAs loci alone were not sufficient to unambiguously separate the samples by cell line identity.

Nearest-neighbor classification leads to reliable prediction results

3.2

Since hierarchical clustering based on sequence variations can be applied to distinguish samples from different cell line models reliably, we hypothesized that sequence variation information could be used to accurately predict cell line identities from unknown sources as well. To test this hypothesis, we implemented a kNN trained on a new dataset (in-house_polyA_) containing exactly three samples per cell line. Thus, the set contained 66 samples from 22 cell lines. Analogous to the hierarchical clustering, the presence of distinct sequence variations was used as feature vector. Sample-exclusive variants were not considered, in order to reduce the feature-space.

For assessing the performance of the classifier, we conducted a balanced three-fold cross validation. I.e., in each of three iterations, we used two thirds of the samples from our dataset for training and the remaining third for testing the kNN (k = 2), such that training and test sets always contained the same number of samples from each cell line and each sample was used exactly once for testing. Precisely, the training sets contained two, and the test sets contained one sample per cell line. In each iteration of the cross validation, all cell line identities of the test set were predicted correctly, yielding an overall prediction accuracy of 100 %.

To test generalization capability of our kNN on completely unrelated samples, we assembled validation sets from publicly available alignment files (.bam) obtained from the ENCODE project [21]. The first validation set, ENCODE_polyA_, contained 14 samples, generated from poly A-enriched libraries. It was composed of 10 samples, two each from five distinct wild type cell lines (A-549, PANC-1, HCT-116, MCF-7, Hep-G2) as well as another four Hep-G2 samples, two transfected with shRNAs against SRSF1 and two transfected with shRNAs against DDX55 (cf. Table S4). The other validation set, ENCODE_total_, contained five samples from five distinct cell lines (A-375, Hep-G2, HCT-116, MCF-7, PANC-1, cf. Table S5), generated from total RNA libraries. Variant calling results of the samples from the validation sets were used as input for a kNN trained on the whole in-house_polyA_ dataset (66 samples), setting the number of the nearest neighbors parameter k to three. All samples of the two validation sets were predicted as annotated (Fig. 3).Fig. 3kNN classification results. A Prediction probabilities of samples from set ENCODE_polyA_, containing poly A-enriched RNA-seq samples. B Prediction probabilities of samples from set ENCODE_total_, containing total RNA-seq samples.Fig. 3

Variant comparisons are suitable for universally identifying human cell lines and detecting cell line cross-contaminations

3.3

In the previous section, we demonstrated the feasibility of reliably determining the identities of cells from a small portfolio of in-house sequenced cell lines. However, in order to develop a method for the general determination of cell line identity from an unknown or uncertain sample, a more comprehensive resource was desirable. For this purpose we took advantage of sequence variant information provided by the Cancer Cell Line Encyclopedia (CCLE) project [6]. The CCLE data set, obtained via the ExperimentHub [20] repository, comprised SNP data of 1771 distinct human cell lines. In contrast to our previously described data sets, the CCLE data set lacked replicate data, i.e., there was only one feature vector per cell line available. Furthermore, the number of reported SNPs per individual cell line differed tremendously ranging from 16 to 30,821 (Fig. S4). These properties of the CCLE data set rendered it unsuitable for the previously proposed nearest-neighbor classification approach. Thus, to leverage the CCLE data set for cell line identification, we implemented a rather simple method comparing SNPs obtained from RNA-seq samples with those provided by the CCLE data set. Prediction results are derived by sorting the comparison results by the relative abundance of SNPs found in a certain cell line. I.e, the cell line for which the highest percentage of SNPs matches is presented as the top candidate for cell line identity. Fig. 4 shows a scheme of the method we called topFracCCLE.Fig. 4. Scheme for predicting cell line identities using CCLE-derived SNP data. Image created with BioRender.Fig. 4

To assess prediction accuracy, we predicted cell line identities of our in-house samples as well as the aforementioned publicly available datasets ENCODE_polyA_ and ENCODE_total_. However, since four of the cell lines from the set in-house_polyA_ were not contained in the CCLE dataset (HEK-293-T, C-643, MV3 and SK-N-BE(2)-C), we excluded those lines and created the set in-house2_polyA_ comprised of triplicate samples from 18 distinct cell lines, totaling in 54 samples. Since variant calling of the samples from datasets in-house2_polyA_, ENCODE_polyA_ and ENCODE_total_ was uniformly performed (cf. Section 2.4), we further added a dataset for validation, consisting of already processed variant calling results of glioblastoma-derived U87MG cells [25] (U87).

We compared our results with cell line predictions obtained from CCLHunter v4.0, a Python-based tool for cell line prediction using alignment files [22]. Like topFracCCLE, CCLHunter uses a CCLE-derived data basis, however, in its current version it is limited to 1389 distinct human cell lines.

topFracCCLE: For the majority of samples from all four validation sets, the cell line showing the highest fraction of SNP overlap (1st hit) was the expected cell line. Seemingly incorrect predictions were related to only two cell lines, A-375 and MCF-7. However, in all these cases, the reported first hit was a derivative of or was reported to be contaminated with the query cell line, according to CCLE recordings. I.e., the three cell lines matching best the samples of the melanoma-derived cell line A-375, regardless of which validation set, were A375 SKIN CJ1, A375 SKIN CJ2 and A375 SKIN CJ3, followed by the expected cell line (A-375) as the fourth best hit. These cell lines were derived from A-375 cells by selection for drug resistance [26], [27]. All samples of the breast cancer derived cell line MCF-7 were predicted as cell line KPL-1. According to Expasy Cellosaurus [27], this cell line is actually an MCF-7 derivative. Thus, if the contaminated/derived cell lines are considered correct as well, topFracCCLE correctly predicted all samples from the four sets. Fractions of SNPs found in correct/derived cell lines ranged from nine to 48 percent and were considerably lower in other, non-derived cell lines (Table S6-S8).

CCLHunter: Predictions using default parameters either yielded correct results or the program terminated without producing result files. Notably, inspecting the error messages revealed that in most cases, when the program stopped, CCLHunter identified either the correct cell line or a derivative as the most likely result. Thus, although in principle CCLHunter is capable of dealing with closely related cell lines, it seems that the current implementation leads to termination, when two or more cell lines from the reference are too similar to the query sample. In the remaining cases when CCLHunter did not deliver a result cell line, error messages revealed, that the program could not identify any matching cell line. Table 1 and tables S6-S8 list the prediction results of topFracCCLE and CCLHunter on all validation data sets.Table 1. Number of correctly predicted samples from distinct data sets.Table 1. Data settopFracCCLE exactincl. derivativesCCLHunter exactincl. derivativesin-house2_polyA_48/5454/5441/5441/54ENCODE_polyA_12/1414/144/144/14ENCODE_total_3/55/52/52/5U876/66/66/66/6

Since the topFracCCLE approach is based on comparing SNPs found in a query sample with those reported for distinct cell lines, we hypothesized its applicability in detecting cross-contaminations of human cell lines. More precisely, if cells from two or more cell lines are present in a given sample, RNA sequencing data are expected to contain SNP information matching a substantial fraction of SNPs reported for the respective parental cell lines.

We tested this hypothesis by simulating contamination in silico and in vitro. For in silico contamination experiments, sequencing reads from wild type samples of lung adenocarcinoma-derived H-1975 cells and neuroblastoma-derived KELLY cells were sub-sampled and mixed at defined ratios. For in vitro simulation of cell line contamination, we mixed cells from the pancreatic adenocarcinoma (PDAC)-derived cell lines SW1990 and Panc-03.27 at defined ratios. Triplicate samples from each mixture ratio were sequenced and variants were detected. For comparison, DNA from the same samples was sent for STR profiling by Eurofins Genomics. Additionally, sequencing reads from pure SW1990 and Panc-03.27 cells were mixed in silico for comparisons. Regardless, of whether cells were mixed genuinely or in silico, the numbers of detected SNPs belonging to a certain cell line increased with the real/expected fraction of these cells in the respective samples (Fig. 5A,C, Fig. S5A). Whereas in pure KELLY, SW1990 and Panc-03.27 samples approximately 40 % of reported SNPs were detected, this proportion was approximately 30 % in H-1975 cells. Samples containing a majority of cells/reads from one line were assigned this line as first hit by topFracCCLE. For the H-1975/KELLY samples, 1:1 mixture of reads was consistently assigned KELLY as first hit and H-1975 as second hit. This suggests, that a higher SNP fraction detected in non-contaminated cell populations shifts prediction results towards the line with more SNPs detected when mixed at equal ratios. Accordingly, in vitro and in silico 1:1 mixtures of SW1990 and Panc-03.27 cells resulted in either the first or the second line being predicted as first hit. In the in silico mixture of H-1975/KELLY cells, the lower concentrated cell line was obtained as second hit down to a fraction of 10 % of this line. In the SW1990/Panc-03.27 mixtures, the lower concentrated cell line was reliably found as second hit down to a fraction of 25 %, in both, in silico and in vitro simulations (Fig. 5B,D, Fig. S5B).Fig. 5topFracCCLE predictions on in silico and in vitro contamination simulations. A Proportion of detected SNPs associated with H-1975 or KELLY cells in dependence of the proportion of sequencing reads originating from samples of the respective cell line. B First, second and third hit cell line reported by topFracCCLE for the respective mixture samples. Numbers in the sample labels refer to the percentage of reads from pure H-1975 samples. C Proportion of detected SNPs associated with SW1990 or Panc-03.27 cells in dependence of the proportion of cells from the respective line in the samples. D First, second and third hit cell line reported by topFracCCLE for the respective mixture samples. Numbers in the sample labels refer to the percentage of SW1990 cells contained in the samples.Fig. 5

To compare the results of the topFracCCLE method with the gold standard, we sent the in vitro mixed samples for STR profiling to Eurofins Genomics. Using non-empty markers as scoring mode and a minimum of 14 shared markers, the non-mixed-up samples showed 100 % similarity with SW1990 (samples CL00020652-654) and Panc-03.27 (samples CL00020655-657) cell lines, respectively. The second best hits of these samples had similarities below 60 % (Table S11). In contrast, each mix-up sample (CL00020658-666) showed highest similarity with Panc-03.27 (89.3 %) and the second highest with SW1990 (75 %) cells (Table S11). These numbers align well with conclusions of the ANSI-ATCC ASN-0002 standard, suggesting that a similarity score below 70 % is indicative of unrelated donors and similarity scores between 70 % and 79 % are obtained either due to significant genetic drift or a mixture of two or more cell lines [1]. Furthermore, since more than two alleles at a locus are rarely seen in normal human DNA profiles [1], multiple peaks at one locus indicate mixed populations of cells. With respect to a consensus profile, the six non-mixed samples (CL00020652-657) contained two alleles at each locus, which is in line with 100 % similarity of these samples to the SW1990 or Panc-03.27 cell lines, respectively. More importantly, all nine mix-up samples (CL00020658-666) contained an additional allele at five of 16 loci as well as two additional alleles at one of these loci (Table S10) indicating cross-contamination in these samples.

In summary, these results suggest that the proposed topFracCCLE method is suitable for detecting cross-contamination between two human cell lines, provided that at least 25 % of cells belong to the less concentrated cell line. Based on our findings, we suggest considering detected SNP fractions above 5 % as a strong indicator of putative presence of a certain cell line model in a given sample.

Discussion

4

Mis-classified and cross-contaminated cell lines lead to incorrect conclusions, prevent reproducibility and unnecessarily waste resources. Cell line authentication is therefore crucial to ensure integrity of scientific results.

In the present study, we demonstrate the applicability of RNA-seq-derived sequence variations for robust cell line identification and detection of cross-contamination. We demonstrate that basic machine learning algorithms are sufficient to reliably discriminate between a multitude of distinct cell line models.

Hierarchical clustering based on the presence of detected sequence variations yielded cell type specific clusters. Furthermore, samples containing the same cell line models but sequenced for different studies were also distinguishable with this approach. This indicates that lineage divergence, occurring through passaging, could be monitored by this approach. Furthermore, although not directly related to cell line identification, it suggests the possibility of other applications of RNA-seq derived sequence variations, e.g., longitudinal tracking of genomic changes caused by reagents or factors influencing genomic stability or DNA damage repair mechanisms. Clustering results were revealed to be insensitive to different pre-processing steps and varying sequencing depths. Notably, we further demonstrated the possibility of discriminating cell lines utilizing sequence variations derived from small RNA-seq data, despite far fewer detected variants in these samples.

Using a kNN approach we were able to predict cell line identities from a panel of 22 distinct cell lines, using RNA-seq samples either gathered in-house or from public sources as input. The method can be easily adapted to other cell line panels derived from arbitrary organisms. However, the applied kNN algorithm is rather resource-intensive when applied to larger training data. Machine learning algorithms that generate class representatives instead of using the complete training data for prediction, e.g., k-means or support vector machines, should be appropriate alternatives for reducing runtime and storage demands. Furthermore, the low number of samples per cell line (n = 3) used in this demonstrator kNN needs to be increased to improve generalization capacity of the classifier.

Our topFracCCLE approach reports for a given set of sequence variations the cell line that best matches these variants from a repertoire of more than 1700 human cell line models. We observed that for few of the tested cell lines the best database hit was a cell line model derived from or contaminated with the actual cell line. However, in these cases the fraction of detected sequence variations associated with the correct cell line was always close to the numbers determined for the derivatives/contaminated cell lines. This finding underlines the need for proper cell line identification procedures, since former mis-classifications led to incorrect information content, e.g., in public databases.

Furthermore, we successfully applied this algorithm to detect simulated cross-contaminations of human cell lines in silico and in vitro. Contamination of at least 25 % with another human cell line resulted in both cell lines appearing as the two best hits for the tested cell lines. This seems to be less sensitive than STR profiling, since with this DNA-based approach, all tested contamination ratios yielded the same similarity scores above 70 % for both contained cell lines, proving contamination even at lower levels. However, our method provides distinct advantages compared to STR profiling. Since our method is applied after sequencing, results refer to the actually processed samples. Furthermore, testing cell line identities using topFracCCLE is not associated with additional costs, once RNA-seq has been performed. And finally, the proportion of detected SNPs is correlated with the proportion of a particular cell line’s presence. Consequently, although the actual fraction of variants detected by topFracCCLE varies with each cell line and the relationship between cell line abundance and detected SNP proportion seems to be non-linear, estimations about the actual proportions of the present cell lines appear possible. Such quantitative estimations regarding contamination extent cannot be derived from STR profiling. Since variant detection rates correlate with read numbers per sample, increasing sequencing depth might improve sensitivity of topFracCCLE for detection of cross-contamination.

Our described variant calling approach yields several thousand SNPs per sample. The median number of SNPs per cell line reported in the utilized CCLE dataset was 338. Although we did not observe this problem with the cell lines we tested, we expect that cell lines associated with a very limited set of reported variations might be hard to classify correctly. For example, cell line OUMS-27 is associated with only 16 SNPs in the CCLE dataset. However, PANC-1 cells, associated with 35 SNPs, have been correctly identified in all of our test sets. Nonetheless, in our practical implementation of topFracCCLE, we included an adjustable parameter (minSNPs; default value = 10) for restricting the considered cell lines to those having at least the defined number of SNPs reported in the CCLE dataset. Furthermore, the design of the R-script allows an uncomplicated exchange of the reference data set, thus, alternative data sets that alleviate the problem with low SNPs numbers can be incorporated as reference in future versions.

Although we demonstrated, that RNA-seq-derived sequence variations are sufficient for robust cell line identification, despite being fewer in number than those derived from WGS, a potential bias arises specifically when using RNA as source material. Certain post-transcriptional RNA modifications, such as APOBEC-mediated C-to-U editing, ADAR-mediated A-to-I or non-templated isoMiR variants might interfere with proper variant calling, since without being resolved, such modifications can lead to incorrect conclusions about genuine mutations. Thus, further investigations need to be performed, to elucidate the potential conflicts post-transcriptional RNA modifications can pose with regard to cell line identification.

In summary, our study highlights the potential of RNA-seq derived sequence variations for cell line identification and detection of cross-contamination. We present practical implementations for these tasks yielding reliable results for a multitude of different cell line models.

CRediT authorship contribution statement

Lisa Müller: Writing – review & editing, Writing – original draft, Investigation, Conceptualization. Simon Müller: Writing – review & editing, Investigation. Khursheed Ul Islam Mir: Writing – review & editing, Investigation. Jana Lange: Writing – review & editing, Investigation. Sven Hagemann: Writing – review & editing, Investigation. Alice Wedler: Writing – review & editing, Investigation. Frank Hause: Writing – review & editing, Investigation. Claudia Misiak: Writing – review & editing, Investigation. Danny Misiak: Writing – review & editing, Formal analysis. Tony Gutschner: Writing – review & editing, Supervision. Stefan Hüttelmaier: Writing – review & editing, Supervision. Markus Glaß: Writing – review & editing, Writing – original draft, Software, Formal analysis, Data curation, Conceptualization.

Funding

This work was conducted within the Research Training Group InCuPanc (RTG2751) funded by 10.13039/501100001659Deutsche Forschungsgemeinschaft (DFG), grant number 449501615.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Almeida J.L.Korch C.T.Authentication of human and mouse cell lines by short tandem repeat (str) dna genotype analysis Assay guidance manual 2023 Eli Lilly & Company and the National Center for Advancing Translational Sciences Bethesda (MD)23805434 · pubmed ↗
2Hughes P.Marshall D.Reid Y.Parkes H.Gelber C.The costs of using unauthenticated, over-passaged cell lines: how much more data do we need?Biotechniques 4352007 Nov 575577–8, 581–2 passim 1807258610.2144/000112598 · doi ↗ · pubmed ↗
3Barallon R.Bauer S.R.Butler J.Capes-Davis A.Dirks W.G.Elmore E.Furtado M.Kline M.C.Kohara A.Los G.V.Mac Leod R.A.Masters J.R.Nardone M.Nardone R.M.Nims R.W.Price P.J.Reid Y.A.Shewale J.Sykes G.Steuer A.F.Storts D.R.Thomson J.Taraporewala Z.Alston-Roberts C.Kerrigan L.Recommendation of short tandem repeat profiling for authenticating human cell lines, stem cells, and tissues In Vitro Cell Dev Biol Anim 4692010 Oct 7277322061419710.1007/s 11626-010-9333-z PMC 2965362 · doi ↗ · pubmed ↗
4Fasterius E.Raso C.Kennedy S.Rauch N.Lundin P.Kolch W.Uhlén M.Al-Khalili Szigyarto C.A novel rna sequencing data analysis method for cell line authentication P Lo S One 1222017 Febe 017143510.1371/journal.pone.0171435 PMC 530527728192450 · doi ↗ · pubmed ↗
5Mohammad T.A.Tsai Y.S.Ameer S.Chen H.-I.H.Chiu Y.-C.Chen Y.Ce L-ID: cell line identification using RNA-seq data BMC Genomics 202019 Feb 813071251110.1186/s 12864-018-5371-9PMC 6360649 · doi ↗ · pubmed ↗
6Barretina J.Caponigro G.Stransky N.Venkatesan K.Margolin A.A.Kim S.Wilson C.J.r J.Kryukov G.V.Sonkin D.Reddy A.Liu M.Murray L.Berger M.F.Monahan J.E.Morais P.Meltzer J.Korejwa A.Valbuena J.Mapa F.A.Thibault J.Bric-Furlong E.Raman P.Shipway A.Engels I.H.Cheng J.Yu G.K.Yu J.Aspesi P.de Silva M.Jagtap K.Jones M.D.Wang L.Hatton C.Palescandolo E.Gupta S.Mahan S.Sougnez C.Onofrio R.C.Liefeld T.Mac Conaill L.Winckler W.Reich M.Li N.Mesirov J.P.Gabriel S.B.Getz G.Ardlie K.Chan V.Myer V.E.Weber B.L.Porter J.Warmuth M.Finan P.Harris J.L.Meyerson M.Golub T.R.Morr · doi ↗ · pubmed ↗
7Dirks W.G.Mac Leod R.A.F.Nakamura Y.Kohara A.Reid Y.Milch H.Drexler H.G.Mizusawa H.Cell line cross-contamination initiative: an interactive reference database of str profiles covering common cancer cell lines Int J Cancer 12612010 Jan 3033041985991310.1002/ijc.24999 · doi ↗ · pubmed ↗
8Capes-Davis A.Reid Y.A.Kline M.C.Storts D.R.Strauss E.Dirks W.G.Drexler H.G.Mac Leod R.A.F.Sykes G.Kohara A.Nakamura Y.Elmore E.Nims R.W.Alston-Roberts C.Barallon R.Los G.V.Nardone R.M.Price P.J.Steuer A.Thomson J.Masters J.R.W.Kerrigan L.Match criteria for human cell line authentication: where do we draw the line?Int J Cancer 132112013 Jun 251025192313603810.1002/ijc.27931 · doi ↗ · pubmed ↗