SigRescueR: a pan-system framework for noise correction and mutational signature identification across sequencing platforms
Peter T Nguyen, Maria Zhivagui

TL;DR
SigRescueR is a new computational tool that improves accuracy in identifying cancer-related mutational signatures by reducing noise from sequencing data.
Contribution
Introduces SigRescueR, a Bayesian framework for noise correction and mutational signature identification across sequencing platforms.
Findings
SigRescueR reliably identifies mutational signatures from environmental mutagens and chemotherapeutic agents.
The framework operates across diverse mutation classes and integrates strand bias and duplex sequencing data.
It enables precise mapping of mutagenic processes and identification of genomic biomarkers for exposures.
Abstract
Mutational signatures serve as molecular fingerprints of the biological processes and exposures that shape cancer genomes. However, accurate signal recovery remains challenging due to pervasive background variants, sequencing artifacts, technical noise, and platform-specific biases that obscure true mutagenic patterns, hampering biomarker discovery, and mechanistic interpretation. Here we introduce SigRescueR, a rigorous, pan-system, computational framework based on Bayesian inference designed for noise correction and mutational signature identification. SigRescueR applies statistically robust baseline correction to effectively disentangle true mutational signals from confounding noise and artifacts. When applied to extensive datasets spanning experimental models and human cancers, SigRescueR reliably identified canonical mutational signatures associated with environmental mutagens…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7- —University of Nevada Las Vegas and partially by the Centers of Biomedical Research Excellence (COBRE) Phase I
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Associations and Epidemiology · Genomics and Phylogenetic Studies · Genomics and Rare Diseases
Introduction
Cancer is fundamentally driven by somatic mutations arising as the result of Darwinian evolution where a single cell evades growth-control mechanisms and begins to divide uncontrollably [1–4]. These somatic mutations result from mutational processes that are characterized by an interplay of DNA damage, repair, and replication [5–9]. The origins of these mutational processes are diverse, encompassing both exogenous factors, such as environmental carcinogens, and endogenous sources including cellular aging and genomic instability [10–14]. Advances in analyzing mutational patterns in cancer genomes have revealed the sources of somatic mutations and identified key risk factors [5, 10, 15–20]. These signatures are systematically cataloged in the COSMIC database, serving as a key resource for cancer genome studies [10]. Experimental studies, both in vivo and in vitro, have validated several signatures observed in cancer patients, and uncovered novel mutagenic exposures and mechanisms [13, 14, 21–27]. However, a persistent challenge is that these mutational signatures often contain substantial noise arising from endogenous processes [7, 28–30], cell culture artifacts [31], and sequencing errors [32]. This confounding background noise introduces variability between samples, distorting the true biological mutational signatures. Moreover, extensive analysis of 23 829 samples (19 184 whole-exome and 4645 whole-genome sequencing) have identified 19 single-base substitutions (SBS) artifactual signatures [10], including batch effects (SBS95) [33], germline variant associations (SBS54) [10], and sequencing artifacts (SBS45) [10, 32]. Therefore, effective removal of baseline and artifactual mutations is essential not only for experimental models but also for tumor samples, to reveal subtle genomic patterns and uncover hidden insights into cancer biology.
To address this challenge, two strategies have been employed to remove baseline mutational processes. The first relies on straightforward exposure-baseline subtraction, which assumes that mutations absent in the baseline are exposure-associated. However, this method falters when the baseline mutation burden exceeds the exposure, producing negative values that must be replaced by zeros [25, 34]. The second strategy leverages non-negative matrix factorization (NMF) [35] to decompose mutation matrices and separate control from exposure-related signatures [36]. Yet NMF’s performance depends heavily on sample size [37], rendering it impractical with limited replicates [38]. Another approach that can be considered is the non-negative least squares (NNLS) algorithm [39]. NNLS can assign background activity which can then be subtracted from exposed samples. This method is particularly well-suited for analyses of individual samples. Nevertheless, NNLS’s propensity to overfit can lead to negative values [40]. Collectively, these limitations highlight the urgent need for a burden-insensitive approach that can effectively remove baseline effects without producing negative attributions, even when the background profile dominates or closely resembles the treatment profile.
We developed SigRescueR, a robust R package that applies Bayesian inference to remove baseline mutational patterns from exposure-derived mutational profiles. We demonstrate that SigRescueR preserves the integrity of the original data, enabling near-perfect reconstruction of mutational profiles, including accurate retention of mutational burden. By leveraging COSMIC signatures as a reference standard, SigRescueR significantly improved similarity of the inferred treatment signatures to known ground truths. We demonstrate that SigRescueR supports a wide range of SBS classifications, including SBS96 and SBS288 channels, demonstrating impressive versatility. In rigorous comparisons, SigRescueR outperformed existing tools while uniquely providing credible intervals that quantify uncertainty, a crucial advantage over methods that offer only point estimates. SigRescueR adapts seamlessly to diverse models, species, patients’ data, and sequencing platforms, including cutting-edge duplex DNA sequencing, making it an indispensable asset for mutational signature analysis and toxicology studies.
Materials and methods
Overview of SigRescueR
SigRescueR was developed to disentangle baseline mutational patterns from those induced by exposure or treatment (Fig. 1). The algorithm employs Bayesian inference to jointly model both the reconstructed mutational profile and the total mutation count at the mutational context level. Specifically, SigRescueR utilizes count-based observed mutational spectra (“Observed”) and a proportion-based background mutational signature (s_b_). The reconstructed mutational profile (“Reconstructed”) is represented as a combination of the background signature s_b_ and latent exposure signature (s_e_), each weighted by their respective activity parameters (θ_b_, θ_e_), and scaled by the observed mutation burden to recover mutation counts.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$Reconstructed= Mutation\ Burden\ast \left({\mathrm{\theta}}_b\ast{s}_b+{\theta}_e\ast{s}_e\right)$$\end{document}Overview of SigRescueR. Using the observed (exposed sample) and background (unexposed sample or provided) mutational profiles, SigRescueR applies Bayesian inference with multiple iterations and chains to subtract the background profile (gray bars) from the observed profile (a mixture of exposure- and endogenous-associated mutations). This process rescues the exposure-associated signal (black bars). The method utilizes count data from the exposed sample and the normalized background profile to estimate both background and exposure activities while inferring the exposure-associated profile. By exploring the parameter space, SigRescueR generates a full posterior distribution capturing uncertainty inherent in the data.
SigRescueR incorporates an internally defined Gamma prior that assumes θ_b_ is higher than θ_e_. This prior enforces positivity and models a skewed distribution centered around 1 for θ_b_ and 0.2 for θ_e_. This design prevents the exposure signature from dominating the reconstructed profile by absorbing the majority of the signal. Importantly, the Bayesian prior is automatically updated by the observed data to yield posterior estimates of exposures that are data-driven rather than prior-driven.
The observed mutation spectra proportions were modeled using a Dirichlet distribution parameterized by the reconstructed mutation spectra proportions.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\rho Observed\sim Dirichlet\left(\rho Reconstructed\right)$$\end{document}The observed counts were modeled using a Conway–Maxwell Poisson (COM-Poisson) distribution [41] parameterized by the reconstructed mutation counts. The dispersion parameter (V) of the COM-Poisson was constrained with a Normal prior centered at 2, with a lower bound of 1.1 to capture underdispersion and to penalize large deviations from the observed counts.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$Observed\sim COM- Poisson\left( Reconstructed,\upsilon \right)$$\end{document}Implementation of the COM-Poisson requires computation of a normalizing constant Z for all mutation contexts to ensure probabilities sum up to one. Since the normalizing constant Z is an infinite sum that lacks a closed form expression, it cannot be directly computed. To address this, we employed an adaptive truncation sum strategy in which the upper bound equals the sum of observed mutation count and six times the square root of the observed mutation count, followed by additional padding of 10, reducing computational time.
To encourage accurate reconstruction, parameter configurations yielding a cosine similarity greater than 0.95 between the reconstructed and observed profiles were positively rewarded.
To accommodate variability among samples with the same exposure, a weighted mutational profile was generated where each sample (i) contributed proportionally to its total mutation burden (TMB). The weighted profile was computed as the sum of each sample’s mutational profile (mi) multiplied by its corresponding weight (wi).
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${w}_i=\frac{TMB_i}{\sum_{i=1}^n{TMB}_i}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$Weighted\ Profile=\sum_{i=1}^n{w}_i\ast{m}_i$$\end{document}SigRescueR implementation
SigRescueR is implemented in R [42] and applies Bayesian inference using Rstan [43], requiring only two input data frames: an exposure and background mutation profile. No additional user-specified parameters or priors are needed. SigRescueR consists of three main functions that are executed in sequential order: (i) SigRescueSetup for configuring model settings, (ii) SigRescueRun for performing Bayesian inference, and (iii) SigRescueAnalyze for summarizing the posterior estimates. Bayesian inference is configured to run four Markov Chain Monte Carlo (MCMC) chains with 1000 warm up iterations followed by 1500 sampling iterations for each chain, which can be reconfigured within the SigRescueSetup function to balance estimation precision and computational runtime. MCMC chains are configured to run in parallel with each chain assigned to a single core. To reduce runtime, SigRescueR supports parallelizing computation of up to 20 samples implemented in parallel [42]. The SigRescueAnalyze function processes the output of SigRescueRun function by extracting the posterior estimates of θ_b_ and θ_e_ using their median value and the lower bound of the 95% credible interval (2.5%) of s_e_ as representative estimates. The exposure-associated profile is generated by multiplying s_e_ (normalized, denoised exposure profile) by θ_e_ (exposure activity) and mutation burden. Finally, SigRescueAnalyze computes the consensus rescued profile, a weighted average of rescued profiles across samples, to represent the mutational signature of interest.
Evaluation of reconstruction
To evaluate reconstruction accuracy, we benchmarked the reconstructed against the original spectra using count accuracy, Jensen–Shannon divergence (JSD) [44], and cosine similarity [45]. To assess the recovery of true biological signals, similarity between the exposure signature and COSMIC signatures was performed. JSD quantifies differences in relative proportions, where values near 0 denote high similarity.
Benchmarking cleaning performance
To assess SigRescueR’s performance, we provided identical input for direct comparison against simple subtraction [25, 34], NNLS [46], ExpSigfinder [47], and SparseSignatures [48] (Supplementary Table 1). The simple subtraction method subtracts the baseline mutational spectra from the exposed mutational spectra and clips negative values resulting from the subtraction to zero, representing the most basic form of cleaning. We used NNLS to estimate the baseline activity given the baseline and exposure profile. In this approach, NNLS will compute the baseline activity under non-negative constraints that best fit the exposure profile. Mutations not explained by the background are considered the residual, which effectively represents the exposure-associated profile. ExpSigfinder applies bootstrapping to estimate baseline variability and iteratively removes baseline contributions from the exposure profile under non-negative constraints. SparseSignatures employs an NMF framework, incorporating a “LASSO” penalty, for mutational signature discovery based on a supplied fixed background profile to separate a ubiquitous process from observed matrices. In this approach, the background signature is provided as a fixed parameter, enabling separation of the background from the exposure-associated profile. Count accuracy, reconstruction cosine similarity, and COSMIC cosine similarity were used as metrics to evaluate performance. Count accuracy was measured as the difference in mutation counts where values close to 0 indicate minimal deviation from the original mutation count. Reconstruction cosine similarity was computed between the reconstructed and original mutational spectra. COSMIC cosine similarity was computed between the inferred exposure mutational signature and COSMIC signature.
Synthetic injection of single-base substitutions signatures
To perform the synthetic injection of artifactual signatures COSMIC SBS45 and SBS54, we randomly selected two independent groups: 100 SBS45-negative samples and a non-overlapping set of 100 SBS54-negative samples, as defined by SigProfilerAssignment [49]. For each sample, the level of synthetic injection was randomly assigned, ranging between 5% and 95%, representing the fraction of the total mutational profile composed of synthetic SBS45 or SBS54 mutations. To determine the number of synthetic mutations to inject, the mutation burden was multiplied by the ratio of synthetic to original injection level (synthetic injection / 1 – synthetic injection). This value was distributed according to the normalized COSMIC signature for SBS45/SBS54 and then added to the existing mutation in the original profile.
After injection, we evaluated SigRescueR’s ability to identify injections while preserving the original spectra as measured by the precision (accuracy of true positives) and sensitivity (detection of true positives). We also computed the F1 score to assess the balance between precision and sensitivity.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$Precision=\frac{TP}{TP+ FP}$$\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$Sensitivity=\frac{TP}{TP+ FN}$$\end{document}where TP denotes correctly identified injections, FP denotes misclassified injections, and FN denotes missed injections.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$F1=2\ast \frac{Precision\ast Sensitivity}{Precision+ Sensitivity}$$\end{document}Statistics and reproducibility
All pairwise statistical comparisons were performed using Mann–Whitney U test [50]. Overall statistical differences between groups were performed using Anova [51]. JSD and cosine similarity were computed to quantify the similarity. All SBS attributions were performed using SigProfilerAssignment [49].
Results
Baseline correction increases similarity to reference
SigRescueR implements statistically rigorous baseline correction to separate true mutational signatures from background noise and technical artifacts, significantly improving the detection of biologically meaningful mutational processes (Fig. 1). To evaluate how well SigRescueR removes baseline mutations, we analyzed diverse experimental models, including induced pluripotent stem cells (iPSCs) [47], immortalized N/TERT keratinocytes (NTERT1) [52], human intestinal organoids (HIO) [25], mouse embryonic fibroblasts (MEFs) [21], and mouse intestinal organoid [53] (Supplementary Table 2). These models represent various exposures with known COSMIC signatures (Supplementary Table 3). The samples exhibited highly variable mutational burdens, ranging from 369 to 4427 mutations in human models (n = 26) and from 595 to 24 947 mutations in mouse models (n = 23). For each model, samples treated with non-mutagenic conditions were used as backgrounds. We, first, used colibactin-treated HIO to extract the associated mutational signature and reconstruct the original profile (Fig. 2a). The reconstructed profile demonstrated close agreement with the observed profile with a cosine similarity of 0.963. The posterior mean activities were 0.474 for exposure and 0.526 for background signatures, supported by convergence across chains (Supplementary Fig. 1A–C).
Background cleaning of an HIO sample exposed to colibactin. (a) SBS-96 mutational signature plot showing the uncleaned (Original), control (Background), inferred treatment (Exposure), and reconstructed profiles. Alpha (⍺) transparency indicates background (⍺ = 0.5) and exposure (⍺ = 1) levels in the reconstructed profile. (b) Barplot depicting relative background activity estimated using multiple backgrounds (Ind.) and a single background (Mean), with fill color denoting individual replicates in the Ind. column. (c) Barplot showing cosine similarity to the reference signature (SBS88) before (UC = uncleaned) and after baseline correction (C = cleaned). (d) Weighted SBS-96 mutational signature plot comparing the inferred exposure (Cleaned) with the COSMIC reference (SBS88). The total mutation count in the cleaned profile is displayed within the plot. Cosine similarity scores reflect similarity between the cleaned signature and SBS88. (e) Posterior distributions of exposure (red-filled densities: e) and background activity (gray-filled densities: b) across HIO replicates, with each ridge representing a single sample.
Moreover, baseline correction was further evaluated by comparing the use of multiple background profiles to a single composite background profile, generated by a weighted combination of all available background profiles. Results indicated that baseline correction utilizing multiple profiles yielded a higher background activity parameter (θ_b_ = 0.526) compared to correction using a single profile (θ_b_ = 0.402) (Fig. 2b). As expected, this enhancement is likely attributable to the greater ability of multiple backgrounds to capture variability in the observed mutational profiles that a single background may not represent (Supplementary Fig. 2). To assess consistency, SigRescueR was applied to the same sample across 100 independent iterations. The residuals of exposure activity were consistently minimal (residual <0.0011), and the inferred exposure mutational signature exhibited high stability (cosine similarity >0.99) (Supplementary Fig. 1D–E), substantiating the robustness of the method. Additionally, we conducted a Bayesian prior sensitivity analysis across a range of Gamma priors (Gamma(1,10), Gamma(1,5), Gamma(1,2), and Gamma(1,1)), spanning strong to weak priors with decreasing shrinkage and increasing prior mean (Supplementary Fig. 1F). Posterior estimates showed substantial overlap across Bayesian priors (Supplementary Fig. 1G–H), demonstrating that posterior estimates are data-driven rather than prior-driven. Consistent with Supplementary Fig. 1E, the inferred exposure mutational profile remained highly consistent across the Bayesian priors with cosine similarity scores greater than 0.99 (Supplementary Fig. 1I). To evaluate the effectiveness of baseline correction, we computed the cosine similarity between SBS88, a colibactin-induced COSMIC mutational signature [25, 26, 54], and the colibactin-exposure signature inferred by SigRescueR. Baseline correction substantially improved the cosine similarity from 0.78 to 0.96, a 23% enhancement (Fig. 2c). Further, when aggregating the weighted mutational profiles from all colibactin-exposed HIO (n = 8), the resulting profile exhibited a cosine similarity of 0.992 to SBS88, indicating near-perfect concordance (Fig. 2d). Notably, assessment of exposure activity across biological replicates demonstrated pronounced variability, with several replicates showing higher background activity than exposure-related activity, highlighting biological and technical heterogeneity within experimental replicates from a single study (Fig. 2e, Supplementary Fig. 3A).
Baseline correction achieves high reconstruction fidelity
Following the strong concordance observed between the inferred exposure signature and the COSMIC reference, reconstruction accuracy was systematically assessed by comparing mutation counts and profile similarity between the reconstructed and observed mutational spectra. Our analysis revealed a strong linear correlation in mutation counts across samples (Fig. 3a). Furthermore, SigRescueR achieved a near-perfect cosine similarity of 0.99 between observed and reconstructed mutational profiles (Fig. 3a), confirming the high fidelity of the reconstruction. To evaluate reproducibility, we examined concordance of the inferred exposure mutational signatures across models and replicates. Consistent high cosine similarity among replicates within each model reflected robust reproducibility (Fig. 3b). Notably, exposure signatures generated from distinct models subjected to the same treatment, namely NTERT1 and iPSCs exposed to simulated solar radiation (SSR), exhibited strong similarity (cosine similarity = 0.91), underscoring the consistent mutational response across disparate cellular backgrounds. Moreover, consistent with the colibactin-treated samples (Fig. 2e), SigRescueR reliably produces highly stable signatures across replicates despite variability in exposure levels and baseline mutation burden (Fig. 3b, Supplementary Fig. 3A).
Performance evaluation of reconstruction, stability, and fidelity compared to the original and ground truth profiles. (a) Scatter plot comparing TMB between original and reconstructed mutational profiles. Each dot represents a single sample exposure experiment and is colored according to their cosine similarity score between the original and reconstructed profiles. The cosine similarity score is summarized by the mean, with minimum and maximum values indicated in brackets. (b) Heatmap showing pairwise cosine similarity between each model and its replicates. (c) Boxplot summarizing cosine similarity to their COSMIC reference, separated by original (UC = uncleaned) and inferred exposure mutational signature (C = cleaned). Each dot represents a single sample. (d) Dot plot comparing the cosine similarity between the SigRescueR-cleaned profiles to NMF-based de novo signatures. Each dot represents the consensus rescued profile, the weighted average of rescued profiles across samples, colored according to the de novo signature with the highest similarity and sized by Pearson’s correlation. (e) Left panel: SBS-96 mutational signature plot comparing the inferred exposure (Cleaned) to SBS7a for NTERT1 exposed to SSR. The cosine similarity score is indicated between the cleaned signature and SBS7a. Right panel: Donut plot showing the relative contributions of COSMIC mutational signatures attributed to the cleaned signature. Each color represents a distinct COSMIC signature with percentages indicated. (f) ID-83 mutational signature plot comparing cleaned and original (uncleaned) profiles for NTERT1 co-exposed to SSR and arsenic. The total number of mutations in the cleaned profile is shown within the plots.
Notably, in certain instances, exposures may be non-mutagenic, resulting in mutational profiles indistinguishable from background patterns. Consequently, it is critical that SigRescueR attributes all mutations to background, avoiding false positive assignment. To evaluate this capability, we analyzed mouse intestinal organoid samples subjected to a high-fat diet (HFD) versus a standard diet (SD), known to generate highly similar profiles (cosine similarity = 0.993) [53]. Applying SigRescueR to the HFD samples with SD samples as background, we observed that nearly all mutations were correctly attributed to the background (Supplementary Fig. 3B), with an average inferred exposure activity of 1.3%.
Furthermore, a critical question emerged: do these inferred exposure mutational signatures capture biologically meaningful signatures or are they statistical artifacts of Bayesian inference? To address this, we assessed the concordance between the modeled exposure mutational signatures and their COSMIC reference. Compared to the original profiles, we observed a marked increase in cosine similarity to the reference (ANOVA P-value = .002) (Fig. 3c). Notably, the magnitude of improvement and the ability of the inferred exposure mutation signature to approach near-perfect concordance varied between replicates—an expected reflection of inherent sample-to-sample biological variability. In models exposed to SSR and aristolochic acid I (AAI), the original profiles exhibited strong concordance with SBS7a, a UV-associated signature [55], and SBS22a, AAI-attributed signature [21, 56], respectively, with cosine similarity greater than 0.9. Baseline correction further improved similarity. Conversely, two HIO treated with colibactin initially showed low similarity to SBS88 [25, 26, 54], but baseline correction substantially elevated cosine similarity from 0.58 to 0.86 and 0.61 to 0.84 in replicates 4 and 5, corresponding to improvements of 47.8% and 38.5%, respectively (Fig. 3c). Furthermore, to evaluate the accuracy of SigRescueR’s signature identification, we analyzed 26 compound-exposed samples across four treatments (Fig. 3b). Following denoising, we applied NMF using SigProfilerExtractor, which identified four de novo mutational signatures from these cleaned profiles. Comparison of the SigRescueR consensus rescued profiles with the de novo NMF signatures revealed strong concordance, with cosine similarity and correlation values greater than 0.97 (Fig. 3d, Supplementary Fig. 4). These findings demonstrate that SigRescueR can accurately recover mutational signatures that are highly concordant with de novo NMF extractions, while operating on few exposed samples. Collectively, these results demonstrate SigRescueR’s robustness to extract biologically meaningful signatures across diverse experimental contexts, while maintaining insensitivity to activity magnitude.
Next, we analyzed NTERT1 cells exposed to SSR, focusing on similarity to SBS7a [55]. SigRescueR demonstrated reliable baseline correction, with a mean baseline activity of 0.104, confirming its ability to preserve biologically relevant signals (Fig. 3E, left). Given that SSR induces a complex mixture of UV-associated signatures, decomposition of the corrected profile revealed SBS7a, SBS7b, and SBS7c [57] as major contributors, explaining the cosine similarity of 0.92 observed (Fig. 3e, right).
Effective baseline correction beyond SBS
Beyond SBS, baseline correction was applied to insertion–deletion mutations (indels) in NTERT1 cells co-exposed to SSR and arsenic [52]. This resulted in an increase in cosine similarity to ID13 [10], a UV-associated signature, between the original and cleaned profile, from 0.592 to 0.974 (Fig. 3f). Moreover, correction of doublet base substitutions (DBS) maintained high concordance with the UV-associated DBS1 signature [58], with cosine similarity reaching 0.994 (from 0.986 to 0.994; Supplementary Fig. 5).
We further validated SigRescueR’s versatility using indel mutations from HIO exposed to colibactin. Consistent high cosine similarity was observed, increasing from 0.94 to 0.97 toward the colibactin-associated ID18 signature [54] (Supplementary Fig. 6A). Although an established DBS signature for colibactin is lacking, applying baseline correction to DBS mutations revealed a baseline activity of 10.3%, reflecting the background contribution to the total profile (Supplementary Fig. 6B). Collectively, these findings highlight SigRescueR’s robustness and broad applicability across various mutational categories.
Accuracy of baseline correction using strand-aware single-base substitutions classification
Accurate identification of transcription strand bias (TSB) is essential to understand how specific agents cause strand-specific DNA damage and repair, revealing key drug mechanisms and mutational outcomes [10, 59]. To further elucidate the impact of TSB on mutational signature rescue, we extended our analysis to SBS288 classifications. We applied SigRescueR across a range of biologically relevant samples: MEFs exposed to benzo[a]pyrene (B[a]P), AAI and ultraviolet light (UVC); BEAS-2B exposed to B[a]P; iPSCs exposed to AAI, cisplatin, and SSR; and colibactin-treated HIO [21, 25, 34, 47, 52] (Fig. 4, Supplementary Fig. 6C).
Beyond SBS96: integrating TSB information for cleanup. SBS-288 mutational signature plots show cleaned and original mutational spectra for (a) MEF exposed to B[a]P, (b) BEAS-2B exposed to B[a]P, (c) MEF exposed to AAI, and (d) iPSC exposed to AAI. Colors indicate mutations on transcribed (T) and untranscribed (U) strands. (e) Barplot summarizing cosine similarity to the reference, separated by original (UC = uncleaned) and inferred exposure mutational signatures (C = cleaned). Each dot represents a single sample. The total mutation count in the cleaned profiles is indicated within the plots. Cosine similarity scores compare reconstructed and original mutational spectra.
Incorporating TSB information reinforced the robustness of our approach across more granular mutational classifications. Similarity to the tobacco-associated signature SBS4 [5] improved in MEFs, rising from 0.85 to 0.958 at the SBS96 level (Supplementary Fig. 7A) and further to 0.963 at the SBS288 level (Fig. 4a), while exposure activity remained stable (Supplementary Fig. 7B). In BEAS-2B cells exposed to B[a]P, the cosine similarity before and after cleanup using the SBS288 channel remained stable, with approximately 6% baseline profile removed, while preserving TSB (Fig. 4b). Parallel improvements were observed in MEFs treated with AAI and UVC when compared to SBS22a [21, 56] and SBS7a [55], respectively (Fig. 4c, Supplementary Fig. 7C and D). Similarly, in AAI-treated iPSCs, cosine similarity increased modestly from 0.939 to 0.953 (Fig. 4d). Overall, SigRescueR maintains or improves cosine similarity with strand bias information (Fig. 4e) without affecting reconstruction accuracy (average cosine similarity of 0.98) (Fig. 4a–d), demonstrating that SigRescueR’s accuracy extends well to strand-aware mutational signature classifications.
Benchmarking against existing methods demonstrates rigor and outperformance
We benchmarked SigRescueR against several established approaches, including simple subtraction [25, 34], NNLS [39], ExpSigfinder [47], and SparseSignatures [38]. Given NNLS’s known propensity for overfitting [40], we implemented a 90% shrinkage strategy, where only 90% of the inferred baseline mutational signature is subtracted from the observed profile.
The main premise of baseline correction through signature decomposition is to parse the mutational spectra into known and, where applicable, latent signatures, while preserving the total mutation count. Across the five evaluated methods, SigRescueR excelled by closely matching the TMB between the reconstructed and the original profiles, outperforming simple subtraction (53.3 [32.5; 74.1]), NNLS (561 [236 886]), ExpSigfinder (8.3[6.1,10.5]), and SparseSignatures (−77.1 [−95.8; −53.8]) (Fig. 5a). Notably, subtraction and NNLS often attributed baseline activity per mutation context that exceeded the original mutation counts, resulting in negative mutation contexts, averaging 26 and 51 negative channels, respectively. Since negative mutations are biologically implausible and indicative of modeling artifacts, truncating to zero elevated mutation counts during reconstruction. Conversely, SparseSignatures underestimated mutation counts by leaving some mutations unassigned, resulting in lower reconstructed mutations. This comparison underscores SigRescueR’s balanced ability to accurately reconstruct mutation counts while avoiding biologically implausible artifacts. While SparseSignatures demonstrated remarkable alignment between reconstructed and original spectra, SigRescueR also delivered excellent performance, achieving a mean cosine similarity of 0.981 across 26 tested samples (Supplementary Fig. 8).
Models comparison. (a) Dot-and-whisker plot depicting count accuracy measured as the difference in total mutation count between reconstructed and observed profiles. Each dot represents the mean, and whiskers indicate the 95% range across all samples. The red dashed line marks perfect count accuracy. (b) Boxplot summarizing cosine similarity to the reference, separated by cleaning methods. (c) SBS-96 mutational signature plots showing the uncleaned (Original), untreated (Baseline), and the inferred exposure mutational signatures by five methods using iPSC exposed to cisplatin. Total mutation counts in cleaned profiles are indicated within the plots. Values in parentheses represent the cosine similarity between the inferred exposure profile and SBS31.
Moreover, we computed cosine similarity for exposure signatures inferred from simple subtraction, NNLS, ExpSigfinder, and SparseSignatures, against COSMIC reference, using SigRescueR as the benchmark (Fig. 5b). SigRescueR consistently demonstrated superior performance over competing methods, with variations in cosine similarity among top-performing samples averaging only 0.0125 (Fig. 5b). Figure 5C illustrates the signatures inferred across the five methods for cisplatin-treated iPSCs. Simple subtraction and NNLS yielded similar cosine similarity scores to SBS31 [24], a platinum-based chemotherapy COSMIC signature, at 0.88 and 0.87, respectively, achieved by clipping negative channels to zero. ExpSigfinder removed 30% of baseline signature, yielding a cosine similarity score of 0.78 due to inefficient cleanup and persistent leakage of background patterns into the denoised result. In contrast, SparseSignatures failed to detect any baseline signatures, while SigRescueR effectively removed baseline mutations and achieved the highest cosine similarity score equaling 0.903 to SBS31 signature.
Case studies using simulated and real patients’ tumor data
Unlike controlled experimental models, the human cancer genome accumulates mutations from a complex interplay of multiple mutational processes [10, 11], making it inherently noisier. We evaluated SigRescueR’s effectiveness in detecting and removing artifactual signatures, specifically SBS45 [32], caused by DNA shearing and 8-oxoguanine introduction during sequencing, and SBS54 [10], putatively linked to sequencing artifact. We injected varying levels of each signature into 100 randomly selected negative samples as confirmed by SigProfilerAssignment [49] (Supplementary Fig. 9A). Further, we applied the same evaluation to NNLS [46], ExpSigfinder [47], and SparseSignatures [48] for benchmarking. To assess performance, we computed precision and sensitivity (Fig. 6, Supplementary Fig. 9A). Among the four evaluated tools, SigRescueR achieved the highest overall F1 score for both SBS45 and SBS54, followed by NNLS in second place, ExpSigfinder in third and SparseSignatures in fourth (Fig. 6). While SigRescueR, NNLS and ExpSigfinder displayed high sensitivity, SigRescueR outperformed NNLS and ExpSigfinder by producing fewer false-positive calls, with precision values of 0.956, 0.879 and 0.826 for SBS45, and 0.956, 0.883 and 0.853 for SBS54, respectively. In contrast, SparseSignatures showed inconsistent performance, with high F1 scores for some samples and failing to detect the signatures in others (Supplementary Fig. 9A). Interestingly, several samples injected with SBS45 resulted in a precision less than 0.9 (n = 11), with precision dropping as low as 0.576 in one sample. Of the 11 samples, the true level of injections fell within the credible interval for six samples, appropriately accounting for the uncertainty. For the five remaining samples, four samples lay slightly outside the credible interval with a difference of 0.01 from the true injection value, while the fifth sample had a difference of 0.14. The overcalled mutations are likely a result of SBS45-like patterns within other COSMIC signatures. Applying SigRescueR across all COSMIC signatures revealed SBS45 attributions (Supplementary Table 4). For example, when analyzing SBS4, ~33% of extracted mutations were attributed to SBS45. Thereafter, we reassessed these five samples by combining the SBS45 attribution detected before injection with the true SBS45 injection values. This operation accounted for the undetected SBS45, resulting in an estimate that aligns with the initial posterior distribution and confirming that the low precision was due to undetected SBS45 signatures present in the original samples. For example, in PD51293a, the level of injection was 21%, yet the credible interval spanned 22%–27%. Cleaning the original profile revealed an additional 4.4% SBS45, which brings the estimate within the credible interval. Moreover, when comparing the original to the cleaned injected spectra, we observed strong preservation of the original spectra with a mean cosine similarity of 0.97 and 0.98 for SBS45 and SBS54, respectively (Supplementary Fig. 9B).
SigRescueR performance on simulated and real tumor data. Scatter plot showing the average precision and sensitivity for cleaning SBS45 or SBS54 across 100 randomly selected samples negative for these signatures. Results are compared between SigRescueR, SparseSignatures, ExpSigfinder, and NNLS. Each dot is colored according to the cleaning method used.
After validating SigRescueR’s performance, we applied the tool to real tumor samples harboring SBS45 and SBS54 signatures. SigRescueR effectively removed these signatures, as confirmed by SigProfilerAssignment [49] (Supplementary Fig. 9C–D). Importantly, we found that cleaning SBS45 preserved the integrity of existing SBS signatures. However, in one sample, it led to reassignment between two signatures with overlapping profiles. Specifically, SBS41 was reassigned to SBS93, and SBS5 to SBS40a (Supplementary Fig. 9C). This finding demonstrates SigRescueR’s high sensitivity in accurately distinguishing overlapping mutational patterns without broadly disrupting the mutational landscape. Furthermore, by removing artifactual signatures, bona fide signatures can emerge more clearly, offering biologically meaningful insights into the mutational processes driving cancer development. This underscores SigRescueR’s value not only for refining mutational profiles but also as a powerful tool for uncovering clinically relevant signatures.
Extending SigRescueR applications to duplex sequencing data
Error-corrected next-generation sequencing (ecNGS) is an emerging technology that provides substantial advantages over conventional whole-genome sequencing approaches by markedly reducing sequencing errors and enhancing the detection of low-frequency variants [13, 60]. Among ecNGS platforms, single-molecule duplex sequencing, based on sequencing both DNA strands independently to build consensus variant calls, has gained prominence as one of the most innovative and accurate methods available today [60]. Several duplex-based frameworks have been developed to improve the detection of exposure-related mutational signatures without the need for clonal expansion, thereby minimizing culture-induced artifacts [61–64]. However, despite their precision, these methods remain prone to technology-related artifacts. These residual artifacts obscure true biological signals and complicate the identification of true mutational signatures.
To address this challenge, we applied SigRescueR on duplex sequencing data from patients treated with 5-fluorouracil (5-FU) [65] using naïve and memory B and T cells, and monocytes. SigRescueR enabled the recovery of SBS17b, a COSMIC signature commonly ascribed to 5-FU treatment [66, 67] (Fig. 7a and b). In one patient, attribution to SBS17b increased from 411 to 571. Notably, SigRescueR revealed attribution to SBS17b in eight 5-FU samples that were previously negative for this signature (Fig. 7b), highlighting SigRescueR’s capability to recover treatment-induced signatures that were previously undetected.
SigRescueR performance with ecNGS. (a) SBS-96 mutational signature plots showing three unexposed memory T cell profiles (Normal) and inferred exposure signatures (Cleaned) alongside uncorrected original mutational spectra (Original) from an individual previously exposed to 5-fluorouracil. (b) Connected dot plot illustrating mutation counts attributed to SBS17b before (UC = uncleaned) and after cleaning (C = cleaned) across various individuals and cell types; each connected pair represents one sample. Statistical significance (P-value) was calculated using the Wilcoxon rank sum test. Scatter plots showing mutation counts for (c) B[a]P and (d) AAI across different concentrations before and after cleaning; each dot is a sample; fitted linear regression lines with equation and R2 values are shown. (e) Barplot showing residual error measured by RSME, separated by original (UC = uncleaned) and cleaned (C = cleaned) profiles for B[a]P and AAI. (f) SBS-96 mutational signature plots comparing inferred exposure (Cleaned) and uncorrected spectra (Original) for the 3D HepG2 model exposed to AAI. (g) Barplot summarizing cosine similarity of the 3D HepG2 model to SBS22a between original (UC = uncleaned) and cleaned (C = cleaned) profiles.
Next, we utilized duplex sequencing data from a mouse bone marrow model exposed to B[a]P across different dosage levels [68]. SigRescueR maintained a clear dose–response relationship in mutational impact and enhanced the linear correlation between dose and mutation frequency, reflected by a 2% increase R^2^ value (Fig. 7c). Applying SigRescueR on 3D HepG2 spheroids exposed to AAI [69] revealed a congruent trend, where R^2^ increased from 0.94 to 0.95 (Fig. 7d). This improvement in linear relationship also resulted in a reduction in the root mean squared error (RMSE) (Fig. 7e), indicating less deviation and variability between replicates, which is crucial for downstream analyses in cancer genomics and toxicology studies [13]. Furthermore, SigRescueR improved the cosine similarity to SBS22a from 0.87 to 0.91, validating its ability to refine ecNGS-derived mutational landscapes (Fig. 7f and g).
Collectively, these results demonstrate that SigRescueR not only removes low-confidence artifacts and sequencing noise from raw variant calls but also enhances the detection and interpretability of true mutational signatures. This approach establishes a high-confidence framework for ecNGS-based mutagenesis studies, expanding the analytical capabilities of duplex sequencing technologies.
Discussion
Deconvolution of mutational signatures is fundamental to understanding the complex mutational processes underlying cancer and environmental mutagenesis. However, accurately distinguishing true exposure-related signatures from background noise remains a persistent challenge due to sequencing errors and biological variability [7, 10, 28, 32, 33]. Our study demonstrates that SigRescueR addresses this challenge by implementing a robust baseline correction approach, effectively revealing bona fide mutational signals that are otherwise obscured by background noise.
SigRescueR is designed around three core strategies. First, it applies a strong baseline correction, carefully avoiding negative values, ensuring the inferred exposure mutation signature remains biologically meaningful. Second, it considers the overall shape of the observed by assessing how well the observed data fit the reconstructed profile using a Dirichlet distribution. Third, a COM-Poisson likelihood [41], penalizes large deviations from the observed count, preventing the exposure profile from absorbing leftover counts. The tool is well-suited for single-sample analyses.
To evaluate SigRescueR’s performance, we applied the tool across a broad spectrum of experimental models, incorporating murine and human systems with well-characterized spectra, in addition to patient-derived tumor datasets. We demonstrate that SigRescueR not only enhances the fidelity of reconstruction, but also reliably distinguishes true exposure signatures from non-mutagenic backgrounds, limiting false positive assignments, a critical feature for high-confidence mutational inference. Moreover, SigRescueR’s shows high reproducibility across biological replicates, even under variable exposure intensities, providing a key advantage for mutagenesis studies and mechanistic toxicology [14].
Moreover, we showcased SigRescueR’s ability to deliver consistent results across SBS96 and SBS288 classifications, as well as for indels and DBS. Compared to existing methods [25, 34, 46, 48], SigRescueR consistently matches or surpasses their performance, demonstrating superior accuracy and robustness. We show that simple subtraction [25, 34] and NNLS [46] can generate negative values when the background burden is higher than the exposure. In our study, the negative values, which are biologically implausible, are truncated to zero before comparison against the COSMIC reference. In contrast, SigRescueR inherently constrains all signature contributions to be non-negative. Conversely, SparseSignatures achieves high reconstruction accuracy [48], but this can cause background mutations being absorbed by the exposure signature.
Furthermore, we challenged SigRescueR to refine tumor sequencing data by removing sequencing-related artifacts. This task is particularly difficult due to the absence of a clearly defined true background and the complex mixture of mutational processes accumulated over time in tumor samples [5, 15, 16]. Our results demonstrate that SigRescueR effectively removes sequencing artifacts with high sensitivity and precision. Importantly, SigRescueR can be applied flexibly to clean other signatures beyond sequencing noise, such as SBS17, which is relevant when analyzing mouse samples [29], as well as APOBEC-related signatures (SBS2 and SBS13) [15, 22] that often dominate mutational spectra in patients. This adaptability addresses cases where signature attribution may be ambiguous, with some signatures potentially unassigned and absorbed by others, a challenge SigRescueR can identify and manage [70].
Leveraging duplex sequencing data, SigRescueR’s framework can detect novel low-frequency mutational signatures often obscured by technical artifacts, while also reducing variability between replicates. This capability expands our understanding of mutational heterogeneity, uncovers previously unrecognized mutational mechanisms contributing to carcinogenesis, and supports toxicology and cancer etiology studies.
Despite SigRescueR’s overall strength, we note several limitations. First, SigRescueR relies on Bayesian inference, producing posterior distribution rather than point estimates, that are summarized as posterior mean. We treat the posterior mean as point estimates, despite it being the most probable value rather than the value that minimizes the residual error. Second, SigRescueR is agnostic to the level of background, bounded by the priors, and will remove as much as the posterior determines given the data. If the default settings for summarizing the posterior does not meet the user’s needs, we provide the flexibility to explore the posterior distribution. Third, the computational time is directly proportional to mutation burden as SigRescueR computes the normalizing constant Z for all mutation context required for the COM-Poisson likelihood to ensure the probabilities sum up to one. Despite these limitations, SigRescueR provides a full distribution that can be used to assess uncertainty through credible intervals, rather than committing to a single estimate that likely reflects a minimization.
Taken altogether, our results demonstrate that SigRescueR enhances the accuracy of mutational signature analysis by rescuing biologically meaningful signals obscured by background noise, sequencing artifacts, or modeling bias, enabling new biological discoveries in data that would otherwise remain obscured. It consistently outperforms existing methods in reconstructing mutational signatures with near-perfect fidelity. Its versatility across diverse experimental models, species, sequencing technologies, and mutation classifications highlights its broad applicability. SigRescueR is a first-of-its-kind tool able to reliably identify true mutational signatures from exposure models, even with limited sample numbers. These qualities position it as a powerful tool for mutational signature analysis, capable of advancing studies in cancer genomics and other fields reliant on precise mutation pattern characterization. Future work can further extend its utility to additional data types, experimental conditions, and sequencing platforms by fostering deeper insights into mutagenic processes.
Key Points
- SigRescueR implements statistically rigorous baseline correction to separate true mutational signatures from background noise and technical artifacts, significantly improving the detection of biologically meaningful mutational processes.
- It leverages Bayesian inference to provide credible intervals for reliable exposure estimates, using data-driven priors for broad application and versatility.
- SigRescueR enables comprehensive mutational signature analysis by supporting multiple mutation classes, including single-base substitutions (SBS) with integrated strand bias information (e.g. SBS96 and SBS288 channels), insertions/deletions, and doublet base substitutions.
- The tool is extremely flexible and compatible with various sequencing platforms, such as whole-genome sequencing, whole-exome sequencing, and single-molecule duplex sequencing.
- SigRescueR enhances statistical power and robustness by reducing variability across biological replicates, facilitating more reliable cancer genomics and toxicology studies.
Supplementary Material
bbag099_Supplemental_Files
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Hanahan D, Weinberg RA. The hallmarks of cancer. Cell 2000;100:57–70.10647931 10.1016/s 0092-8674(00)81683-9 · doi ↗ · pubmed ↗
- 2Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144:646–74. 10.1016/j.cell.2011.02.01321376230 · doi ↗ · pubmed ↗
- 3Hanahan D . Hallmarks of cancer: new dimensions. Cancer Discov 2022;12:31–46. 10.1158/2159-8290.CD-21-105935022204 · doi ↗ · pubmed ↗
- 4Vogelstein B, Papadopoulos N, Velculescu VE. et al. Cancer genome landscapes. Science 2013;339:1546–58. 10.1126/science.123512223539594 PMC 3749880 · doi ↗ · pubmed ↗
- 5Alexandrov LB, Nik-Zainal S, Wedge DC. et al. Signatures of mutational processes in human cancer. Nature 2013;500:415–21. 10.1038/nature 1247723945592 PMC 3776390 · doi ↗ · pubmed ↗
- 6Volkova NV, Meier B, González-Huici V. et al. Mutational signatures are jointly shaped by DNA damage and repair. Nat Commun 2020;11:2169.32358516 10.1038/s 41467-020-15912-7PMC 7195458 · doi ↗ · pubmed ↗
- 7Zou X, Koh GCC, Nanda AS. et al. A systematic CRISPR screen defines mutational mechanisms underpinning signatures caused by replication errors and endogenous DNA damage. Nat Can 2021;2:643–57. 10.1038/s 43018-021-00200-0PMC 761104534164627 · doi ↗ · pubmed ↗
- 8Otlu B, Alexandrov LB. Evaluating topography of mutational signatures with Sig Profiler Topography. Genome Biol 2025;26:134.40394581 10.1186/s 13059-025-03612-8PMC 12093824 · doi ↗ · pubmed ↗
