MUSICiAn: genome-wide identification of genes involved in DNA repair via control-free mutational spectra analysis
Colm Seale, Marco Barazas, Robin van Schendel, Marcel Tijsterman, Joana P Gonçalves

TL;DR
This paper introduces MUSICiAn, a new method to identify genes involved in DNA repair by analyzing mutation patterns without needing control groups.
Contribution
MUSICiAn is a novel control-free method for identifying DNA repair genes using mutational spectra deviations.
Findings
MUSICiAn successfully estimates pseudo-controls for Repair-seq screens and identifies 476 genes.
The method recovers known DNA repair genes and highlights the spliceosome as a potential player.
MUSICiAn reveals new candidate genes for further investigation in DNA repair.
Abstract
Understanding DNA double-strand break (DSB) repair is crucial for the development of targeted anticancer therapies, yet the roles of many genes remain unclear. Recent studies show that disruption of known DSB repair genes can alter the sequence-specific distribution of mutations arising after DSB repair, suggesting that genome-wide perturbation screens could be leveraged to identify new DSB genes leading to distinct deviations from the expected wild-type distribution. Given the challenges of designing controls for a genome-wide screen, we explore the high gene throughput to forgo the use of traditional controls by reframing the analysis as an outlier detection problem, assuming that most genes have minimal influence on DSB repair outcomes. We propose MUSICiAn (Mutational Signature Catalogue Analysis), a compositional data analysis method that ranks gene perturbation impact on mutational…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8- —Holland Proton Therapy Center
- —National Institutes of Health10.13039/100000002
- —Netherlands Convergence for Health & Technology
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic factors in colorectal cancer · Cancer Genomics and Diagnostics · DNA Repair Mechanisms
Introduction
Double-strand breaks (DSBs) in DNA are critical cellular events that occur spontaneously due to endogenous processes like replication or external agents like ionizing radiation. Left unaddressed, DSBs can lead to genomic instability and eventually cell death or cancer [1]. As a result, cells have evolved a suite of mechanisms to repair DSBs, including the non-homologous end joining (NHEJ), microhomology-mediated end joining (MMEJ, also called alt-NHEJ), and homology-directed repair (HDR) pathways [1]. Understanding the roles that genes play in DSB repair can importantly contribute to the development of targeted therapies for diseases such as cancer [2]. For example, PARP inhibitor drugs are indicated to treat cancers with impaired HDR or BRCA gene function, whose synthetic lethality with PARP is leveraged to block DSB repair and cause a fatal accumulation of DNA damage in HDR- or BRCA-deficient cancer cells [3]. The ability to discover further opportunities for targeted therapy requires deeper insight into gene function, yet for many genes the link with DSB repair remains unclear.
To search for these links, research has turned to techniques that can simultaneously probe many genes at once. Initial screens measured the effects of gene silencing or knockout on readouts such as cell growth and proliferation, using these as proxies to identify repair factors [4–6]. However, inhibition of cellular growth is only indirectly related to DSB repair and could lead to results confounded by other mechanisms of cellular activity. Later studies have leveraged machine learning to predict chromatin regulators of repair from high-content imaging-based repair kinetics data [7], or to identify regulators of HDR from multi-omics data [8]. More recently, advances in CRISPR technology [2] have combined the induction of DSBs at precise loci with sequencing of the break site post-repair to analyze how gene disruptions alter mutational spectra, or the frequency distributions of mutations arising at DSB sites following repair [9]. Multiple studies have demonstrated that knockouts of certain genes yield distinct, sequence-specific mutational spectra [9], but focused on the screening of known DSB repair genes. Notably, the first genome-wide study characterizing the effect of gene perturbation on mutational spectra has been released. We obtained early access to the data from this study, termed Mutational Signature Catalogue (MUSIC, [10]).
Using CRISPR targeting with mutational spectra as readout, the primary approach to link genes to DSB repair is to quantify how much the mutational spectrum deviates from the expected wild-type distribution following the knockout of each individual gene. The larger the deviation, the higher the confidence that the perturbed gene has an effect on the outcomes and could thus be involved in DSB repair. Recent work by Hussmann et al. [11] defined this deviation as the “overall outcome redistribution activity,” quantified by a chi-squared-like statistic relying on nontargeting controls to determine the expected wild-type spectrum.
For genome-wide screens, a limited set of nontargeting controls might not be suitable. While the majority of genes is not expected to produce an effect on the mutational spectra, it is unclear whether targeting such genes could indirectly or mildly influence the outcomes, an effect that would not be appropriately captured by nontargeting controls. At the same time, it would be challenging to design realistic controls for all levels of variation at play in a genome-wide screen, while trying to maximize the coverage per mutational spectra and mitigate batch effects. We explore an alternative approach leveraging the assumption that most genes in a genome-wide perturbation screen have minimal impact on the mutational spectra to frame the data analysis task of identifying DSB repair genes as an outlier detection problem.
When analyzing mutational spectra, it is also important to consider their compositional nature. In other words, each mutational spectrum is a distribution of relative frequencies over a collection of mutation categories whose overall sum is one. This composition property introduces a negative correlation bias caused by dependencies between the different frequencies, where an increase for one mutation type necessarily causes a reduction in others. Ignoring the dependencies in compositional data using standard data analysis techniques can lead to misleading results and interpretation [12]. Additionally, the covariance structure of mutational spectra is likely to be skewed by the outlier gene knockouts that significantly affect DSB repair, emphasizing the need for methods tailored for compositional data analysis (CoDA).
We introduce MUSICiAn (Mutational Signature Catalogue Analysis), a computational approach to score gene associations with DSB repair via genome-wide mutational spectra analysis. MUSICiAn operates without nontargeting controls, framing the computational task as an outlier detection problem under the assumption that most genes have little influence on the mutational spectra. MUSICiAn uses the CoDA framework to address dependencies and outliers in genome-wide mutational spectra data, for an improved estimation of pseudo-controls. By ranking gene knockouts based on their robust deviation from the overall mutational spectra distribution, MUSICiAn provides a control-free approach for genome-wide discovery of DSB repair-related genes.
We evaluate the MUSICiAn estimation of pseudo-controls on the Repair-seq dataset, screening 476 DSB genes and 60 nontargeting controls [11]. We further apply MUSICiAn to the genome-wide MUSIC mutational spectra dataset, covering 18 406 genes, to investigate the ability of this control-free method to recover established repair genes and suggest new candidates for experimental validation.
Materials and methods
We introduce the MUSICiAn method using outlier detection to identify DSB repair genes from genome-wide CRISPR mutational spectra without traditional controls. The aim is to quantify the effect that each gene knockout produces on the mutational spectra relative to the expected wild-type or control spectra. In the absence of controls, MUSICiAn leverages the assumption that most genes have minimal impact on the mutational spectra to estimate the center of the mutational spectra distribution as a representative point, close to which the spectra will be most alike the expected wild type. To quantify the deviation, MUSICiAn calculates a distance between each spectra and the estimated center also taking the covariance of the spectra distribution into account. This is done using a combination of data transformation and robust covariance estimation designed to address dependencies and outliers in the mutational spectra data. Finally, MUSICiAn creates a unified gene outlier score based on the distances obtained across target sites.
Data and preprocessing
Mutational outcome data
We analyzed data from two gene perturbation screens with CRISPR-induced mutational outcome readout, Repair-seq, and MUSIC (Fig. 1 for an illustration of the experimental setup). The Repair-seq screen used CRISPR interference with each of 1573 single-guide RNAs (sgRNAs) to individually silence each of 476 DSB repair genes and 60 nontargeting control sgRNAs [11]. To generate mutational outcomes, Repair-seq used CRISPR–Cas9 to create DSBs for a single target site across the population of cells with and without silenced genes, in two biological replicates [10]. The genome-wide MUSIC screen was similarly set up, but used CRISPR knockouts rather than interference, with 89 571 sgRNAs spanning 18 406 genes, and generated outcomes for three target sites in two biological replicates. We downloaded the raw Repair-seq sequence data from the NCBI Sequence Read Archive (SRA), BioProject PRJNA746980, runs SRR15164738 and SRR15164739. We obtained early access to the MUSIC sequence data, available under NCBI SRA BioProject PRJNA1248447, runs SRR33046049–SRR33046054.
Illustration of CRISPR gene perturbation screens with mutational spectra readout. First, sequences are integrated into the genomes of cells via lentiviral transduction. Each sequence contains two elements: (i) a sgRNA-encoding region to knockout (MUSIC) or silence (Repair-seq) a single gene, and (ii) a region common to all integrated sequences to be targeted with CRISPR to produce the mutational spectra. After genomic integration, several days of cell culture are allowed for genes to be knocked out. Following this, MUSIC again uses lentiviral transduction to introduce sgRNAs targeting the common region to the Cas9-expressing cells. Repair-seq uses electroporation to introduce Cas9 RNP complexes to the cells to induce DSBs at the target site. After allowing time for cell culture for DNA cleavage and repair, DNA sequencing was performed to capture the final CRISPR repair products. Created in BioRender: Seale, C. (2025) https://BioRender.com/ufk5u50.
We called mutations from the sequence data using the Sequence Interrogation and Quantification (SIQ) tool [13] v4.3 with parameters “-m 2 -c -e 0.05,” specifying a minimum number of two reads for counting an event, the collapsing of identical events to a single record with the sum of counts, and a maximum permitted base error rate of 0.05. The SIQ tool mapped the reads to the sgRNAs used for gene perturbation and identified mutations observed at the CRISPR–Cas9-induced DSB sites.
Mutation aggregation and categorization
To reduce sparsity and improve statistical power, the fine-grained mutational outcomes output by the SIQ tool were aggregated into eight higher-level categories: wild-type, denoting a sequence without mutations; deletion with 1\2\3+bp\no microhomology, for a deletion overlapping the cut site with a microhomology (MH) of length 1 bp to 3+ bp or no MH at all, where an MH is a short homologous sequence on both sides of the DSB and used for repair by MMEJ [1]; insertion, for a new sequence added at the cut site; deletion with insertion, for a combination of deletion and insertion; and homology-directed repair, for any insertion matching the donor template DNA (Supplementary Table S3 for SIQ versus MUSICiAn categories). Any other rare mutation types, such as single-base substitutions, were excluded, since they are not typical outcomes of DSB repair. Wild-type reads were excluded to avoid confounding, as changes in their abundance could arise from factors unrelated to DSB repair, such as gene essentiality or reduced Cas9 cleavage efficiency. We thus considered a final set of seven mutation categories.
Quality analysis and filtering of perturbation sgRNAs
For each replicate, we filtered out perturbation sgRNAs yielding a total read count below 700 across the seven mutation categories, and excluded genes with less than three associated perturbation sgRNAs. Additionally, we controlled for inconsistencies in the effect of the different sgRNAs used for perturbation of the same gene, which could be indicative of sgRNA off-target effects, less effective gene perturbation, or any other undesirable effect. We excluded sgRNAs whose count profiles over the seven mutation categories showed a median pairwise Pearson’s correlation below 0.6 with the profiles for other sgRNAs perturbing the same gene within the same replicate (and target site), or a median pairwise Pearson’s correlation below 0.6 with replicate profiles for the same sgRNA and target site (Supplementary Tables S1 and S2). To avoid numerical issues with the data transformation applied by MUSICiAn later on, in the rare cases where some mutation categories had zero counts, we imputed real values drawn independently from a uniform distribution between the detection limit \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ DL}\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 0.1 \times \mathrm{ DL}\end{document} , where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ DL} = 1\end{document} . [14].
Generating mutational spectra per gene
We first computed mutational spectra by dividing the count of each of the seven mutation categories by the total per sgRNA and replicate. Then, we aggregated across sgRNAs by calculating the geometric mean of the sgRNAs-associated spectra per gene and replicate, producing replicate spectra per gene (two for each target site). Finally, we computed the geometric mean between replicate spectra per target site, resulting in one mutational spectra per gene and target site. After every aggregation step, the frequencies in each mutational spectra were divided by the sum to make sure they summed to one.
MUSICiAn scoring of gene effect on mutational spectra
The MUSICiAn method scores genes for DSB repair association by computing the distance between the mutational spectrum of each perturbed gene and the estimated center of the distribution of all spectra obtained for a given target site (Fig. 2). For experiments with multiple targets, target-specific scores can be normalized and averaged to produce one single gene score. Genes with larger scores have a more prominent effect on the mutational spectra, thus also a higher likelihood of being involved in DSB repair. Genes with the lowest scores are assumed to approximate the wild-type or control distribution. We are interested in both the most outlying and the most central spectra for downstream analysis.
Overview of MUSICiAn scoring of gene effect on mutational spectra. The method quantifies the effect of gene perturbation using the Mahalanobis distance of the gene mutational spectrum to the estimated center of the spectra distribution of all genes, under the assumption that most genes have a negligible effect. Estimation is improved using isometric log-ratio (ILR)-transformed spectra and robust covariance [(minimum covariance determinant (MCD)] to mitigate biases from data closure and outliers. Distances are normalized and averaged across target sites to produce a unified gene effect score. Created in BioRender: Seale, C. (2025) https://BioRender.com/q18ev42.
Gene scoring
To calculate gene effect scores, MUSICiAn computes the Mahalanobis distance [15] per gene spectrum relative to the center of the overall spectra distribution per target site, assuming that most genes do not have a measurable impact on the mutational spectra. Genes that alter the mutational spectra are likely to be directly involved in DSB repair (Fig. 2). We note that, depending on the experimental design, not all DSB repair genes may have a measurable effect on the mutational spectra. We chose the Mahalanobis distance as it takes the distribution and covariation of the data into account, unlike the Euclidean distance. Informative Mahalanobis distances require reliable covariance estimation, which is affected by: data closure, where dependencies between spectra categories summing to one introduces a negative correlation bias [12]; and outlier genes with a significant impact on mutational spectra and therefore also on the distribution. To mitigate data closure, MUSICiAn applies an ILR transformation [16] to the mutational spectra using the defaults for scikit-bio 0.5.4. The ILR transformation maps the data from a constrained simplex space to an unconstrained Euclidean space, allowing for independent statistical analysis of components. To mitigate outlier effects, MUSICiAn uses the MCD as a robust covariance matrix estimator [17], using scikit-learn 1.2.1 defaults. The MUSICiAn method calculates the robust Mahalanobis distances for each ILR-transformed spectra and unifies the individual distances into gene scores across target sites by: (i) selecting the common genes with mutational spectra in all target sites to act as a reference, (ii) calculating the mean and standard deviation of the distances of the reference genes per target site, (iii) normalizing all gene distances per site by subtracting the means and dividing standard deviations to place them on a common scale, and (iv) averaging the normalized gene distances across sites, ignoring missing values, to produce a final unified gene score.
Pseudo-control selection
The target-specific distances and unified gene scores enable the selection of “pseudo-controls” as the lowest-scoring genes per target or common across target sites. These pseudo-controls enable comparative analyses by estimating the central tendency of traditional controls, but may not recapitulate their natural variation.
Evaluation
Before applying MUSICiAn to the genome-wide MUSIC dataset, we assessed the outlier detection and pseudo-control selection on the Repair-seq dataset, the only other dataset available of CRISPR mutational spectra for multiple individual gene knockouts. While not a genome-wide dataset, Repair-seq included nontargeting controls, allowing us to assess whether and how well MUSICiAn could estimate the wild-type distribution center. Furthermore, Repair-seq data focused on genes involved in DNA repair, so we also checked whether MUSICiAn could recover similar mutational patterns for the genes screened in both studies. We preprocessed the Repair-seq data as described and held out the nontargeting controls from the scoring for later validation.
Estimation of pseudo-controls
We used principal component analysis (PCA) to visualize the effect of ILR transformation and robust MCD covariance, proposed to mitigate compositional data closure and outlier spectra, on the estimation of the mutational spectra distribution center and selection of pseudo-controls for the Repair-seq dataset. We applied PCA in four scenarios: Classical Covariance, using the original mutational spectra with the classical covariance estimation; MCD Covariance, using the original spectra with the outlier-robust MCD covariance estimation; ILR and Classical Covariance, using ILR-transformed spectra with classical covariance estimation; and ILR and MCD Covariance, using ILR-transformed spectra with MCD covariance estimation. After ILR transformation, location, and covariance estimation, we back-transformed the data to centered log-ratio space to analyze the relation between PCA components and mutation categories [17].
To evaluate pseudo-control selection, we identified 60 pseudo-controls for each scenario and calculated the Jensen–Shannon distance (JSD) between the geometric means of the nontargeting control and the pseudo-control spectra. As a baseline, we also calculated the distance from the nontargeting control spectra to the geometric mean across all gene-targeting sgRNAs, without pseudo-control selection. The JSD quantifies the distance between distributions, where a lower distance indicates greater similarity between distributions.
Cross-dataset estimation of pseudo-controls
To further assess the selection of pseudo-controls, we analyzed the consistency in mutational patterns retrieved for the MUSIC and Repair-seq datasets, using pseudo-controls estimated by MUSICiAn jointly from the two datasets. Specifically, we applied MUSICiAn to select 60 pseudo-controls for the set of all mutational spectra associated with the 434 genes shared across both datasets, with four target sites in total (three for MUSIC, one for Repair-seq). We then calculated the difference in mutation frequency per category between each gene-related mutational spectra and the geometric mean of the pseudo-controls. Finally, we performed hierarchical clustering [18] on the resulting difference matrix, using Ward cluster linkage and distance between samples based on Pearson’s correlation.
Gene scoring and ranking performance
To evaluate the quality of the MUSICiAn-derived gene effect scores for the genome-wide MUSIC dataset, we examined whether MUSICiAn could effectively recover genes with known links to DSB repair by scoring or ranking them higher than other genes based on their effect on the mutational spectra. We assessed performance separately using precision-recall (PR) curves against known DSB repair genes from two sources: 476 experimentally validated genes curated by Repair-seq for their AX227 CRISPRi library [11], and 295 genes whose annotations matched the regex “double-strand break repair \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \vert\end{document} interstrand cross-link repair” (interstrand cross-link repair genes often crosstalk with DSB repair pathways such as HR [19]) in any field in the Gene Ontology (GO) [20, 21]. For baseline comparison, we calculated PR curves after randomly ranking all genes in the MUSIC dataset. We preferred PR rather than receiver-operating characteristic (ROC) curves, given that the dataset is highly imbalanced, where most genes have no known association with DSB repair and are therefore considered negative for the purpose of the evaluation.
Functional enrichment for top 500 ranked genes
We performed enrichment analysis for the top 500 genes ranked by MUSICiAn against the background of all genes in the MUSIC dataset, using the “gseapy” Python package 1.0.4. We employed four sets of annotations, including KEGG pathways “KEGG_2019_Mouse” [22], and GO terms across the three ontologies “GO_Biological_Process_2023,” “GO_Molecular_Function_2023,” “GO_Cellular_Component _2023.” We performed a hypergeometric test per term within each gene set, and the resulting P-values were FDR-corrected using the Benjamini–Hochberg method. [23]. We further estimated the effect of the genes annotated with each of the top 10 enriched terms or pathways on the mutation frequencies separately for the four annotation sets. To do this, we fitted an ordinary least squares regression model per term \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} t\end{document} and mutation outcome category \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} o\end{document} to explain the variation in mutation frequency ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ Frequency}\end{document} ) based on term or pathway membership ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ Group}\end{document} ), according to the following R-style formula
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} {{\mathrm{ Frequency}_{g,o}}} \sim {{\mathrm{ Group}_{g,t}}}, \end{eqnarray*}\end{document}where each sample is a mutational spectra for a given gene knockout \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} g\end{document} . \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ Frequency}{g,o}\end{document} variable denotes the frequency of the given mutation outcome \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} o\end{document} for gene \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} g\end{document} , and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ Group}{g,t}\end{document} is a binary variable indicating whether gene \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} g\end{document} is a member of term or pathway \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} t\end{document} . As case samples, we took the mutational spectra of all genes annotated with the enriched term in question. As control samples, we used the set of 100 pseudo-controls or lowest scoring genes, with valid mutational spectra across all target sites, and that were not members of any of the enriched pathways. We used the same control samples for the regression analysis of every annotation set, and report the regression coefficients and Benjamini–Hochberg-corrected P-values for the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ Group}\end{document} variable.
Results
MUSICiAn can estimate absent control mutational spectra
We first assessed the ability of MUSICiAn to estimate pseudo-control mutational spectra in the absence of actual controls. To do this, we applied MUSICiAn to the Repair-seq dataset, containing mutational spectra for one target site across knockouts of 476 different genes and 60 actual nontargeting controls. The actual controls were left out to be able to quantify how well they could be recovered by MUSICiAn.
We also isolated the contributions of the ILR transformation and robust covariance (MCD) used by MUSICiAn to investigate whether they improved the estimation of the distribution center location and covariance, and ultimately the selection of pseudo-controls, in the presence of outlier spectra and negative correlation bias.
To visually examine the effect of ILR transformation and MCD on the distribution, we applied PCA to the original and ILR-transformed mutational spectra separately using classical PCA and a robust variant of PCA based on the MCD.
The estimated center of the distribution appeared to align the best with the center determined based on the actual controls (geometric mean of the 60 nontargeting controls) when both ILR and MCD were used to respectively address data closure and outliers in the mutational spectra data (Fig. 3A, “ILR and MCD Covariance” versus others). We further observed that the pseudo-controls selected as the 60 mutational spectra closest to the center of the distribution estimated by MUSICiAn, using any of the four combinations of spectra and covariance types, were far more similar to the actual nontargeting controls than the average across all spectra. Specifically, the JSD between the geometric means of pseudo-controls and nontargeting controls were one order of magnitude smaller than those between the geometric means of all spectra and nontargeting controls (respectively, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ <} 0.005\end{document} and 0.012, Fig. 3B). Moreover, the selection of pseudo-controls using the preferred combination of techniques in MUSICiAn, ILR transformation and robust MCD covariance produced the closest match with the actual nontargeting controls than the other three (JS distances \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 3.73 \times 10^{-3}\end{document} against \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 3.77 \times 10^{-3}\end{document} , \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 4.42 \times 10^{-3}\end{document} , and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 4.69 \times 10^{-3}\end{document} ; Fig. 3B). This result supported our choice to place ILR transformation and MCD at the core of the MUSICiAn outlier detection algorithm.
Evaluation of MUSICiAn-selected pseudo-controls based on the estimated mutational spectra distribution center for Repair-seq. Effect of ILR transformation and MCD covariance on (A) the estimated center of the mutational spectra distribution and (B) the selected pseudo-controls, using the original or ILR transformed spectra with classical or MCD covariance. For panel (A), actual center (black cross) of the mutational spectra distribution as the geometric mean of the 60 actual controls (yellow points), and center estimated by MUSICiAn (red cross) based on the mutational spectra under gene-silencing (blue points), projected onto the two axes of largest variation in the data (first two principal components). For panel (B), JSD between the geometric means of the 60 actual nontargeting controls and either all mutational spectra or the 60 pseudo-controls closest to the center estimated using each of the four combinations of spectra and covariance types.
We note that the majority of the 476 genes characterized in the Repair-seq screen are known to be involved in DNA repair, and therefore the assumption that most genes should not have an impact on the mutational spectra was in theory not necessarily met for this dataset. However, in practice, a large proportion of DNA repair genes still showed little effect on mutation frequencies (Fig. 3). The fact that MUSICiAn was able to recover controls in this scenario highlights that it could be applicable to more focused studies beyond genome-wide screens whenever a similar reasonable assumption can be made, for instance based on prior knowledge or the actual distribution of the data.
MUSICiAn controls reveal known repair patterns across studies
We further questioned whether MUSICiAn could estimate pseudo-controls for mutational spectra aggregated from different studies, such that consistent mutational repair patterns would be revealed when applying the same controls as a baseline across the studies. To address this, we jointly analyzed the mutational spectra for knockouts of the 434 genes screened in both the genome-wide MUSIC and the focused Repair-seq studies. After selecting pseudo-controls, we calculated the differences between the frequencies in each mutational spectrum, obtained under silencing or knockout of a specific gene, and the geometric mean of the pseudo-controls (Fig. 4). We also performed hierarchical clustering of genes and mutation categories based on those differences (Fig. 4). The results revealed consistency in how HDR and insertion events were influenced by silencing of specific genes across targets and studies, as well as broadly consistent patterns for other mutation types with larger variations that could be attributed to differing target-site-specific characteristics within and between studies.
Heatmap of the difference in mutation frequencies between each spectrum obtained for the knockout of a specific gene and the geometric mean of the pseudo-control spectra selected by MUSICiAn, per mutation category, and target site. Shown are the top 100 genes with the highest MUSICiAn outlier score across target sites (3 for MUSIC, denoted T1–T3, and 1 for Repair-seq). The horizontal axis represents genes. The vertical axis represents mutational outcomes, colored by target site. Data were clustered on both dimensions, genes and mutation categories, using hierarchical clustering with Ward cluster linkage and distance between spectra based on Pearson’s correlation coefficient.
Gene clustering also identified meaningful groups, including the Fanconi anemia core complex and related genes, whose silencing suppressed HDR events (Fig. 4). Interestingly, Helq displayed a mutational pattern similar to these genes, suggesting a potential association with FA and HDR, a topic of ongoing debate [24, 25]. Other notable clusters included: mismatch repair MutS homolog genes (Msh2, Msh6); ring finger protein genes with roles in DNA damage sensing and repair (Rnf8, Rnf168); NHEJ genes involved in early recognition of DNA damage and recruitment of additional repair factors (Xrcc5, Xrcc6), and in the processing of DNA ends (Artemis complex Prkdc and Dclre1c); and the MRN complex with roles in ATM checkpoint activation in response to DNA damage and also the tethering of broken DNA ends for further processing by NHEJ and HDR (Mre11a, Rad50, Nbn). The consistency in gene silencing effects on mutational spectra across the MUSIC and Repair-seq datasets, along with the identification of groups of genes with related function in DNA damage response, provided support for the effectiveness of the MUSICiAn control-free analysis in estimating pseudo-controls, quantifying effects, and ultimately generating meaningful insights from CRISPR targeting under gene silencing screens with mutational spectra readout.
MUSICiAn recovers known gene–DSB repair associations
In addition to estimating pseudo-controls, MUSICiAn attributes an outlier score to each gene, which determines the multivariate effect of gene silencing on mutational spectra to suggest (novel) associations between the gene and DNA damage response. In this context, we first applied MUSICiAn to the genome-wide MUSIC dataset to assess whether it could recover known repair genes. Genes were ranked by their MUSICiAn outlier score, and the ranking was evaluated against the set of 476 genes curated by Repair-seq and an alternative set of 295 genes retrieved from the GO. The closer to the top of the ranking these genes appeared, the better the results. We also performed the same evaluation on a randomly shuffled ranking as a baseline for comparison. The MUSICiAn method showed superior rankings for known associations with area under the PR curve (AP) of 0.07 and 0.08 for the Repair-seq and GO gene sets, respectively, compared to an AP of 0.02 for the random baseline (Fig. 5).
Performance of MUSICiAn, recovery of known DNA repair genes. PR curves using MUSICiAn ranking (orange) or random ranking (blue) against the following gold standards for evaluation: (left) curated genes in Repair-seq and (right) genes annotated with DSB repair-related in GO.
Pathway enrichment analysis of the top 500 genes using KEGG annotations revealed significant associations with the “Fanconi anaemia” and “homologous recombination” pathways (Fig. 6). A link with “nucleotide excision repair” was also identified, supporting the idea that single- and double-strand repair mechanisms are functionally intertwined [26]. Another enriched pathway, “cell cycle,” is known to influence DNA repair pathway choice [26]. Many DSB repair genes were also implicated in the “DNA replication” pathway [27].
Top 10 enriched KEGG pathways and GO terms among the top 500 genes ranked by MUSICiAn across targets for the genome-wide MUSIC dataset. Top left to bottom right—KEGG pathways, GO biological processes, GO molecular functions, and GO cellular components. The horizontal axis shows the ratio between the numbers of genes annotated with the pathway or GO term among the top 500 ranked genes versus all genes. Circle color denotes the negative log10 of the FDR-corrected \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document}-value, and circle size indicates the number of genes annotated with a pathway or GO term among the top 500 ranked genes.
Functional enrichment analysis of the top 500 genes using GO annotations revealed links with repair-related biological processes (Fig. 6), including “DNA repair,” “double-strand break repair,” “double-strand break repair via homologous recombination,” and “interstrand cross-link repair,” further reinforcing the ability of MUSICiAn scores to capture and prioritize effects of genes on mutational spectra following the repair of CRISPR-induced DSB sites. Regarding molecular function, various binding activities, including DNA, damaged DNA, and ubiquitin-like protein ligase binding, as well as single-strand DNA helicase activity were identified, all functions required for DNA damage signaling and repair [28–30] (Fig. 6).
Overall, MUSICiAn recovered known patterns and associations relevant to the repair of double-strand DNA breaks. While the AP performance may appear modest, it is significantly better than random. Nevertheless, mutational spectra exhibited relatively low coverage per sgRNA (median: MUSIC 2361.08 versus Repair-seq 565201.97), leading to noisier mutational spectra that posed additional challenges in differentiating between true repair factors and noisy samples. Moreover, the assumption that mutational spectra deviating from the expected wild type arise upon silencing of genes associated with DNA repair does not preclude the existence of other genes involved in DNA repair that do not affect mutational spectra. Such genes may not play a central role in the pathway, or their loss of function may be compensated by other genes, resulting in smaller effects and appropriately lower MUSICiAn rankings, while negatively biasing the AP.
MUSICiAn identifies lesser-appreciated players in DSB repair
After analyzing established genes and pathways, we also examined several lesser-recognized pathways and processes emerging from the MUSICiAn analysis of the MUSIC dataset. Intriguingly, “Ribosome biogenesis in eukaryotes” was the top enriched KEGG pathway (Fig. 6), aligning with emerging literature from the last decade suggesting a potential crosstalk between ribosome biogenesis and DNA repair pathways [31]. Recent studies have also implicated the nucleolus, a major site of ribosome synthesis and the top enriched cellular component, in the regulation of cellular processes, including DNA repair [32–34].
The proteasome and spliceosome were additionally identified as enriched pathways. The proteasome plays a role in the regulation of the Rnf8-Rnf168 pathway, which itself works to recruit repair factors to DSB sites [29, 35], and the inhibition of which has been previously shown to reduce HDR events [36]. As for the spliceosome, there is growing evidence of a role in DNA repair, with studies suggesting that splicing regulates the expression of Rnf8, further controlling ubiquitin-signaling at DSBs [30].
Enriched pathways promote homology-directed repair
We analyzed how the genes in the identified pathways influenced the frequencies of different mutation types by fitting a linear regression model per pathway, mutation type, and target site, and using the mutation type frequency per gene knockout and target site as response variable. Some pathways lacked sufficient gene representation to fit a reliable regression model ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{ < }3\end{document} samples) and were excluded on a per-analysis basis (Fig. 7).
Effect of genes annotated with the top 10 enriched pathways on mutation frequency. We considered all genes in the top 10 pathways enriched among the top 500 genes ranked by MUSICiAn based on the genome-wide MUSIC spectra, with the final number of genes available per target dependent on the quality of the obtained mutational spectra. Each dot denotes a linear regression analysis of gene effect on mutation frequency per term or pathway (vertical axis, with gene count), mutation category (horizontal axis), and target site (panels for targets T1–T3 in MUSIC). Dot color denotes the regression coefficient, and dot size indicates the negative log10 of the FDR-corrected P-value. Points with nonsignificant corrected P-values (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document}) were excluded.
Based on the fitted models, we observed that the genes in each of the enriched pathways promoted HDR events and repressed insertion events across the target sites in the MUSIC genome-wide screen (Fig. 7, T1, T2, T3). Since NHEJ has been associated with introducing insertions at CRISPR-induced DSB sites [37, 38], we suggest that the rise and fall in the frequency of insertion and HDR events could reflect a change in the fraction of DSBs repaired via the NHEJ and HDR pathways. On the other hand, patterns pertaining to the promotion or inhibition of deletion events with or without MH were more sequence-context dependent, making it difficult to associate an inhibited pathway with how it might influence NHEJ and MMEJ. We note that the additional variation exhibited by MUSIC target site T1 could be an artifact of the noisier mutational profiles obtained for that target.
MUSICiAn identifies novel gene–DSB repair associations
Analysis of the top 5 genes ranked by MUSICiAn for the genome-wide MUSIC dataset (Fig. 8) revealed two well-known DSB repair genes, H2ax [39] and Xrcc5 [40]. The others three genes, Atp6v1g1, Metap2, and H2ac18, were not annotated with the “double-strand break repair” GO term. The top-ranked gene was Atp6v1g1, for which one other study has reported an effect on HDR repair frequency after knockdown of Atp6v1g1 via RNA interference [41]. The MUSIC spectra for target sites T2 and T3 showed a relative decrease in the frequency of HDR events after CRISPR knockout of Atp6v1g1 compared to the geometric mean of the pseudo-controls. A similar tendency was observed for Metap2, a gene associated with ribosomal activity, and for H2ac18, a histone gene. Identifying histones is not surprising, as the chromatin state regulates DNA damage response by modulating accessibility to DNA damage sites by repair factors [42]. However, to our knowledge, no previous studies have identified an influence of Metap2 or H2ac18 on DNA repair pathways or HDR in particular. Further experimentation will be required to validate the impact of these top-ranking genes on mutational spectra and to investigate their role within the DSB repair process.
Mutational spectra of the top 5 genes ranked by MUSICiAn based on the genome-wide MUSIC screen. Colored dots denote the frequency obtained under knockout of the indicated gene. The vertical axis shows mutation types. The horizontal axis shows frequency. The grey lines represent 300 randomly sampled genes. The black lines show the geometric mean of the pseudo-controls. The colored dots show the top genes. Some dots are not shown for T1, as the sgRNAs were filtered out during quality analysis.
Conclusion
In this work, we introduced MUSICiAn, a control-free method to identify genes involved in DSB repair from gene perturbation screens with mutational spectra readout. MUSICiAn is developed for genome-wide perturbation screens and leverages the fact that most genes have negligible influence on DSB repair and mutational spectra to frame the discovery as an outlier detection task. The goal of MUSICiAn is to both estimate the central tendency and identify genes with outlying spectra by analyzing the distribution of all mutational spectra.
Pseudo-controls estimated by MUSICiAn provided a good approximation of the actual nontargeting controls available for the Repair-seq dataset, showing that MUSICiAn could also be effective at subgenome scale, provided the assumption that most genes have minimal effect on the spectra can reasonably be made. Notably, the combination of ILR transformation and robust covariance used by MUSICiAn contributed to an improved estimation of the central tendency and pseudo-controls.
Further MUSICiAn analysis of the genome-wide MUSIC data demonstrated an ability to recover known DSB repair genes and suggest candidates for further investigation, including Atp6v1g1, Metap2, and H2ac18. Our findings indicated that genes involved in ribosome biogenesis, the proteasome, and the spliceosome could play a significant role in modulating the frequency of HDR events, suggesting their involvement in DSB repair.
Obtaining sufficient coverage in genome-wide perturbation studies with sequence-based output remains a challenge that has also been noted in prior studies [11]. Low coverage could limit the ability to detect subtle changes in mutagenic activity for rarer outcomes as the data become too sparse. To address this, we chose to aggregate mutational outcomes into broader categories. However, MUSICiAn could be applied with any collection of outcomes, as fine-grained as desired, and as the resolution across the different outcomes allows.
Overall, the results of MUSICiAn on the Repair-seq and the genome-wide MUSIC datasets highlighted that the method can effectively estimate pseudo-controls and identify genes with an impact on mutational spectra, enabling analyses of large-scale screens where designing realistic controls may be challenging.
Supplementary Material
lqaf202_Supplemental_File
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ceccaldi R, Rondinelli B, D’Andrea AD. Repair pathway choices and consequences at the double-strand break. Trends Cell Biol. 2016;26:52–64. 10.1016/j.tcb.2015.07.009.26437586 PMC 4862604 · doi ↗ · pubmed ↗
- 2Awwad SW, Serrano-Benitez A, Thomas JC et al. Revolutionizing DNA repair research and cancer therapy with CRISPR–Cas screens. Nat Rev Mol Cell Biol. 2023;24:477–94.36781955 10.1038/s 41580-022-00571-x · doi ↗ · pubmed ↗
- 3Chen A . PARP inhibitors: its role in treatment of cancer. Chin J Cancer. 2011;30:463. 10.5732/cjc.011.10111.21718592 PMC 4013421 · doi ↗ · pubmed ↗
- 4Hurov KE, Cotta-Ramusino C, Elledge SJ. A genetic screen identifies the Triple T complex required for DNA damage signaling and ATM and ATR stability. Genes Dev. 2010;24:1939–50. 10.1101/gad.1934210.20810650 PMC 2932975 · doi ↗ · pubmed ↗
- 5O’Connell BC, Adamson B, Lydeard JR et al. A genome-wide camptothecin sensitivity screen identifies a mammalian MMS 22L–NFKBIL 2 complex required for genomic stability. Mol Cell. 2010;40:645–57. 10.1016/j.molcel.2010.10.022.21055985 PMC 3006237 · doi ↗ · pubmed ↗
- 6Olivieri M, Cho T, Álvarez-Quilón A et al. A genetic map of the response to DNA damage in human cells. Cell. 2020;182:481–96. 10.1016/j.cell.2020.05.040.32649862 PMC 7384976 · doi ↗ · pubmed ↗
- 7Martinez-Pastor B, Silveira GG, Clarke TL et al. Assessing kinetics and recruitment of DNA repair factors using high content screens. Cell Rep. 2021;37:110176. 10.1016/j.celrep.2021.110176.34965416 PMC 8763642 · doi ↗ · pubmed ↗
- 8Sherill-Rofe D, Raban O, Findlay S et al. Multi-omics data integration analysis identifies the spliceosome as a key regulator of DNA double-strand break repair. NAR Cancer. 2022;4:zcac 013. 10.1093/narcan/zcac 013.35399185 PMC 8991968 · doi ↗ · pubmed ↗
