A Murine Database of Structural Variants Identifies A Candidate Gene for a Spontaneous Murine Lymphoma Model
Wenlong Ren, Zhuoqing Fang, Egor Dolzhenko, Christopher T. Saunders, Zhuanfen Cheng, Victoria Popic, Gary Peltz

TL;DR
A database of genetic variations in mice identifies a gene linked to lymphoma susceptibility, offering insights into cancer genetics.
Contribution
A high-quality database of structural variants in 40 mouse strains reveals a candidate gene for lymphoma susceptibility.
Findings
A long deletion in SJL mice removes a tumor suppressor gene linked to B-cell lymphoma susceptibility.
The database includes 573,191 structural variants across 40 inbred mouse strains.
The identified SV contributes to a bi-genic model for lymphoma susceptibility in mice.
Abstract
A more complete map of the pattern of genetic variation among inbred mouse strains is essential for characterizing the genetic architecture of the many available mouse genetic models of important biomedical traits. Although structural variants (SVs) are a major component of genetic variation, they have not been adequately characterized among inbred strains due to methodological limitations. To address this, we generate high‐quality long‐read sequencing data for 40 inbred strains; and design a pipeline to optimally identify and validate different types of SVs. This generates a database for 40 inbred strains with 573,191 SVs, which include 10,815 duplications and 2,115 inversions, which also has 70 million SNPs and 7.5 million insertions/deletions. Analysis of this SV database identifies an SV that can be one component of a bi‐genic model for lymphoma susceptibility in SJL mice, which…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
FIGURE 1
FIGURE 2
FIGURE 3
FIGURE 4
FIGURE 5
FIGURE 6| ‒ | ‒ | Cue | Sniffles2 | Sawfish | PBSV | Hifiasm Dipcall |
|---|---|---|---|---|---|---|
| ‒ | ‒ |
|
|
|
|
|
|
|
| 1 | 1 | 1 | 1 | 0.99 |
|
| 0.994 | 0.999 | 1 | 1 | 0.702 | |
|
| 0.996 | 0.9995 | 1 | 1 | 0.821 | |
|
|
| 1 | 0.991 | 0.989 | 0.997 | 1 |
|
| 0.998 | 0.548 | 0.995 | 0.648 | 0.494 | |
|
| 0.999 | 0.706 | 0.992 | 0.785 | 0.661 | |
|
|
| NA | 0.998 | 1 | 1 | 0.987 |
|
| NA | 0.978 | 0.982 | 0.979 | 0.811 | |
|
| NA | 0.988 | 0.991 | 0.989 | 0.89 | |
|
|
| NA | 1 | 1 | 1 | 1 |
|
| NA | 0.182 | 0.358 | 0.13 | 0.936 | |
|
| NA | 0.308 | 0.527 | 0.23 | 0.967 | |
|
|
| 0.999 | 0.908 | W* | 0.967 | NA |
|
| 0.909 | 0.434 | W* | 0.758 | NA | |
|
| 0.952 | 0.587 | W* | 0.85 | NA | |
|
|
| 0.999 | 0.994 | W* | 0.994 | NA |
|
| 0.999 | 0.998 | W* | 0.901 | NA | |
|
| 0.999 | 0.996 | W* | 0.9454 | NA | |
|
|
| 0.997 | 0 | W$ | 0.792 | NA |
|
| 0.976 | 0 | W$ | 0.795 | NA | |
|
| 0.987 | 0 | W$ | 0.793 | NA | |
|
|
| 0.999 | 0.92 | 0.956 | 0.986 | NA |
|
| 0.993 | 0.916 | 0.762 | 0.982 | NA | |
|
| 0.996 | 0.918 | 0.848 | 0.984 | NA |
- —NIH10.13039/100000002
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Mapping and Diversity in Plants and Animals · Genetic Associations and Epidemiology · Genomic variations and chromosomal abnormalities
Introduction
1
The laboratory mouse has been the premier model organism for biomedical research, and the large number of phenotypically well‐characterized inbred strains has enabled genetic factors for important biomedical traits to be identified using murine genetic models [1, 2, 3]. However, mouse genetic discovery is critically dependent upon having a complete map of genetic variation among these strains. While SNPs and small indels have been extensively characterized in mouse strains [4, 5, 6], the lack of a comprehensive database of structural variation has limited our ability to fully analyze and interpret the mouse genome. Prior efforts to characterize murine structural variants (SVs) (i.e., genomic alterations >50 bp in size) have included only a few strains [7] and relied on short‐read sequencing (SRS) [8, 9], which has a limited ability to detect SVs in repetitive regions of the genome. Long‐read sequencing (LRS) platforms, which produce reads of length ≥20kb, have improved our ability to identify SVs, especially in repeat‐rich genomic regions [10, 11, 12]. LRS has doubled the estimated number of SVs in the human genome versus prior SRS estimates [10]. Our prior analysis of six strains with LRS revealed that: (i) SVs are very abundant (4.8 per gene), which indicates that they are likely to impact genetic traits; and (ii) as in human studies, the SVs previously identified using SRS [13] accounted for only 25% of those identified by our LRS analysis [14]. A recent analysis of SVs in 14 inbred strains used LRS [15]; but this analysis primarily reported only deletions and insertions. Sequencing and alignment artifacts, along with a heavy reliance on heuristics limit the ability of many existing programs to accurately identify additional types of SVs. We found that characterization of duplications or inversions was particularly problematic, even when murine LRS was analyzed [14]. To comprehensively characterize SVs across the mouse genome, we sequenced 40 inbred mouse strains using high‐accuracy PacBio HiFi long reads. Simulations were used to evaluate the performance of several SV detection methods for identifying different types and sizes of murine SVs. In addition to state‐of‐the‐art alignment [16, 17, 18] and assembly‐based [19] heuristic methods, we also evaluated a recently developed deep‐learning method (Cue [20]) for SV detection. Based upon these simulation results, we designed a custom pipeline to characterize a broader set of SVs from these 40 strains, which included deletions (DEL), insertions (INS), duplications (DUP), and inversions (INV) of varying size. This approach generated a comprehensive database of 573,191SVs among 40 inbred strains that includes a significant number of novel DUPs, INVs and large INS.
The utility of this SV database was demonstrated by identifying a genetic susceptibility factor for an unusual lymphoma that spontaneously appears in SJL mice [21, 22, 23]. These tumors originate in B cell germinal centers and are of interest because some of their features resemble those seen in one type of human non‐Hodgkins lymphoma [24]. SJL lymphomas contain activated T cells, which produce the cytokines required for lymphoma propagation in vitro [25]. One susceptibility factor was identified as an endogenous retrovirus (** Mtv29 **) in the SJL genome that encodes a tumor associated antigen (vSAg29) [26, 27], which stimulates a subset of CD4 T cells [28] to produce the cytokines required for lymphoma development [29]. However, a second genetic susceptibility factor also must contribute because: (i) a strain (MA/My) that expresses vSAg29 does not develop (or has a very low incidence of) lymphoma [30]; and (ii) analyses of SJL intercross progeny indicated that one autosomal dominant genetic factor is present in multiple other strains, which suppresses lymphoma development [31, 32]. Although this tumor suppressor had not been identified in the 55 years since this lymphoma was described in 1969, our AI mouse genetic discovery pipeline [33] identified this second genetic factor as an SJL‐unique SV that ablates a tumor suppressor.
Results
2
Genomic Sequencing and SNP/INDEL Identification
2.1
Genomic sequencing was performed using a PacBio Revio instrument with the HiFi system, which achieves a median read accuracy reaching 99.9% [34], to generate LRS from 40 inbred strains. A total of 3.54 TB of sequence was generated, with an average of 88.5 GB (30× coverage) per strain (Table S1). (The full names and abbreviations used for each strain in this paper are indicated in Table S1) The strains were selected because they are: (i) commonly used by the research community, (ii) well characterized across a range of phenotypes and (iii) have biomedical phenotypes of interest. Since they are commonly used in genetic models, we separately report on SNPs, INDELs and SVs present the 35 classic inbred strains and those present in all 39 sequenced strains, which includes four wild derived strains (CAST/EiJ, SPRET/EiJ, MOLF/EiJ, WSB/EiJ). Using DeepVariant (v1.6.1) [35], 70,051,144 SNP sites and 7,540,144 sites with insertions or deletions (INDELs) were identified in the 39 strains (vs the C57BL/6J reference genome GRCm39). Consistent with our previous finding that the four wild‐derived strains had patterns of genetic variation that were distinct from the 35 commonly used classical inbred strains [36], most of the minor alleles for the SNPs (64.7 M or 92%) and INDELs (6.8 M or 91%) segregated with the 4 wild‐derived strains, which also harbored most of the minor alleles that were present in only one strain. There were 21,331,225 SNP and 2,290,861 INDEL sites in the 35 classical inbred strains; the alleles in 5,286,543 SNPs and 696,031 INDELs were only present in the classic inbred strains; and 4.7 M SNPs and 0.52 M INDELs had alleles that were present in only a single classic inbred strain (Figure S1). The missing genotype rates for SNPs (0.005) and INDELs (0.03) were very low. To ensure that the strains were correctly labeled, we examined the output of our AI mouse genetic analysis pipeline [33] using the new SNP and INDEL databases. The previously identified causative genetic factors for four traits [14, 33, 36, 37] (two caused by SNP alleles and two were caused by INDELs) were identified by the AI using the new SNP and INDEL databases (Figures S2 and S3). Thus, the strains used to generate the SNP and INDEL alleles in the new database are correctly labeled. We compared the SNPs identified here using the PacBio HiFi LRS with the SNPs previously identified using short‐read sequencing (SRS) [5]. Among the SNPs in the 39 strains present in both datasets, 85% of the SNPs overlapped (Figure S4A). The 10% more SNPs identified here is consistent with the improved performance associated with the increased sequencing depth and the use of HiFi LRS methodology by this study. Moreover, the IGV for 200 randomly selected SNPs, which were among 9 M that were only identified by the LRS methodology, were examined. Only 4 of these 200 SNPs (2%) had minor issues that involved misclassification of genotypes (homozygous vs heterozygous) or had complex allelic patterns; and all of these were in SNPs present in wild‐derived strains (Figure S4B). Since the 69.9 M SNPs that were jointly identified by analyses using different sequencing methods can be considered as validated, and the LRS‐only SNPs have a 98% true positive rate; this SNP database has an overall true positive rate of 99.7%.
Assessing SV Identification Programs
2.2
Since SV analysis programs vary in their ability to identify different types of SVs [38, 39], we performed an extensive set of simulations to examine the ability of five programs (Cue [20], Sawfish (v0.10.0) [18], Sniffles2 (v2.3.2) [16], PBSV (v2.9.0) [17], and Dipcall (v0.3) [19]) to detect SVs that were artificially inserted into the mouse genomic sequence. We separately assessed their ability to identify small (<1 kb) and large (1‐100 kb) DELs, INSs, INVs and DUPs (Table 1). The simulation results were used to design the SV pipeline used to assemble this SV database (Figure 1). Whenever possible, we used a consensus‐based SV identification strategy. However, when only a single caller achieved high recall in the simulations (i.e., large INSs), or when low agreement was observed among the programs when the actual data was analyzed (i.e., INVs) (Figure 2A; Figure S5A), the consensus strategy was replaced with one where the individual predictions obtained from one or more high recall tools were selected if they were validated by another program (VaPoR [40]) or by visual inspection. The evaluation metrics and parameter thresholds (especially breakpoint stringency and sequence similarity) can influence the results obtained from SV detection tools [38]. Therefore, we performed simulations using three different parameter settings (default, stringent, relaxed) and evaluated SV caller performance for identification of different types and sizes of SVs (Tables S2 and S3). We found that the SV identification programs we selected were robust to variation in breakpoint matching criteria since only minor differences were observed when three different parameter configurations were used. Overall, our results indicate that DEL and INS can be reliably identified, but improved methods for identification and validation of DUPs and INVs are needed. Nevertheless, an increased number of SVs, which includes DUPs and INVs, were identified by this pipeline.
Summary of the pipeline used to analyze the genomic sequences of 40 inbred mouse strains to produce the SNP, INDEL, and SV databases and their content. Long‐read sequencing of 40 inbred strains was performed, and the C57BL/6J sequence was used as the reference sequence. Deepvariant was used to identify alleles for 70 M SNPs and 7.5 million insertions‐deletions in the 40 strains. SVs were separately analyzed based upon their size (small, 50 to 1000 bp; large, 1 to 100 KB) and type (DEL, INS, INV, DUP). The simulation results (Table 1) and the overlap among SVs identified by the programs were analyzed to produce the SV identification procedures shown in this figure. There were 210,926 small DELs, 47,984 large DELs, and 263,186 small INS that were jointly identified by two of the three analysis programs (Cue, Sniffles2, Sawfish). Since none of these programs could reliably identify large INS, the genomic sequences were individually assembled and Dipcall was used to identify 38,165 large INS that were verified using the VaPoR program. To identify small and large INVs, any INV identified by Cue, Sniffles2, or PBSV that was manually verified using the IGV program was reported. The 9,943 small and 872 large DUPs identified by Cue or PBSV were validated by the VaPoR program. The numbers within parenthesis indicate the number of each type of SV present in the 35 commonly used inbred strains.
Assessing the performance of different SV identification programs. A) SV identification programs have an acceptable level of agreement for identification of small (50–1000 bp) and large (1 kb–500 kb) deletions (DELs), and for small insertions (INSs). The results obtained from analysis of the genomic sequences all 39 inbred strains by each indicated program are shown in separate Venn diagrams. The number of SVs identified by an individual (or a combination of) programs are shown within the circles and the percentages are shown in parenthesis. In contrast, there was substantial divergence in the INVs identified by Cue, PBSV and Sniffles2. Because of this, the INVs identified by each program were verified by manual inspection using the IGV program. B) The VaPoR validation results for DUPs identified by Cue or PBSV. Most of the DUPs identified by Cue or PBSV could not be assessed by VaPoR (represented by NA). The DUPs that passed or failed the VaPoR validation are represented as positive or negative, respectively. We also examined whether DUPs were found within an identified INS: 54% of validated small DUPs were within an INS, but only 4.0% of validated large DUPs were within an INS.
Characterization of SVs
2.3
A total of 210,926 small (50 to 1000 bp) DEL SVs were identified in the 39 inbred strains (vs the C57BL/6J reference), and 80,195 of them were present in the 35 classical inbred strains (Figure 1). There were 47,984 and 21,666 large (1 to 100 KB) DELs identified in the 39 or 35 inbred strains, respectively (Figure 2A; Figure S5A). Like the SNP alleles, most small (61%) and many large (53%) DELs are only present in the four wild‐derived strains. There were 263,186 and 95,622 small INS SVs in the 39 or 35 inbred strains, respectively (Figure 2A; Figure S5A). Dipcall identified 38,165 or 23,379 VaPoR‐validated large INS in all 39 strains or in the 35 classical inbred strains, respectively. Consistent with the simulation results, 94% (38,165 out of 40,460) of the large INS identified by Dipcall were validated by VaPoR, and most large INS that could not be validated had sequence complexity that precluded VaPoR analysis (i.e., rated as not assessable). The 846 (or 275) small INVs and 1,269 (or 420) large INVs were validated by visual inspection using the integrated genome viewer (IGV). In addition, the 9,943 (or 5,420) small DUP and 872 (or 598) large DUP in all 39 (or 35 classical) inbred strains were also validated using VaPoR. Most of the small and large DUPs identified by the analysis programs could not be assessed by VaPoR (Figure 2B; Figure S5B). Among the 573,191 SVs identified, the number of INS (n = 301,351) and DEL (n = 258,910) was far greater than DUPs (n = 10,815) or INVs (2,115) (Figure 3A) in the 39 strains analyzed. SPRET, CAST and MOLF had the largest total number of SVs and of strain‐unique SVs (Figure 3B, C). Three features of this analysis were notable. (i) Most small DUPs (54%) were within INS, while only 4.0% of the large DUPs were within INS. (ii) Only a minority of the DUPs identified by Cue or PBSV could be validated by VaPoR (Figure 2B; Figure S5B); visual inspection confirmed that sequence complexity made it difficult to assess these SVs. (iii) Most SV alleles were shared by three or fewer inbred strains (Figure 3D,E; Figure S6) and were <1 kb in size (Figure S7).
Characteristics of the SV alleles among the inbred strains. A) The total number deletion (DEL), insertion (INS), inversion (INV) or duplication (DUP) SVs detected in all 39 strains are shown. B) The number of each type of SV detected in each of the 39 strains are shown. Each type of SV is indicated by the color shown in A. C) A magnified view showing the number of INVs and DUPs identified in each strain (from data in B). D) The number (left y‐axis) and percentage (right y‐axis) of strain‐unique SVs detected in each of the 39 inbred strains is shown. Three wild derived strains (CAST, SPRET, MOLF) have most of the strain‐unique SVs. E) The number of strains with a shared SV allele are shown for all 39 inbred strains. Each type of SV is indicated by the color shown in A. Most of the minor SV alleles are shared by 1–3 strains. The inset graph shows the number of SV alleles shared by 8 or more strains.
We also compared the SVs identified by our analysis of 39 inbred strains with those identified in a prior study that analyzed SVs in 14 inbred strains [15] (Figure 4A; Data S1). We identified 90,246 more DELs (53% increase) and 55,544 more INS (22% increase) than the prior study; and only 43% of the DELs and 35% of the INS were jointly identified by both studies. Given the increased number of strains that we analyzed, the increased number of DEL and INS identified by this study was expected. The non‐overlapping DEL and INS SVs result from the two wild‐derived strains (PWD/PhJ and PWK/PhJ) analyzed in [15], which were not among the four wild derived strains analyzed here. We also compared the results obtained for 9 strains that were commonly analyzed by both studies (Figure 4B; Data S1). The two studies had a high level of concordance (>85%) for DELs and INSs. However, our study identified 10‐ and 25‐fold more INVs and DUPs, respectively, than the prior study; and there was very little overlap in the INVs (2.4%) and DUPs (0.6%) identified by both studies (Figure 4A). Our use of computational programs, which were selected based upon their performance in simulation studies for identifying INVs and DUPs, accounts for the increased number of INVs and DUPs identified by this study.
Venn diagrams comparing the number of SVs identified in this study (Stanford) from analysis of 39 inbred strains with those identified by the Jackson Laboratory's (Jax) analysis of 14 inbred strains. A) The SV comparisons are separately categorized based upon whether they are a deletion (DEL), insertion (INS), inversion (INV), or duplication (DUP). The total number of each type of SV and the percentages are shown in parenthesis. B) Venn diagrams comparing the number of SVs identified in the 129S1 strain by this study (Stanford) and Jackson Laboratory (Jax) studies.
The variant effect predictor (VEP) program [41] was used to analyze SV impact (Table S4). Manual inspection of the high impact (protein coding) small INS (performed using the integrated genome viewer, IGV) revealed that 99.8% were true positives (2 false positives out of 801 analyzed), and 99.3% of the small DELs were true positives (5 false positive out of 705). Also, >99% of the large INS (1 false positive out of 518) and 98.6% of large DELs (4 false positives out of 293) were true positives (Table S5). To facilitate genetic discovery, the genes with high impact SVs present in the 35 classical inbred strains are provided for small (n = 705 in 654 unique genes) and large (n = 293 in 270 unique genes) DELs, for small (n = 801 in 765 unique genes) and large (n = 518 in 487 unique genes) INS, for all INVs, and for high impact DUPs (Data S2–S5). In summary, the 2,305 high impact SVs with alleles in the 35 classical inbred strains provides a set of highly curated genetic variants that could impact a substantial number of biomedical traits.
We experimentally validated a subset of INVs and DUPs in the database that had complex patterns of variation. The sequences surrounding 14 selected INVs (Table S6) were first examined by generating base‐to‐base alignments of the genomic sequences of strains containing INVs with the C57BL/6J GRCm39 reference sequence. A careful analysis of these alignments confirmed that the INVs were present in the predicted strains. The alignments also show that the INVs contained INS or DEL, which were confirmed by strain‐specific differences in the size PCR amplicons (Figure S8 and Table S7). The analysis of a representative INV is described in detail and the alignment of a 945 bp segment encompassing INV5 (inverted in TH, 129S1 and BTBR mice) and flanking sequence is shown (Figures S9 and S10). While their flanking sequences are well aligned, the INV5 sequence in C57BL/6J is in the opposite orientation of the TH, 129S1 and BTBR sequences. TH, 129S1 and BTBR also have a 200 bp insertion next to INV5, which is why their PCR amplicons are larger than the C57BL/6J amplicon (Figure S9). Sanger sequencing of PCR amplicons confirmed that INV5 was present in TH, 129S1 and BTBR mice (Figure S11). A similar analysis of 13 other INVs also confirmed that INVs were present in the predicted strains (Table S6). The sequences surrounding 8 DUPs (Tables S8 and S9) were also carefully examined by generating base‐to‐base alignments with the C57BL/6J GRCm39 reference sequence. A representative example for the segments encompassing DUP1 is shown (Figure S12). The C57BL/6J DUP1 sequence is duplicated in 129S1 and AJ, which is why they generate larger PCR amplicons (Figure S13). Sanger sequencing of these amplicons confirmed that this duplication was present in 129S1 and AJ (Figure S14). The strain‐specific differences in the size of PCR amplicons generated for 7 other DUPs are also shown (Figure S13). The base‐to‐base alignments, RT‐PCR amplifications and region‐specific DNA sequencing confirm that SVs with complex patterns of genetic variation in our database are valid.
Although we have only a limited ability to interpret the impact of SVs located within intergenic and noncoding regions, chromatin is compartmentalized into topologically associating domains (TADs), which are megabase‐sized genomic segments that are separated by boundary regions [42, 43]. TADs provide a regulatory scaffold for gene expression; they are linked with variation in gene expression because their structure facilitates enhancer‐promoter interactions; and they insulate regions from the effect of other regulators [44]. TADs are created by the binding of a DNA sequence‐specific transcription factor (CCCTC binding factor or CTCF) to its consensus binding element. A multi‐subunit protein (cohesin) then binds to CTCF to form the 3D loop‐like structures that alter gene transcription within a domain. Of note, we found 1,877 DELs that contain a CTCF recognition sequence (CCGCGNGGNGGCAG) among the 35 inbred strains (Data S6). To determine if these DELs could impact chromatin structure, CTCF ChIP‐Seq data from the ENCODE project [45] was examined to identify which of these CTCF recognition sequences were bound by CTCF. Among the 42 CTCF ChIP‐Seq datasets that used mouse tissues or cell lines, CTCF binding occurred at 712 of the CTCF recognition sequences that were contained within the 488 deletion alleles. Since CTCF binding is critical for TAD formation, it is likely these deletions affecting these 712 CTCF recognition sequences could significantly alter chromatin structure; and by this mechanism could affect gene expression patterns and genetic traits. To explore the genomic context of these DELs, we examined their relationship with transposable elements (TEs) across 35 classical inbred mouse strains. Our analysis revealed that 44.4% of the DELs overlapped with annotated TEs, which suggests that the mobile elements make a substantial contribution to structural variation within these genomes. Of note, we also compared our C57BL/6J LRS data with the GRCm39 reference sequence, and found there were only very subtle SNP, INDEL and SV differences (Table S10).
Identification of A Candidate Gene for the 2nd Lymphoma Susceptibility Factor
2.4
Based upon the hypothesis that SJL mice lack a tumor suppressor, the AI mouse genetic pipeline [33] was used to identify lymphoma‐associated (MeSH Term: D008223) genes with high impact SV alleles uniquely present in SJL mice (i.e., absent in the 34 other classic inbred strains). The AI identified a 1641 bp SJL‐unique deletion in high mobility group A1b (Hmga1b), which ablated the exon encoding the entire Hmga1b protein, as the candidate genetic factor (Figure 5). Hmga1b was the only gene with an SJL‐specific high impact DEL that was directly associated with lymphoma. HMGA family members are low molecular weight proteins that bind to AT‐rich regions in nuclear chromatin where they exert positive or negative effects on gene expression by enabling other transcription factors to bind at nearby sites [46, 47]. Two murine genes encode nearly identical HMGA1 proteins [48]. Hmga1b on chromosome 11 encodes a 107 amino acid protein. Hmga1 on chromosome 17 generates two predominant mRNAs that produce: a 96 amino acid protein (whose sequence is identical to Hmga1b except 11 amino acids are deleted; or a 107 amino acid protein whose sequence is identical to Hgma1b, and its mRNA arises by differential splicing (Figure 6A). Analysis of Hmga1 or Hmga1b mRNAs in SJL liver and spleen tissue by RT‐PCR amplification indicated that the SJL mRNAs are identical to those present in other strains. In contrast, the level of expression of Hmga1 (or Hmga1b)‐derived mRNAs in SJL thymus are greatly reduced relative to those in the thymus of other strains (Figure 6B). Because the 3’ UTRs of Hmga1‐ and *Hmga1b‐*encoded mRNAs have a sequence difference at a corresponding site, transcript sequencing was used to identify the gene(s) that encoded these mRNAs in different C57BL/6J tissues. The mRNAs in spleen, liver, bone marrow, kidney and lymph nodes were Hmga1 encoded, while the thymic mRNAs were Hmga1b encoded (Figure 6C; Figure S15). Hence, Hmga1b mRNA expression predominates in a tissue where lymphocyte development occurs. HMGA protein family members are strongly associated with leukemia and lymphoma in mice [49] and humans [47], which explains why reduced HGMA protein function in the SJL thymus contributes to lymphoma susceptibility (discussed below).
The AI pipeline identifies a genetic factor for lymphoma susceptibility in SJL mice. A) Top panel: Lymphoma was treated as a qualitative trait, which appears in SJL mice (incidence = 1), and not in the 34 other classical inbred strains (incidence = 0). Bottom panel: The AI pipeline performed a GWAS to identify haplotype blocks with allelic patterns that corresponded with lymphoma susceptibility in the 35 strains, and then selected those with haplotype blocks that contained a large deletion SV that was only present in SJL (i.e., genetic effect size = 1 and genetic association p‐value = 0). The 16 genes (indicated by gene symbol) meeting these criteria are shown. Within the haplotype box: each block color represents a haplotype for one strain, strains with the same haplotype have the same color, and the blocks are shown in the same strain order as in panel A. The chromosome and the starting and ending position of each haplotype block are also shown. The LitScore represents the strength of the association of each gene with lymphoma as determined by the AI‐mediated literature search (MeSH term: Lymphoma D008223). PubMed identification numbers are only provided for genes that have a direct link with the MeSH term. Otherwise, the AI indicates that the gene has an indirect association, which results from MeSH term relationships identified with other proteins that are associated with the gene candidate. The SV flag indicates the impact of the SV deletion as determined by VEP analysis: high impact, 2; and modifier, ‐1. Only Hmga1b has a high impact deletion and is directly associated with lymphoma. B) Among the 35 classical inbred strains, SJL mice uniquely have a 1641 bp deletion within Hmga1b. This large deletion (as visualized using the integrative genomics viewer) removes the Hmga1b exon that encodes the entire 107 amino acid protein.
Hmga1b is selectively expressed in the thymus. A) A diagram of the Hmga1 and Hmga1b encoded mRNAs. Hmga1b encodes a 1626 bp mRNA, which generates a 107 amino acid protein. Hmga1 generates two principal mRNAs: a 1045 bp mRNA that produces a 107 amino acid protein whose sequence is identical to that of Hmga1b; and a 1631 bp mRNA that produces a 96 amino acid protein whose sequence is identical to Hmga1b, except for the deletion of a 11 amino acid segment. The site of a sequence difference at corresponding sites in the 3’ UTR of the Hmga1 (A) Hmga1b (G) mRNAs is shown. The purple arrows indicate the primers used for PCR amplification of the bands shown in B, and black arrows are the primers used to generate the amplicons used for sequencing in C. B) RT‐PCR was performed on spleen, liver and thymic tissues obtained from C57BL/6J (B6J), TallyHo (TH) and SJL mice. The amplicons from the Hmga1 (96 amino acid protein) and the 107 amino acid Hmga1 (or Hmga1b)‐encoded proteins are indicated by arrows. The pattern of amplicon expression in liver and spleen is similar in all three strains. In contrast, thymic tissue primarily expresses the 107 amino acid protein, SJL thymus has a much lower level of expression this protein. C) The thymus only expresses Hmga1b mRNA. mRNA was prepared from spleen, liver and thymic tissue obtained from C57BL/6J mice. RT‐PCR amplicons from Hmga1 and Hmga1b mRNAs were prepared and sequenced. The 3’ UTR of Hmga1 mRNA has an A, while Hmga1b mRNA has a G at the boxed position. The results indicate that spleen and liver mRNAs are encoded by Hmga1, while thymic mRNAs are encoded by Hmga1b.
Discussion
3
This dataset represents the most comprehensive and carefully performed analysis of SVs in the mouse genome. High quality LRS with a high level of genome coverage for 40 inbred strains, and recently developed state of the art programs (Sawfish [18], Sniffles2 [16], Cue [20]) that were adapted for LRS analysis were used with earlier programs (PBSV, Dipcall) for SV identification. These programs were selected based upon simulation results that indicated that they performed optimally for identification of a certain type/size of SV. Large INS and all DUPs were validated by a separate program, and all INVs were individually validated by manual observation. Our results are consistent with prior observations that no single SV calling algorithm was optimal for detection of the different sizes and types of SVs, and that there can be a high level of divergence when the results of simulated and actual data are compared [39]. Our data also demonstrates that there is a critical need for improved methods for analysis of DUPs and INVs, which probably will require machine learning based programs. Nevertheless, since murine DUPs and INVs were particularly hard to identify, even when LRS was used [14]; the INV and DUP, along with the INS and DELs identified here could facilitate many genetic discoveries. DUPs and INVs are already known to contribute to Autism [50, 51] and to impact brain function [50]. Recent analyses have indicated that segmental duplications occupy ≈7% of the human genome [52]. One study limitation is that the currently used SV callers were designed and evaluated using human genomic sequences. Since the genomic sequences of the inbred strains differ from outbred human populations, the existing programs may need to be further refined for mouse genomes. Another study limitation is that tandem repeats (TRs) – a type of SV with multiple repeats of short DNA sequence motifs, which are associated with multiple human diseases [53, 54] – were not covered here; but were analyzed in our other paper [55]. Also, the increased variability between the genomic sequences of the wild‐derived and classical inbred stains could also limit the ability of alignment‐based programs for SV identification.
The discovery of a candidate second genetic susceptibility factor for SJL lymphoma was facilitated by this murine SV database and the AI genetic discovery pipeline. A novel bi‐genic model explains why B cell lymphomas uniquely develop in SJL mice, and this model is consistent with all available data about its pathogenesis. The first genetic factor is a protein (vSAg29) encoded by an endogenous retrovirus in SJL mice [26, 27] that stimulates CD4 T cells to produce cytokines [28] required for B cell lymphoma development [29]. The second genetic factor is an SJL‐unique SV generates an Hmga1b KO. This SJL‐unique genetic factor had been predicted by murine intercross experiments, which indicated that a genetic factor present in multiple other strains suppressed lymphoma development [31, 32]. The Hmga1b deletion SV allele is also present in 3 wild derived strains (CAST, SPRET, MOLF); and the endogenous retrovirus is present in another inbred strain (MA/My); but these strains do not spontaneously develop lymphomas. Hence, a unique combination of two genetic factors (vSAg29 and the Hmga1b deletion SV) is required to produce lymphomas in SJL mice. It will be important to experimentally validate the contribution of the Hmga1b deletion SV to the pathogenesis of lymphoma in SJL mice. Hopefully, the genetic architecture of other murine genetic models of important biomedical traits will be uncovered using this SV database.
How could a SJL SV allele that ablates Hmga1b promote lymphoma development? While Hmga1 and Hmga1b have virtually identical protein sequences, our data demonstrates that only Hmga1b mRNAs are expressed in the thymus. A series of in vitro studies demonstrated that Hmga1 represses the expression of the Recombination activating gene 2 protein (RAG2) endonuclease [56], which plays a key role in lymphocyte development; but increased RAG expression can cause DNA damage and an increased risk for oncologic transformation [57]. Hmga1 knockout (KO) mice have increased RAG2 activity in their spleens [56]. An Hmga1 KO altered T cell development, increased B cell development, and caused the mice to develop B cell lymphomas and other hematopoietic malignancies [58, 59]. SJL spleen and lymph node tissues have an increased number of germinal centers with IL‐21 producing T follicular helper (TfH) cells, an increased level of IL‐21 production; and SJL lymphoma development is IL‐21 dependent [29]. Hence, the HMGA tumor suppressor function is (at least partly) mediated through repression of RAG2; loss of this repressor function in the SJL thymus will alter T cell development and could induce DNA translocations in other tissues. The abnormalities in thymic T cell development along with vSAg29‐induced T cell proliferation explains why SJL (and Hmga1 KO) mice have an increase in IL‐21 producing TfH cells in their germinal centers, which leads to the development of a population of B cells with a high level of abnormal IgH rearrangements. By this mechanism, the activity of an expanded and abnormal population of TfH cells generates an IL‐21 dependent B cell lymphoma in SJL mice [56, 59].
Understanding the genetic architecture of SJL lymphoma susceptibility could provide new insight into the pathogenesis of certain types of human lymphomas. For example, SJL lymphomas have transcriptomic and phenotypic similarities with one type of non‐Hodgkin human lymphoma: angioimmunoblastic T‐cell lymphoma (AITL) [24, 29]. Like SJL lymphomas, AITL develops late in life, is driven by IL‐21‐producing TfH cells [60], and it sometimes develops into a B cell lymphoma [61]. Analogous to the role of vSAg29 and the Hmga1b deletion allele in driving lymphomas in SJL mice, Epstein Barr virus (EBV) is detected in 66%–86% of AITL patients [62, 63] and mutations in genes encoding epigenetic modifiers have frequently been detected in AITL patients [24]. Also, IL‐21 induces the expression of EBV latent membrane protein 1 (LMP1) in human B cells [64], which has been shown to provide survival signals for B cells [65] and can rescue transformed cells from apoptosis [66]. Given the similarities in their pathogenesis, additional studies on SJL lymphomas could provide new information about how infectious agents and host genetic factors jointly contribute to the pathogenesis of AITL, and possibly other types of lymphomas.
Limitations of the Study
3.1
Despite the use of deep learning and alignment‐based approaches to identify inversions and duplications, the low concordance among SV identified by the different methods highlights the need to develop more sophisticated algorithms for identifying these types of SV. We used VaPoR for validation of duplications, but many duplications occurring in structurally complex regions failed to validate, which provides additional evidence of the need to develop high‐throughput tools for confirmation of inversions and duplications. Finally, reliance on a single reference genome introduces alignment and reference bias, which makes it more difficult to identify novel insertions or strain‐specific sequence differences. Collectively, these limitations highlight the importance of using high‐fidelity long read sequencing platforms and multiple programs for SV detection and validation; and we need to develop reference sequences assemblies for multiple strains to capture the full spectrum of structural variation among inbred mouse strains.
Experimental Section
4
Animal Experiments
4.1
All animal experiments were performed according to protocols that were approved by the Stanford Institutional Animal Care and Use Committee (APLAC‐21574). All mice were obtained from Jackson Labs, and the results were reported according to the ARRIVE guidelines [67].
DNA Sequencing
4.2
Forty inbred strains (Table S1) were subject to LRS using the HiFi REVIO system (PacBio). For thirty strains, mouse liver was obtained from mice purchased from the Jackson Laboratory, snap‐frozen in liquid nitrogen and shipped on dry‐ice to the DNA Technologies Core of the Genome Center, University of California Davis were high molecular DNA purification and REVIO sequencing was performed. The genomic DNA for ten strains (the bottom 10 listed in Table S1) were kindly provided by Dr. Laura Reinholdt, Co‐Director of the Mutant Mouse Resource and Research Center at the Jackson Laboratory (Bar Harbor, ME); and REVIO sequencing was also performed at the UC Davis Genome Center.
Sequencing Data Quality Control
4.3
Our study utilized PacBio HiFi sequencing technology, which was known for its high accuracy in single‐molecule real‐time (SMRT) sequencing. The quality control metrics for our sequencing data include:
- Accuracy Yield: We achieved high‐quality bases (>QV20/99% accuracy) with a yield typically ranging between 75–105 Gb, averaging around 90 Gb on the Revio system.
- Fragment Size: The SMRTbell library was prepared with an optimal fragment size of 15–20 kb for HiFi whole‐genome sequencing.
- Read Length: The mean HiFi read length was between 15–20 kb.
- Read Quality: The median HiFi read quality was Q30, with a range from Q28 to Q32.
- Sequencing Control: We employed a 110 kb control sequence that was supplied by PacBio. Analysis of the data obtained indicated that we had control read lengths of ≈70 kb (ranging from 50–100 kb) and read counts exceeding 1000.
Identification of SNPs and INDELs
4.4
PacBio HiFi long read sequence (LRS) data were aligned to GRCm39 (mm39) reference genome using minimap2 (v2.28) [68] and pbmm2 (v1.13.1) to generate the BAM files. SNP and INDEL alleles were identified for each strain using the GPU‐based DeepVariant (v1.6.1) [35] and the alleles from all strains were merged using GLnexus (v1.4.1) [69].
Alignment, Variant Filtering and Quality Control for SNPs and INDELs
4.5
After alignment, SNPs and INDELs were identified using DeepVariant, they were consolidated with GLnexus. Then, a series of stringent filtering criteria were applied using BCFtools (v1.21) [70] to ensure the reliability of our variant calls Algorithm 1.
Algorithm 1 1.Sequence Mapping: We employed two aligners‐minimap2 and pbmm2‐to map the sequences. For example, for the 129S1 strain, the following commands were used:For minimap2:minimap2 –c –a ‐‐MD –x map‐hifi –t 96 /path1/mm39.fa /path2/129S1.fastq.gz > /path3/129S1.align.bamFor pbmm2:pbmm2 align /path1/mm39.fa /path2/129S1.fastq.gz /path3/129S1.align.sort.pbmm2.bam ‐‐preset HIFI ‐‐sort ‐‐rg ‘@RG∖tID:129S1∖tSM:129S1’ ‐j 64These parameters ensure high‐quality alignment by applying options that were optimal for PacBio HiFi reads.2.Variant Calling for SNPs/INDELs. We used the GPU version of DeepVariant with default options tailored for PacBio data. For instance, the DeepVariant command was executed as follows:docker run ‐‐name 129S1 ‐‐gpus ‘“device = 0”’ ‐v “/path1/input”:“/input” ‐v “/path2/output”:“/output” google/deepvariant:“1.6.1‐gpu” /run_deepvariant ‐‐model_type = PACBIO ‐‐ref = /input/path3/mm39.fa ‐‐reads = /input/minimap2/129S1.align.sort.bam ‐‐output_vcf = /output/129S1.raw.mini.vcf.gz ‐output_gvcf = /output/129S1.g.raw.mini.vcf.gz ‐‐intermediate_results_dir /output/129S1_intermediate_results_dir ‐‐num_shards = 32 ‐‐sample_name = 129S1And then:glnexus_cli ‐‐config DeepVariant *.g.raw.mini.vcf.gz > deepvariant.bcf ‐‐dir scratch ‐‐trim‐uncalled‐alleles ‐‐threads 963.Chromosomal Regions: Filtered to include only autosomal chromosomes (chr1 to chr19) and the X chromosome.4.Filter Status: Included variants where the FILTER field was marked as “PASS”.5.Quality Score: Excluded variants with a QUAL score below 30.6.Depth and Genotype Quality: Retained variants with a read depth (DP) of at least 10 and a genotype quality (GQ) of at least 30.7.Genotype Call: Selected variants with a homozygous alternate genotype (“1/1”).8.Variant Type: Filtered to retain only biallelic SNPs and INDELs.John Wiley & Sons, Ltd.
Simulations for Assessing SV Program Performance
4.6
We generated 8 synthetic genomes using insilicoSV [71], which inserted 1000 simulated SVs of each type and size into the GRCm39 reference sequence. For each synthetic genome, PacBio HiFi reads at 30× coverage using PBSIM3 (v3.0.4) [72] were simulated, and the simulated reads were aligned using minimap2.
Assessing SV Identification Programs
4.7
Since SV analysis programs vary in their ability to identify different types of SVs [38, 39, 73], we performed a set of targeted simulations to examine the ability of five programs (Cue [20], Sawfish (v0.10.0) [18], Sniffles2 (v2.3.2) [16], PBSV (v2.9.0) [17], and Dipcall (v0.3) [19]) to detect SVs that were artificially inserted into the mouse genomic sequence. We separately assessed their ability to identify small (<1 kb) and large (1‐100 kb) DELs, INSs, INVs and DUPs (Table 1). This strategy measures the upper bound on recall for each program and SV category to identify those that systematically miss specific types of SVs. Multiple methods had high recall (>90%) for small and large DELs (Cue, Sniffles2, Sawfish and PBSV), large DUPs (Cue, PBSV), large INVs (Cue, Sniffles2 and PBSV), and small INS (Sniffles2, Sawfish and PBSV). However, only Dipcall (which used individually assembled genomic sequences) could reliably detect large INS.
The simulation results were used to select the SV identification programs and database curation methods used to assemble the SV database (Figure 1). A recent analysis of human LRS data revealed that SVs found by only one program were more likely to be false positives [38]. Therefore, a consensus‐based approach was used for SV types where multiple programs exhibited high recall in the simulations, and where a high level of agreement was observed when the real data was analyzed by these programs. Also, whenever possible, we used a machine learning‐based method (Cue) with heuristic‐based methods (Sawfish, Sniffles2 and PBSV), since programs that use similar heuristics may jointly produce false positives. For example, 85% of the small DELs and 74% of the large DELs were commonly identified by Cue, Sawfish and Sniffles2, which indicates that these SVs were valid (Figure 2A; Figure S5A). However, if only a single caller achieved high recall in the simulations (i.e., large INSs), or when low agreement was observed among the programs when the actual data was analyzed (i.e., INVs) (Figure 2A; Figure S5A), the consensus strategy was replaced with one where the individual predictions obtained from one or more high recall tools were selected if they could be validated by a computational program or visual inspection. We used VaPoR [40], which was a program that autonomously validates SVs identified by analysis of LRS data, to identify the high‐quality SVs. For example, many of the validated INVs were only identified by two of the three analysis programs utilized, and most small and large DUPs were only identified by one program (Figure 2A; Figure S5A). The low level of overlap supports our strategy of using different methods but only incorporating SVs, which were validated by another method, into the database. We were aware that the VaPoR validation requirement was likely to eliminate some true positive SVs, but we accept this risk to ensure that only true positive SVs were included in the database.
The simulation results for Sniffles2 and Sawfish for inversions and small DUPs were withheld due to performance issues that became apparent while evaluating the results produced by those programs. First, we examined why the simulations for recognition of small DUPs by Sniffles2 and Sawfish produced low values for precision, recall and F1 (W^$^). By changing the parameters [typeignore = true] used by the Truvari (v4.2.2) [74] software to calculate the precision, recall and F1 values, we found that small DUPs were incorrectly labelled by these programs as INS. Although we used IGV images to visually validate the high impact INS, it was possible that some DUPs were labelled as INS. Second, we also found a bug that was specifically present in the early version of Sawfish that was used in this study interfered with the recognition of INVs (W*).
Identification of SVs
4.8
To identify small and large deletions, Cue and Sniffles2 were used to perform SV calling based on the minimap2 alignment, and Sawfish was used for SV calling based on the pbmm2 alignment. Then, SURVIVOR (v1.0.7) [75] was used to merge the results obtained by these three programs. The SVs commonly identified by at least two of these programs were included in the final dataset. Sniffles2 and Sawfish were used to identify small insertions; the small insertions jointly identify by both programs were merged using SURVIVOR and were incorporated into the final dataset. An assembly‐based program Dipcall was used to identify large insertions, and VaPoR was used for their validation. The large INS with a VaPoR statistic of GS≥0.15 and QS ≥ 0.1 were considered to be validated, and were incorporated into the final dataset. Cue, Sniffles2 and PBSV were used to identify INVs: the Cue and Sniffles2 analyses were based on minimap2 alignments, whereas the PBSV analysis was based on pbmm2 alignment. SURVIVOR was used to merge those identified by any of the three methods; these INVs were then inspected using the Integrative Genomics Viewer (IGV) (v2.17.4) [76] as described below, and visually‐validated INVs were incorporated into the dataset. DUPs were identified by PBSV or Cue, and then validated by VaPoR as described above. In addition, the starting and ending positions of the identified DUPs were examined to determine if they were located within an identified insertion. Functional annotation was performed using the Ensembl Variant Effect Predictor (VEP) (v112.0) [41].
SV Variant Filtering and Quality Control
4.9
Additional information about sequence quality, read support, genotype calls and quality control metrics used for SV identification.
- Sequence Mapping Quality. For Cue, Sniffles2, Sawfish, and Dipcall, we applied a filter of QUAL > 30 to ensure high mapping quality. Note that PBSV does not provide a corresponding parameter for this metric.
- Read Support for Each Variant. For Sniffles2, we required a minimum read support of SUPPORT ≥ 5 (where SUPPORT represents the number of reads supporting the SV).
- For PBSV, we applied a read depth filter of DP ≥ 10 (with DP indicating the read depth at the variant position)
- Genotype Quality. Both Sawfish and Sniffles2 required a genotype quality (GQ) of at least 30 to retain a variant call.
- Additional Filtering using BCFtools for all methods. We further refined the SV calls by applying the following criteria:
- Variants were restricted to the 19 autosomes and the X chromosome
- Only variants with FILTER = “PASS” were retained.
- Variants with an SV length (SVLEN) between 50 and 500,000 bp were included.
- Variants annotated as SVTYPE = “BND” were excluded.
- Genotype calls of “0/0” were removed.
Visual Inspection Using IGV
4.10
Visual inspection of high impact deletions and insertions, and all inversions was performed by examining IGV images. To do this, the IGV image script code was modified as in following Algorithm 2:
Algorithm 2 “newgenome mm39snapshotDirectory /pathway1/load /pathway2/name.identify.vcfload /pathway3/name.alignment.bamgoto chr*:start‐end positionscrollToToppreference SAM.COLOR_BY READ_STRANDpreference SAM.LINK_READS TRUEpreference SAM.LINK_TAG READNAMEpreference SAM.GROUP_OPTION ZMWpreference SAM.SHOW_MISMATCHES TRUEpreference SAM.MAX_VISIBLE_RANGE 500snapshot output.pngexit”John Wiley & Sons, Ltd.
We found that altering three of the six display settings was important. (i) The “preference SAM.COLOR_BY READ_STRAND” indicates that reads from different strands were displayed in different colors. (ii) The “preference SAM.GROUP_OPTION ZMW” specifies the ZMW group option, which was optimal for analysis of PacBio long read sequence. (iii) The “preference SAM.MAX_VISIBLE_RANGE 500” sets the maximum length of a displayed SV at 500 KB. The use of the alternate parameters improved the display of the inversions. Of note, we used split‐read alignments with strand flips when validating INVs by visual inspection, but this approach has limitations since it can only validate INVs where split‐read alignments were available.
Experimental Validation of SVs
4.11
To validate a selected set of INVs and DUPs in our SV database, their corresponding sequences in our LRS data were aligned to the C57BL/6J reference sequence (GRCm39). It was challenging to align the sequences containing INVs, since they were often in regions with large INS or DEL. GeneiousPrime (with or without the Minimap2 plugin) combined with BLAT analysis was used to align the sequences flanking the INVs, and the INVs within these regions were identified. PCR primers were designed using the fully aligned sequences, and all PCR primers were examined for uniqueness and for secondary structure. Then, the flanking sequences of the INVs were PCR amplified from the mouse genomic DNA obtained from strains with or without the INVs. C57BL/6J genomic DNA was included in the analyses as a control. The PCR amplicons were first analyzed by agarose gel electrophoresis and were then purified for Sanger‐sequencing (Mc Lab, South San Francisco, CA). For amplicons > 1 kb, internal primers were generated to enable sequencing across the entire amplicon. The PCR primers were provided in (Tables S7 and S9). Genomic DNAs were prepared from liver tissues (20‐35 mg) obtained from 129S1, AJ, AKR, BALB, BTBR, CBA, DBA1J, DBA2J, FVB, NOD, NZB, NZW, KK, SJL, SWR, TALLYHO (TH) and C57BL/6J mice using the Nanobind tissue kit from PacBio.
Predicting CTCF Binding Sites
4.12
The predicted CTCF binding sites in the genomic sequences of the inbred strains were identified using the CTCF (v0.99.11) [77] R package in Bioconductor with the JASPAR 2022 database and the recommended P‐value cutoff (1×10^−6^). We then examined the functional relevance of these predicted CTCF binding sites using ChIP‐Seq data for CTCF from the ENCODE project [45]. Specifically, we obtained 42 narrowPeak BED files with ChIP‐Seq experiments conducted on various mouse tissues or cell types, which had the assay type designated as “DNA binding,” the assay title as “TF ChIP‐seq,” and the assay target as “CTCF,” These datasets were aligned to the GRCm38 reference genome. To ensure consistency, we first converted our predicted CTCF binding site coordinates from GRCm39 to GRCm38 using the Lift Genome Annotations tool [78]. For the ChIP‐Seq data, we applied stringent filtering criteria, selecting peaks with a signal intensity of at least 50 and a ‐log_10_(q‐value) greater than 3 to ensure that high quality data was analyzed. We then utilized the BEDTools (v2.29.1) [79] suite to compare our predicted CTCF binding sites with the ChIP‐Seq peaks. A CTCF site was considered as bound if at least 50% of its length overlapped with a ChIP‐Seq peak. We then recorded the number of supporting peaks obtained from the 42 datasets and documented their respective sources.
Relationship Between DELs and TEs
4.13
To explore the relationship between DELs and transposable elements (TEs), we obtained TE annotations from the RepeatMasker track via the UCSC Table Browser (https://genome.ucsc.edu/cgi‐bin/hgTables) using the following settings: Genome: Mouse; Assembly: GRCm39/mm39; Group: Variation and Repeats; Track: RepeatMasker; Region: Genome; Output format: BED. The resulting file was downloaded with a user‐defined filename. DELs were extracted from VCF files corresponding to 35 classical inbred mouse strains, including their chromosomal starting and ending positions, and these subsequently converted to BED format. BEDTools (v2.29.1) was employed to assess overlaps between DELs and TEs. DELs exhibiting ≥60% overlap with TE regions were considered as TE associated.
SNP and SV Dataset Comparisons
4.14
To compare the SNPs identified here using the PacBio HiFi long‐read sequencing (LRS) with the previously identified SNPs, which were identified using short‐read sequencing (SRS) [5], we selected 39 strains that were found in both datasets. The SRS variant calls underwent the same variant filtering process as the LRS SV calls. Since the SRS results were aligned to the GRCm38 reference genome, the liftOver tool was used to convert their genomic coordinates to the GRCm39 reference. For comparison, we generated unique variant identifiers by concatenating the chromosome, position, reference allele, and alternative allele for each SNP. These variant IDs were then compared using the “intersect()” and “setdiff()” functions in R, which enabled us to identify the shared and unique variants in the two datasets.
We also globally compared the SVs identified in this study with those identified by the Jackson Laboratory's SV analysis and performed strain‐to‐strain comparisons of SVs using nine inbred strains that were shared between our study and the Jackson Laboratory's analysis [15]. For this analysis, deletions (DEL), insertions (INS), duplications (DUP), and inversions (INV) were separately analyzed. Comparisons were performed using a position tolerance‐based approach. SVs from different datasets were considered to represent the same variant if their starting and ending positions, and their lengths were less than 50 bp. If any of these conditions were not met, the SVs were considered as distinct. The IDs of the shared and unique SVs in the two datasets were provided in Zenodo (https://doi.org/10.5281/zenodo.15571541), with each ID formatted as “chromosome‐start_position‐SV_type‐SV_length” (e.g., “chr1‐3101582‐DEL‐357”).
C57BL/6J Sequence Comparison
4.15
We report C57BL/6J alleles for each variant (based upon the GRCm39 reference sequence), which were identified by comparison with the other 39 strains. However, we also conducted a variant calling analysis comparison of our C57BL/6J data to the C57BL/6J GRCm39 reference sequence produced by the Genome Reference Consortium. Even when a reference‐based approach was used, subtle differences can be identified when sequences from the same strain were compared. We found very subtle increases in the number of SNPs (11.7K/70M or 0.017% of the total), INDELs (9.8K/7.5M or 0.13%), were identified using our LRS data vs GRCm39; and a 1.4% increase in the number of insertions (>90% small INS) identified using our LRS data vs the GRCm39 (Table S10). These results were like those previously observed by others when new C57BL/6J was compared with the reference sequence [80]. We suspect that the increased number of insertions identified using our LRS data was due to improved sequencing technology (HiFi LRS vs the short read sequencing used to compile GRCm39). However, we can't definitively determine whether sequencing technology differences or genetic divergence between the C57BL/6J strains used by us and the Genome Reference Consortium. Since the GRCm39 was universally used by scientists, the sequence comparisons used to produce this database were performed relative to GRCm39.
Analysis of Hmga1b and Hmga1 mRNAs
4.16
Total RNA was purified from mouse thymus, spleen and liver obtained from SJL, C57BL/6J and TallyHo mice; and from kidney and hind leg bone marrow of SJL and C57BL/6J mice using the TRIzol (Thermo Fisher) reagent with the Direct‐zol RNA miniprep kit (Zymo Research). One half ug each of the total RNAs were reverse transcribed using the High‐Capacity cDNA Reverse Transcription kit (Applied Biosystems) in a 20 µl volume. One µl each of the cDNAs were then subject to PCR using the GoTaq G2 DNA polymerase master mix (Promega) and primers for Hmga1b, Hmga1 or Gapdh (control) transcripts. Hmga1b/Hmga1 transcript primers: Hmga‐F1: AGCGAGTCGGGCTCAAAGTC and Hmga‐R1: CGCCCTTATTCTTGCTTCCCTTT. Primers for either the 107aa transcript or the 96aa transcripts were: Hmg‐107aa‐F1, TGAGTCCTGGGACGGCGCT, Hmg‐96aa‐F, AGCAGCCTCCGAAAGAGCC, Hmga‐R, GAATGCTCCCAGGACCCTCTA. Primers for the Gapdh transcript: Gapdh‐F2: GTAGACAAAATGGTGAAGGTCGGT and Gapdh‐R1, GGTCCAGGGTTTCTTACTCCTTG. The PCR amplified cDNA generated using Hmg‐107aa‐F1 and Hmga‐R (the 107aa transcripts only) primers or the Hmg‐96aa‐F and Hmga‐R (the 96aa transcripts only) primers were gel purified and then subject to Sanger sequencing.
Analysis of Other Genes Identified by the Mouse Genetic AI Pipeline
4.17
In addition to Hmga1b, only three other genes with SJL‐specific large deletions were identified as having a potential direct association with lymphoma. (i) Il2ra was associated with lymphoma because Il2ra mRNA expression was a prognostic marker for Burkitt lymphoma, acute myelogenous leukemia, mycosis fungoides, diffuse large B cell lymphoma and other cancers; and IL‐2 was administered as an anti‐cancer agent. The absence of Il2ra could reduce T and NK cell mediated anti‐cancer immunity, which potentially could facilitate B cell lymphoma development. However, the Il2ra SV was not high impact (it labeled as a modifier and does not affect the coding sequence), and will not have a major effect on Il2ra expression. (ii) Although Pfkb3 was identified as having a direct association with lymphoma, the paper identified by the AI was a false positive association with lymphoma. (iii) Ccdc57 was identified by the AI because it was within the same haplotype block as Hmga1b; but Ccdc57 has no direct association with lymphoma. Ccdc57 was in the Human Protein Atlas, which contains lymphoma tissue. No other gene with SJL‐specific high impact large deletion was directly associated with lymphoma.
Statistical Analysis
4.18
Pre‐processing of variant data, including count aggregation and classification by variant type and size, was performed using custom scripts in R (v4.4.0). Counts and percentages were calculated directly from the curated variant datasets without additional transformation or normalization.
Author Contributions
Conceptualization, G.P.; methodology, W.R., F.Q., V.P. and G.P.; formal analysis, software, and validation, W.R., E.D., C.T.S., Z.C., V.P.; visualization, W.R.; writing – original draft, W.R. and G.P.; writing – review & editing, W.R. and G.P.; funding acquisition, V.P. and G.P.; supervision, G.P.
Conflicts of Interest
W.R., Z.F, Z.C., V.P. and G.P. declare no conflict of interest. E.D. and C.T.S. are employees and shareholders of Pacific Biosciences.
Supporting information
Supporting File 1: advs73783‐sup‐0001‐SuppMat.pdf.
Supporting File 2: advs73783‐sup‐0002‐SuppMat.xlsx.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1J. Wang , G. Liao , J. Usuka , and G. Peltz , “Computational Genetics: From Mouse to human?,” Trends in Genetics 21, no. 9 (2005): 526–532.16009447 10.1016/j.tig.2005.06.010 · doi ↗ · pubmed ↗
- 2M. Zheng , D. Dill , and G. Peltz , “A Better Prognosis for Genetic Association Studies in Mice,” Trends in Genetics 28, no. 2 (2012): 62–69.22118772 10.1016/j.tig.2011.10.006PMC 3268904 · doi ↗ · pubmed ↗
- 3Z. Fang and G. Peltz , “Twenty‐first Century Mouse Genetics Is Again at an Inflection Point,” Lab Animal 54, no. 1 (2025): 9–15.39592878 10.1038/s 41684-024-01491-3PMC 11695262 · doi ↗ · pubmed ↗
- 4G. Peltz , A. K. Zaas , M. Zheng , et al., “Next‐generation Computational Genetic Analysis: Multiple Complement Alleles Control Survival after Candida albicans Infection,” Infection and Immunity 79, no. 11 (2011): 4472.21875959 10.1128/IAI.05666-11PMC 3257944 · doi ↗ · pubmed ↗
- 5A. Arslan , Y. Guan , Z. Fang , et al., “High Throughput Computational Mouse Genetic Analysis,” bio Rxiv 2021.
- 6R. L. Ball , M. A. Bogue , H. Liang , et al., “Genome MU Ster Mouse Genetic Variation Service Enables Multitrait, Multipopulation Data Integration and Analysis,” Genome Research 34, no. 1 (2024): 145–159.38290977 10.1101/gr.278157.123PMC 10903950 · doi ↗ · pubmed ↗
- 7M. Mortazavi , Y. Ren , S. Saini , et al., “SN Ps, Short Tandem Repeats, and Structural Variants Are Responsible for Differential Gene Expression across C 57BL/6 and C 57BL/10 Substrains,” Cell Genomics 2, no. 3 (2022): 100102, 10.1016/j.xgen.2022.100102.35720252 PMC 9205302 · doi ↗ · pubmed ↗
- 8A. G. Doran , K. Wong , J. Flint , D. J. Adams , K. W. Hunter , and T. M. Keane , “Deep Genome Sequencing and Variation Analysis of 13 Inbred Mouse Strains Defines Candidate Phenotypic Alleles, Private Variation and Homozygous Truncating Mutations,” Genome Biology 17, no. 1 (2016): 167.27480531 10.1186/s 13059-016-1024-y PMC 4968449 · doi ↗ · pubmed ↗
