Exonize: a tool for finding and classifying exon duplications in annotated genomes
Marina Herrera Sarrias, Christopher W Wheat, Liam M Longo, Lars Arvestad

TL;DR
Exonize is a new tool that identifies and classifies duplicated exons in genomes to better understand exon evolution.
Contribution
Exonize introduces a graph-based computational method for detecting and analyzing exon duplications in annotated genomes.
Findings
Exonize detects full-exon duplications in at least 4% of vertebrate genes.
Over 900 human genes contain full-exon duplication events.
The tool identifies unannotated or degenerate exons by analyzing duplication events.
Abstract
The protein-coding regions of eukaryotic genes are fragmented into exons that, like the genes within which they are situated, can be duplicated, deleted, or reorganized. Cataloging and organizing within-gene exon similarities is necessary for a systematic study of exon evolution and its consequences. To facilitate the study of exon duplications, we present Exonize, a computational tool that identifies and classifies coding exon duplications in annotated genomes. Exonize implements a graph-based framework to handle clusters of related exons resulting from repeated rounds of exon duplication. The interdependence between duplicated exons or groups of exons across transcripts is classified. By identifying duplication events between exonic and intronic regions, Exonize can detect unannotated or degenerate exons. To aid in data parsing and downstream analysis, the Python module…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1|
|
- —Swedish Research Council10.13039/501100004359
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics
Introduction
Sequence repetition is a key driver of protein evolution, from fragment duplication in the structural evolution of symmetric protein folds (Eck and Dayhoff 1966) to gene duplication as a precursor to functional divergence (Ohno 1970). Upon the emergence of exons, which can include protein-coding regions (Aspden et al. 2023), a new form of evolution by repetition became possible: exon duplication (Letunic et al. 2002, Ivanov and Pervouchine 2022). Previous studies on exon duplication dynamics (Letunic et al. 2002, Ivanov and Pervouchine 2022, Martinez Gomez et al. 2021, 2022) have established exon duplication as a common evolutionary process in eukaryotes. However, at present, there is no accessible and user-friendly tool for identifying exon duplications. Although alignment tools such as Exonerate (Slater and Birney 2005) and GMAP (Wu and Watanabe 2005) are widely used for modeling gene structure, they do not provide dedicated functionality for detecting and classifying exon duplication events. If the evolutionary consequences of exon duplications are to be fully understood, tools to identify and classify exon duplication events must be readily available. To this end, we introduce Exonize, a tool for finding and classifying coding exon duplications in annotated genomes.
The design of Exonize centers on three goals: Foremost, Exonize identifies and groups duplicated exons within protein-coding regions, i.e. protein-coding exons, hereafter referred to as exons for brevity. Next, Exonize classifies the relationship between duplication events with respect to exonic and intronic boundaries within a gene, potentially identifying unannotated or degenerate (pseudo) exons. Finally, Exonize classifies the interdependence between duplicates or groups of duplicates across transcripts. Exonize facilitates comparisons between multiple species and allows a systematic assessment of the influence of repeat masking.
Following a detailed description of the implementation of Exonize, we demonstrate that it detects nearly all exon duplication events in a simulated dataset and most of the events in the manually curated dataset presented in (Martinez Gomez et al. 2021). Application of Exonize to 20 eukaryotic genomes reveals that 4%–7% of vertebrate genes contain at least one full coding exon duplication and identifies hundreds of putative exon duplicates within each genome.
Overview
Exonize requires a genome assembly in FASTA format and corresponding gene annotations in GFF or GTF format. Only protein-coding exons, regions annotated as coding sequences (CDS), are considered in the search. As annotated exon boundaries can overlap or be nested, a set of representative exons is chosen for each gene. Since exons can have sequence similarity unrelated to an exon duplication process, as in repetitive protein folds, length similarity is required.
We define a full-exon duplication as a protein-coding exon that has been duplicated in its entirety within the boundaries of its parent gene. Exonize detects full-exon duplications using global and local alignment methods. The global approach identifies alignments between representative exons of similar length using Muscle5 (Edgar 2022) and is subject to amino acid identity and coverage cutoffs. The local approach complements the global search by querying representative exons against the entire gene, including introns, using tblastx (Camacho et al. 2009) and is subject to E-value and coverage cutoffs. Regions of alignment found by both methods are referred to as matches. Users have the option to restrict the search to a single method or combine results from both methods. An overview of Exonize is provided in Fig. 1A.
(a) The Exonize pipeline, with default parameters and search criteria indicated in dotted boxes. The database icon denotes data available in the output database. (b) Illustration of the match classification system. Note that the query and the target genes are the same gene. Boxes indicate exons, and shading indicates matches. Self-hits are not shown for clarity. (c) A directed match graph for a gene with two exon duplications (two expansions). (d) Transcript interdependence classification for a gene with two transcripts (mRNA1 and mRNA2) and two tandem exon duplicates, e1 and e2. I3 represents a partial duplication event, and I4 illustrates an intronic match and is thus excluded from interdependence classification.
Representative exon selection
A gene g is defined here as a collection of transcripts consisting of a set of exons defined as closed intervals of the form [a, b], where a and b indicate the start and end genomic coordinates, respectively. Accounting for all exon coordinates, however, can result in many near-equivalent duplication events due to overlapping coordinates between exons. Thus, Exonize restricts the search to a set of representative exons . Each representative exon in must be longer than the nucleotide length threshold l and no pair of representative exons overlap more than the fractional overlap cutoff . The length criterion l is intended to exclude false positives from short, repetitive regions. Setting restricts to entirely non-overlapping exons. Setting admits all exons longer than l to .
To generate , the set of exons from all transcripts in g are clustered according to Supplementary Algorithm S1. The fractional overlap between exons is calculated relative to the length of the longer exon (Definition 1). For each cluster of overlapping exons, the shortest exon is taken to be the representative, maximizing alignment coverage during the search step. To ensure identical cluster formation between runs, intervals are sorted first by the lower boundary and then by the upper boundary.
Definition 1.For any pair of closed intervals , where refers to the set of closed intervals in the natural domain, the minimum interval overlap ratio is defined as:
where is the length of the interval.
Finding putative exon duplications
Exonize queries sequence similarity between representative exons and the genes within which they are situated using tblastx (Camacho et al. 2009). Accepted regions of sequence similarity from local searching, or matches, must satisfy an E-value cutoff , taken from the pairwise search, and a query coverage cutoff , where denotes the minimum fraction of the query exon (a representative exon) that must be aligned. Matches where the fractional overlap of the query exon to itself is greater than or equal to , the self-hit threshold, are considered self-matches and are ignored. The query coverage cutoff determines how a match is classified (Fig. 1B): A match is considered full-length if the query exon overlaps at least a fraction of both its length and the length of an annotated exon (not necessarily a representative exon). A partial match occurs when either the query exon matches within an annotated exon or completely encompasses an annotated exon. In either case, is only satisfied for either the query exon or the annotated exon. An inter-boundary match occurs when the alignment extends beyond of the length of the query exon but less than of the annotated exon length and encompasses both exonic and intronic regions. Finally, an intronic match occurs when a query exon aligns to an intronic region by more than .
Lower values of yield more matches classified as full-length, whereas higher values of yield more matches classified as partial. Based on the above criteria, full-length matches (which correspond to putative full-exon duplication events) are always reciprocal, whereas all other match types are non-reciprocal. As with representative exons, overlaps between matches are possible.
Reconciliation of target intervals
Let M represent the set of matches associated with g, where each match is a pair of closed intervals of the form , with and corresponding to the query and target genomic intervals, respectively. To address overlaps between distinct targets within the same genomic regions, a reconciled target of the target interval is introduced. The reconciled intervals are found by clustering the target intervals using Supplementary Algorithm S1, subject to an overlap threshold , and selecting a representative interval from the cluster. Clusters composed of full-length matches are reconciled against the annotated exon coordinates that map to the target coordinates. Other clusters are reconciled against the target coordinates of the most significant match within the cluster. To update the target reading frame, the three translated target sequences are aligned pairwise to the translated query exon sequence (for which the reading frame is known) using Muscle5 (Edgar 2022). The reading frame with the highest alignment identity is selected. Reconciled matches may overlap.
Cases of full-exon duplications
The global search is restricted to identifying full-length exon duplication events. In this approach, pairs of representative exons are aligned using Muscle5 (Edgar 2022) if the ratio of the shorter exon to the longer exon is greater than . Alignments are performed on both DNA and protein sequences. Matches are retained in M if the protein alignment identity satisfies the cutoff and the fraction of aligned positions satisfies the cutoff . The global alignment approach allows for more sensitive detection of full-length exon duplications that may have been missed by the local search. In the event that a full-exon duplication event is classified as partial by the local search but satisfies the global requirements here, it is taken to be a full-length duplication event.
Identification of exon families with match graphs
Within a gene, multiple rounds of exon duplication can result in families of related exons, which Exonize handles using a graph-based approach. Let be a directed graph encoding the union of global and local matches M associated with g. The set of vertices V consists of the intervals of the representative exons and reconciled targets. The set of ordered edges corresponds to the matches themselves. For local matches, the natural notion of order in derives from the query-to-target directed relationship. Matches detected through the global search are considered reciprocal and are therefore represented with bidirectional edges. Two distinct vertices are said to be connected if there exists a match directed from the query to the target . To resolve reciprocal matches, we consider the undirected version of D. In this case, G may form a single connected component or be composed of multiple connected subgraphs. We refer to these subgraphs as expansions because they arise from repeated exon duplication events. The count of expansions n estimates the number of distinct exons in g that have been duplicated, and the size of each expansion, denoted as k, relates to the number of duplication events within an expansion, taken to be (i.e. assuming that no multi-exon duplication occurred). Additionally, we define a full-length expansion as an expansion composed only of full-length matches. Figure 1C illustrates the match graph for a gene with two expansions.
Tandemness of exon duplication events
A pair of exons is said to be tandem if they are adjacent along the gene sequence; that is, if no complete exon lies between them. Similarly, the events within a full-length expansion are considered tandem if they involve exons that form a sequence of adjacent exon duplications. We assess the adjacency of the events within a full-length expansion by first sorting the events based on the start and end coordinates. We then construct a sequence of predecessor-successor pairs and categorize them based on whether they are tandem or not. This is done by clustering the set of coding exons associated with the gene using Supplementary Algorithm S1 with an overlapping cutoff of 0 and verifying whether the predecessor and successor coordinates in each pair belong to consecutive clusters.
Exon duplication interdependence across transcripts
A duplication event can be classified based on its interdependence across the transcripts of g (Fig. 1D). The obligate scenario occurs when both duplicates are found within all transcripts; the exclusive case occurs when no transcript contains both duplicates; the flexible case occurs when each transcript includes at least one duplicate but without showing an exclusive or obligate pattern. Lastly, duplicates are classified as optional when there is at least one transcript that lacks both duplicates. Optional cases can be further subdivided into obligate, exclusive, or flexible depending on the relationships between the duplicates in the transcripts in which they are present. This classification scheme extends to groups of duplicates as well, as in the case of full-length expansions.
Configuration
Program usage and available options are detailed in the user manual. Exonize outputs an SQLite3 database, which provides type-safe and efficient post-processing. The structure of the database is illustrated in Supplementary Figure S1. In addition, users have the option to export a simplified version of the output in CSV format.
Implementation and computational aspects
Exonize is an open-source command-line tool available for installation via the Python Package Index (PyPI.org). The source code, the user manual, a sample dataset, and a tutorial on how to use the exonize_analysis module are all accessible on GitHub. To run Exonize, local installations of BLAST+ (Camacho et al. 2009), Muscle5 (Edgar 2022), and SQLite are required. Parallel processing is used to speed up the search and reconciliation steps.
Validation
The performance of Exonize was validated using both simulated data and a manually curated dataset of exon duplications in the human genome (GENCODE version 33) (Martinez Gomez et al. 2021). Simulated exon duplication events were inserted into the human Y chromosome (Supplementary Section 3.1). Each simulated dataset included 445 full-length exon duplication events and 420 intronic events. Five datasets were constructed to cover a range of evolutionary distances, from 0.2 to 2.0 expected mutations per site. For full-length duplications, Exonize recovered 97% of events, even at high evolutionary distances. The majority of intronic events, 84% in total, were also recovered. The missed intronic events were mainly due to the performance of BLAST+ (Camacho et al. 2009) (see Supplementary Figure S2). Among the 174 full-length exon duplication events in the human genome proposed in (Martinez Gomez et al. 2021), Exonize successfully recovered 64% of events in total (Supplementary Figure S3) and 98% of events that were in principle detectable based on the amino acid identity cutoff used. Although a less stringent search recovered 86% of events, it was accompanied by a large increase in likely false positives. A sensitivity analysis of the local search performed on the human genome (GRCh38.p14) using different parameter settings can be found in Supplementary Table S1.
Results
Application of Exonize to 20 unmasked eukaryote genomes (Table 1) obtained from the Ensembl database (release 114) (Dyer et al. 2025) reveals full-exon duplication events within 4%–7% of vertebrate genes. Significantly more genes were identified as having partial, inter-boundary, and intronic exon duplication events. Lower eukaryotes, on the other hand, generally had fewer duplication events, and no full-exon duplication events were detected in yeast.
Conclusion
We introduce Exonize, a tool for identifying and classifying exon duplications. Using a simple heuristic approach, Exonize can detect and handle redundant matches that arise from reciprocal and overlapping duplications. This tool facilitates systematic studies of exon duplications in gene and protein evolution.
Supplementary Material
vbaf177_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Aspden JL , Wallace EWJ, Whiffin N. Not all exons are protein coding: addressing a common misconception. Cell Genom 2023;3:100296.37082142 10.1016/j.xgen.2023.100296 PMC 10112331 · doi ↗ · pubmed ↗
- 2Camacho C , Coulouris G, Avagyan V et al BLAST+: architecture and applications. BMC Bioinform 2009;10:421–9.10.1186/1471-2105-10-421PMC 280385720003500 · doi ↗ · pubmed ↗
- 3Dyer SC , Austine-Orimoloye O, Azov AG et al Ensembl 2025. Nucleic Acids Res 2025;53:D 948–D 957.39656687 10.1093/nar/gkae 1071 PMC 11701638 · doi ↗ · pubmed ↗
- 4Eck RV , Dayhoff MO. Evolution of the structure of ferredoxin based on living relics of primitive amino acid sequences. Science 1966;152:363–6.17775169 10.1126/science.152.3720.363 · doi ↗ · pubmed ↗
- 5Edgar RC. Muscle 5: high-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun 2022;13:6968.36379955 10.1038/s 41467-022-34630-w PMC 9664440 · doi ↗ · pubmed ↗
- 6Ivanov TM , Pervouchine DD. Tandem exon duplications expanding the alternative splicing repertoire. Acta Naturae 2022;14:73–81.35441045 10.32607/actanaturae.11583 PMC 9013439 · doi ↗ · pubmed ↗
- 7Letunic I , Copley RR, Bork P. Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet 2002;11:1561–7.12045209 10.1093/hmg/11.13.1561 · doi ↗ · pubmed ↗
- 8Martinez Gomez L , Pozo F, Walsh TA et al The clinical importance of tandem exon duplication-derived substitutions. Nucleic Acids Res 2021;49:8232–46.34302486 10.1093/nar/gkab 623PMC 8373072 · doi ↗ · pubmed ↗
