Orthologs from maxmer sequence context
Kun Gao, Jonathan Miller

TL;DR
This paper introduces a fast method that uses short-range maximal matches to identify orthologs across whole genomes; it recovers known orthologs with high sensitivity and specificity and identifies putatively novel ones.
Contribution
It presents a novel approach leveraging non-embedded maximal matches for ortholog detection, reducing computational time and improving ortholog identification accuracy.
Findings
Recapitulates most exact matches of traditional alignment methods.
Recovers high-confidence orthologs with high sensitivity and specificity.
Identifies putatively novel orthologs not found in existing databases.
Abstract
Context-dependent identification of orthologs customarily relies on conserved gene order or whole-genome sequence alignment. It is shown here that short-range context--as short as single maximal matches--also provides an effective means to identify orthologs within whole genomes. On pristine (un-repeatmasked) mammalian whole-genome assemblies we perform a genome "intersection" that in general consumes less than one thirtieth of the computation time required by commonly used methods for whole-genome alignment, and we extract "non-embedded maximal matches," maximal matches that are not embedded into other maximal matches, as potential orthologs. An ortholog identified via non-embedded maximal matches is analogous to a "positional ortholog" or a "primary ortholog" as defined in previous literature; such orthologs constitute homologs derived from the same direct ancestor whose ancestral…
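To illustrate the "non-embedded" filter described in the abstract, here is a minimal Python sketch. It is not the authors' implementation. It assumes each maximal match has already been found and is represented by a half-open interval `(start, end)` on one genome, and it assumes "embedded" means that the interval lies entirely inside the interval of a different match. The function name `non_embedded` and the interval representation are hypothetical.

```python
from typing import List, Tuple

# Assumption: a match is a half-open interval [start, end) on one genome.
Interval = Tuple[int, int]


def non_embedded(intervals: List[Interval]) -> List[Interval]:
    """Return the intervals that are not contained in any other interval."""
    # Drop exact duplicates, then sort by start ascending and end descending,
    # so any interval that could contain the current one is visited first.
    ordered = sorted(set(intervals), key=lambda iv: (iv[0], -iv[1]))

    kept: List[Interval] = []
    max_end = float("-inf")  # largest end coordinate seen so far
    for start, end in ordered:
        # Every earlier interval starts at or before `start`; if one of them
        # also ends at or after `end`, the current interval is embedded.
        if end > max_end:
            kept.append((start, end))
            max_end = end
    return kept


matches = [(0, 10), (2, 5), (8, 15), (9, 12), (0, 10)]
survivors = non_embedded(matches)
```

In this toy input, `(2, 5)` lies inside `(0, 10)` and `(9, 12)` lies inside `(8, 15)`, so only the two outer intervals survive. The sort-and-sweep takes O(n log n) time, which matters when the number of matches is at whole-genome scale.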
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
