Exhaustive Exact String Matching: The Analysis of the Full Human Genome
Konstantinos F. Xylogiannopoulos

TL;DR
This paper introduces Ex2SM, a string-agnostic methodology that detects every repeated substring of up to 50 characters in the human genome without requiring a predefined search string, a requirement that limits existing exact-matching algorithms.
Contribution
The paper presents Ex2SM, a new pipeline that detects all repeated strings in biological sequences without prior string input, handling complex and large-scale data efficiently.
Findings
Detected all repeated strings up to 50 characters in the human genome.
Demonstrated the method's ability to handle exponential permutations.
Showcased the algorithm's superiority over existing methods in complexity and scope.
Abstract
Exact string matching has been a fundamental problem in computer science for decades because of its many practical applications. Some relate to common procedures, such as searching in files and text editors, and others, more recently, to advanced problems such as pattern detection in Artificial Intelligence and Bioinformatics. Dozens of algorithms and methodologies have been developed for pattern matching, and several programming languages, packages, applications and online systems can perform exact string matching in biological sequences. These techniques, however, are limited to searching for specific, predefined strings in a sequence. This paper presents a novel methodology (called Ex2SM), a pipeline of advanced data structures and algorithms explicitly designed for text mining, that can detect every possible repeated string in multivariate…
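To make the distinction in the abstract concrete, here is a minimal brute-force sketch of what "detecting every repeated string" means on a toy sequence, as opposed to searching for one predefined pattern. It is not the Ex2SM pipeline, whose data structures and algorithms are not described in this summary; the function name `repeated_substrings`, the `max_len` parameter, and the toy input are assumptions made for illustration only, and the approach would not scale to a full genome.

```python
from collections import defaultdict


def repeated_substrings(sequence, max_len):
    """Map every substring of length 1..max_len that occurs at least
    twice to the sorted list of its start positions.

    Illustrative brute force only (hypothetical helper, not Ex2SM):
    it stores every substring explicitly, so memory grows roughly
    with len(sequence) * max_len.
    """
    positions = defaultdict(list)
    n = len(sequence)
    for length in range(1, max_len + 1):
        for start in range(n - length + 1):
            positions[sequence[start:start + length]].append(start)
    # Keep only substrings seen at two or more positions.
    return {sub: pos for sub, pos in positions.items() if len(pos) >= 2}


# Toy DNA-like input; no search pattern is supplied in advance.
toy = "ACGTACGA"
repeats = repeated_substrings(toy, max_len=4)
for sub in sorted(repeats, key=lambda s: (len(s), s)):
    print(sub, repeats[sub])
```

Note that the caller never names a pattern: the repeats are discovered from the sequence itself, which is the property the abstract contrasts with conventional exact string matching.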
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms
