Locating regions in a sequence under density constraints
Benjamin A. Burton, Mathias Hiron

TL;DR
This paper introduces efficient algorithms for locating sequence regions with density constraints, significantly improving speed and memory usage over previous methods, enabling analysis of much longer biological sequences.
Contribution
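The authors develop the first linear-time algorithm for finding the longest substring of a 0/1 string whose proportion of 1s lies between given lower and upper bounds, along with O(n log log n) algorithms for related problems (the shortest such substring, or a maximal set of disjoint substrings). Both improve on the previous best-known O(n log n) solutions.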
Findings
The longest-substring algorithm runs in O(n) time, and the algorithms for the related problems run in O(n log log n) time; both improve on the previous best-known O(n log n) bounds.
In practical tests the new algorithms are faster than existing solutions, use less memory, and can process longer sequences.
Abstract
Several biological problems require the identification of regions in a sequence where some feature occurs within a target density range: examples include the location of GC-rich regions, identification of CpG islands, and sequence matching. Mathematically, this corresponds to searching a string of 0s and 1s for a substring whose relative proportion of 1s lies between given lower and upper bounds. We consider the algorithmic problem of locating the longest such substring, as well as other related problems (such as finding the shortest substring or a maximal set of disjoint substrings). For locating the longest such substring, we develop an algorithm that runs in O(n) time, improving upon the previous best-known O(n log n) result. For the related problems we develop O(n log log n) algorithms, again improving upon the best-known O(n log n) results. Practical testing verifies that our new…
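To make the problem statement concrete, here is a minimal brute-force sketch of the longest-substring problem. It is not the paper's O(n) algorithm: it is a quadratic-time reference built on prefix sums, useful mainly for checking a faster implementation on small inputs. The function name `longest_in_density_range` and its parameters are hypothetical, chosen only for this illustration, and the bounds are assumed to be inclusive.

```python
from fractions import Fraction
from itertools import accumulate


def longest_in_density_range(s, lo, hi):
    """Brute-force reference (O(n^2)), NOT the paper's linear-time method.

    s  : string of '0' and '1' characters
    lo : lower bound on the proportion of 1s (inclusive, an assumption)
    hi : upper bound on the proportion of 1s (inclusive, an assumption)

    Returns (start, end) so that s[start:end] is a longest substring whose
    proportion of 1s lies in [lo, hi], or None if no substring qualifies.
    """
    bits = [int(c) for c in s]
    n = len(bits)
    # prefix[i] = number of 1s in s[:i], so any substring count is O(1).
    prefix = [0] + list(accumulate(bits))
    # Try lengths from longest to shortest; the first hit is a longest one.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            ones = prefix[start + length] - prefix[start]
            # Compare ones/length to the bounds without dividing.
            if lo * length <= ones <= hi * length:
                return (start, start + length)
    return None


# Exact rational bounds avoid floating-point edge cases at the boundaries.
result = longest_in_density_range("11010001", Fraction(3, 5), Fraction(4, 5))
print(result)  # (0, 5): "11010" has 3 ones out of 5, a density of 0.6
```

Passing the bounds as `Fraction` values keeps the boundary comparisons exact; plain floats also work but may misclassify substrings whose density sits exactly on a bound.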
