Quasi-Linear-Time Algorithm for Longest Common Circular Factor
Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

TL;DR
This paper presents a novel algorithm that efficiently computes the longest common circular factor between two strings, extending the classic longest common factor problem with a cyclic shift consideration, in near-linear time.
Contribution
The paper introduces the LCCF problem and provides the first quasi-linear time algorithm to solve it, advancing string similarity measures.
Findings
LCCF can be computed in O(n log^5 n) time.
The algorithm extends classic string matching techniques.
LCCF serves as a new similarity measure for strings.
Mai Alzamel
Department of Informatics, King’s College London, London, UK
[mai.alzamel,maxime.crochemore,costas.iliopoulos]@kcl.ac.uk
Maxime Crochemore
Department of Informatics, King’s College London, London, UK
[mai.alzamel,maxime.crochemore,costas.iliopoulos]@kcl.ac.uk
Costas S. Iliopoulos
Department of Informatics, King’s College London, London, UK
[mai.alzamel,maxime.crochemore,costas.iliopoulos]@kcl.ac.uk
Tomasz Kociumaka
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Jakub Radoszewski
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Wojciech Rytter
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Juliusz Straszyński
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Tomasz Waleń
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Wiktor Zuba
Institute of Informatics, University of Warsaw, Warsaw, Poland
[kociumaka,jrad,rytter,jks,walen,w.zuba]@mimuw.edu.pl
Abstract
We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings S and T of length n, we are to compute the longest factor of S whose cyclic shift occurs as a factor of T. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in O(n log^5 n) time.
1 Introduction
We introduce a new variant of the Longest Common Factor (LCF) Problem, called the Longest Common Circular Factor (LCCF) Problem. In the LCCF problem, given two strings S and T, both of length n, we seek the longest factor of S whose cyclic shift occurs as a factor of T. The length of the LCCF is a new string similarity measure that is 2-approximated by the length of the LCF. We show that the exact value of the LCCF can be computed efficiently.
A linear-time solution to the LCF problem is one of the best-known applications of the suffix tree [2]. Just as the LCF problem was an extension of classical pattern matching, the LCCF problem can be seen as an extension of the circular pattern matching problem. The latter can still be solved in linear time using the suffix tree and admits a number of efficient solutions based on practical approaches [4, 9, 16, 20, 25, 28], also in the approximate variant [6, 7, 17, 19], as well as indexing variants [3, 20, 21], and the problem of detecting various circular patterns [26]. The LCCF problem is further related to the notion of unbalanced translocations [8, 10, 27, 29, 30].
One can formally state the problem in scope as follows.
Longest Common Circular Factor (LCCF)
Input: Two strings and of length each.
Output: A longest pair of factors, of and of , for which there exist strings and such that and ; we denote .
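To make the problem statement concrete, the following brute-force sketch tries factor lengths in decreasing order and tests every rotation of each candidate factor. It only illustrates the definition and is not the quasi-linear algorithm of this paper; the names `is_cyclic_shift` and `lccf_bruteforce` and the arguments `s`, `t` are hypothetical.

```python
def is_cyclic_shift(a: str, b: str) -> bool:
    """Return True if b is a cyclic shift (rotation) of a."""
    return len(a) == len(b) and b in a + a


def lccf_bruteforce(s: str, t: str) -> tuple[str, str]:
    """Return a longest pair (f, g) where f is a factor of s, g is a factor
    of t, and g is a cyclic shift of f. Returns ("", "") if none exists."""
    for length in range(min(len(s), len(t)), 0, -1):
        # All factors of t of the current length.
        factors_t = {t[j:j + length] for j in range(len(t) - length + 1)}
        for i in range(len(s) - length + 1):
            f = s[i:i + length]
            doubled = f + f
            # Every rotation of f is a length-`length` factor of f + f.
            for k in range(length):
                g = doubled[k:k + length]
                if g in factors_t:
                    return f, g
    return "", ""
```

For instance, on `aab` and `bba` the sketch returns the pair (`ab`, `ba`) of length 2, whereas the longest common factor of these two strings has length 1.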
Our main result is the following.
Theorem 1 (Main Result).
The Longest Common Circular Factor problem on two strings of length can be solved in time and space.
Our approach.
We apply techniques from the area of internal pattern matching (in case and are not highly periodic; Section 3) and Lyndon roots (otherwise; Section 4). The LCCF problem is reduced to finding configurations satisfying a conjunction of four conditions of type , where is the set of occurrences matching a factor .
Each configuration can be decomposed into two subconfigurations (pairs of consecutive fragments), one in and one in . We guarantee that the number of subconfigurations is nearly linear so that we can compute them all for both and . Then, the task reduces to finding two subconfigurations which agree (produce a full configuration) and constitute an optimal solution. This is done using geometric techniques in Section 6. Each condition can be seen as membership of a point in a range since form an interval in the suffix array. This gives a reduction of the LCCF problem to an intersection problem for 4D-rectangles. The latter task is solved efficiently using a sweep line algorithm.
2 Preliminaries
We consider strings over an integer alphabet . If is a string, then by we denote its length and by its characters. By we denote a fragment of between the th and th character, inclusively. We also denote this fragment by , and we define as well as . If , then is a prefix, and if , it is a suffix of . Fragments and are consecutive if ; we then also say that follows .
The string that corresponds to the fragment is a factor of . We say that two fragments match if the corresponding factors are the same. Let us note that a fragment can be represented by its endpoints in space; this representation can also be used to specify the corresponding factor.
By we denote the reversal of the string . By we denote the shortest period of . A string is called (weakly) periodic if its shortest period satisfies . Fine and Wilf’s periodicity lemma [15] asserts that if a string has periods and such that , then is also a period of .
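The notions of period and of the periodicity lemma can be checked on small inputs with the sketch below. It computes the shortest period with the standard prefix function and tests the lemma in its commonly stated weak form (periods p and q with p + q at most the string length imply that gcd(p, q) is a period); this weak form is an assumption made here and may differ from the exact condition used in the text. All function names are hypothetical.

```python
from math import gcd


def prefix_function(s: str) -> list[int]:
    """pi[i] = length of the longest proper border of s[:i + 1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def shortest_period(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for all valid i."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]


def is_period(s: str, p: int) -> bool:
    """Check directly whether p is a period of s."""
    return 1 <= p <= len(s) and all(s[i] == s[i + p] for i in range(len(s) - p))


def periodicity_lemma_holds(s: str) -> bool:
    """Verify the weak periodicity lemma for every pair of periods of s."""
    periods = [p for p in range(1, len(s) + 1) if is_period(s, p)]
    return all(
        is_period(s, gcd(p, q))
        for p in periods
        for q in periods
        if p + q <= len(s)
    )
```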
We define the type of a (non-empty) string as . We denote by the longest common circular factor of and such that , , , and . We also say that it is the type- LCCF. Our strategy is to compute independently for every pair satisfying , and afterwards we report the longest alternative (over all pairs ) and of the LCF of and (corresponding to or ) as the final result.
2.1 Synchronizing Functions
Let be a string of length . By we denote the set of fragments of of length and by we denote the set of fragments of of length that have a period . By and , we denote the subsets of comprising of fragments contained within a longer fragment of .
Definition 2 (Kociumaka [23, Definition 4.2.1]).
A function is called -synchronizing (see Fig. 1) if it satisfies the following conditions for each fragment :
- If , then ;
- If , then ;
- If two fragments match, and , then for the same . In other words, if and , then and .
The elements of for are called here basic fragments.
Example 3.
Consider a special case of a cube-free word . Let be the identifier of a -basic fragment of in the Dictionary of Basic Factors [22] and be a permutation of all (linearly many) -basic identifiers. Each identifier is an integer in the range . For a fragment of size , we could define as the first -basic fragment (from the left) with minimal . Then satisfies the conditions of a synchronizing function. If we take a random permutation , then it has other useful properties in expectation. This approach can also be derandomized and generalized to arbitrary texts, as shown in the following lemma.
For a function on fragments of length , by we denote one plus the number of positions such that . The following fact provides an efficient construction of a -synchronizing function with a small number of steps. It was presented as Lemma 4.4.9 in [23]; its randomized version originates from [24].
Lemma 4 ([23, Lemma 4.4.9]).
For a string of length and , in time one can construct a -synchronizing function (stored in an array) such that .
The set of -synchronizers of a fragment is .
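The minimizer idea of Example 3 can be sketched as follows. This is only an illustration under simplifying assumptions: identifiers come from a seeded shuffle of the distinct length-`tau` factors rather than from the Dictionary of Basic Factors, the window length `w` is a free parameter, and the derandomized construction of Lemma 4 is not reproduced. All names are hypothetical.

```python
import random


def make_ids(text: str, tau: int, seed: int = 0) -> dict[str, int]:
    """Rank each distinct length-tau factor under a pseudo-random permutation."""
    factors = sorted({text[i:i + tau] for i in range(len(text) - tau + 1)})
    rng = random.Random(seed)
    rng.shuffle(factors)
    return {factor: rank for rank, factor in enumerate(factors)}


def sync_offset(window: str, tau: int, ids: dict[str, int]) -> int:
    """Offset inside `window` of the leftmost length-tau factor of minimal id."""
    offsets = range(len(window) - tau + 1)
    return min(offsets, key=lambda k: (ids[window[k:k + tau]], k))


def synchronizer_positions(text: str, tau: int, w: int, ids: dict[str, int]) -> list[int]:
    """For each length-w window start i, the absolute start of its chosen fragment."""
    return [
        i + sync_offset(text[i:i + w], tau, ids)
        for i in range(len(text) - w + 1)
    ]
```

Because the choice depends only on the content of the window, two matching windows always select fragments at the same relative offset, which is the consistency condition in the definition above.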
3 Nonperiodic-Nonperiodic Case
We say that a string of type is highly periodic if . We consider now such that , , is of type , is of type , and neither nor is highly periodic. We call it the nonperiodic-nonperiodic case.
For a pair of fragments , by we denote a condition which states that is followed by a fragment that matches and by we denote a condition which states that follows a fragment that matches . We say that two pairs of consecutive fragments, in and in , agree if and only if
[TABLE]
We reduce the LCCF problem in this case to the following abstract problem; see Fig. 2.
Fragment-Families-Problem
Input: Two collections and of pairs of consecutive fragments of string of length , with
Output: and that agree and maximize
Let us define basic fragments called the left -window and the right -window:
[TABLE]
For a string we introduce the following set of -synchronizers:
[TABLE]
By we denote the singleton of the leftmost fragment in or an empty set if there is no such fragment. For fragments of and position , by
[TABLE]
we denote a pair of consecutive fragments of that are delimited by the starting positions of and and the index . We then define the set of “candidates” (see Fig. 3):
[TABLE]
Using this terminology, an informal scheme of the general algorithm is as follows:
Algorithm Compute-
1. Compute the sets
2. Find two pairs which agree and have maximum
3. return
Lemma 5 (Correctness for Nonperiodic-Nonperiodic Case).
The problem in the nonperiodic-nonperiodic case can be reduced to Fragment-Families-Problem for and .
Proof.
Take a pair of fragments of and of such that is an occurrence of a factor and is an occurrence of a factor such that is of type and is of type and none of them is highly periodic. Denote by and the consecutive fragments of corresponding to and , and similarly by and the consecutive fragments of corresponding to and , and let and . Let and be the leftmost fragments of and of length that are not highly periodic. Take a -synchronizer (it belongs to and starts at position of ), and -synchronizer (it belongs to , and, by the synchronization property, starts at position of ). Symmetrically, let and be the leftmost fragments of and of length that are not highly-periodic. The -synchronizers and belong to and , respectively, and start at the same position of and . This means that there exists a pair , such that and , and a pair , such that and , which agree as is followed by , which matches , is preceded by which matches , is followed by which matches and is preceded by which matches .
Conversely, for two pairs that agree there exists a factor in string matching and a factor matching in . Thus, there is a one-to-one correspondence between pairs that agree and fragments of strings of the right type that are cyclic shifts. Hence, by finding two pairs that agree and maximize we find a solution to problem. ∎
Lemma 6 (Complexity for Nonperiodic-Nonperiodic Case).
In the nonperiodic-nonperiodic case the LCCF problem can be reduced in time to instances of Fragment-Families-Problem with .
Proof.
We use the reduction of Lemma 5.
Claim 7.
For a string of length and integer ,
[TABLE]
Consequently, for any integers , .
Proof.
For a given , each -fragment can belong to only sets . Moreover, by Lemma 4. This yields the first part of the claim.
Finally, . ∎
By the claim, in every instance of Fragment-Families-Problem.
For each , we compute a -synchronizing function using Lemma 4. This takes time in total. The sets and (and, thus, ) can be computed for any in linear time using a sliding window approach. From them we can compute any set straight from definition. By the claim, over all the complexity is . ∎
4 Periodic-Periodic Case
We consider now such that , , is of type , is of type , and both and are highly periodic.
Recall that a Lyndon string is a string that is lexicographically smaller than all its proper suffixes. If is a weakly periodic string with the shortest period , then its Lyndon root is the Lyndon string that is a cyclic shift of . A Lyndon representation of is then such that where , ; see [12]. Lyndon strings have the following synchronization property that follows from the periodicity lemma: if is a Lyndon string, then it has exactly two occurrences in ; see [11].
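The Lyndon root and the synchronization property can be illustrated with the naive sketch below. It takes the least rotation of the length-p prefix, where p is the shortest period; for a weakly periodic string this prefix is primitive, so its least rotation is a Lyndon string. The quadratic-time helpers and their names are hypothetical and chosen for clarity, not efficiency.

```python
def shortest_period_naive(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for all valid i."""
    return next(
        p for p in range(1, len(s) + 1)
        if all(s[i] == s[i + p] for i in range(len(s) - p))
    )


def least_rotation(w: str) -> str:
    """Lexicographically smallest cyclic shift of w."""
    return min(w[k:] + w[:k] for k in range(len(w)))


def lyndon_root(s: str) -> str:
    """Lyndon root of a weakly periodic, non-empty string s."""
    p = shortest_period_naive(s)
    return least_rotation(s[:p])


def count_occurrences(pattern: str, text: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )
```

For example, `baabaabaab` has shortest period 3 and Lyndon root `aab`, and `aab` occurs exactly twice in `aabaab`.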
For a string , by and we denote the longest highly periodic prefix and suffix of . Let us start with the following simple observation; see Fig. 4.
Observation 8.
Let and be two strings for which the strings and have the same Lyndon root . Then the longest suffix of that is also a prefix of has length greater than .
For a fragment denote by the set of fragments in corresponding to the first/second/last occurrence of Lyndon root in and by the set of fragments in corresponding to the first/second to last/last occurrence of Lyndon root in . We can redefine (see Fig. 5)
[TABLE]
Lemma 9 (Correctness for Periodic-Periodic Case).
The problem in the periodic-periodic case can be reduced to Fragment-Families-Problem for the redefined sets and .
Proof.
Take a pair of fragments of and of such that is an occurrence of a factor and is an occurrence of a factor () such that is of type , is of type , and both and are highly periodic. Denote by and the consecutive fragments of corresponding to and , and similarly by and the consecutive fragments of corresponding to and , and let and .
Let and . Note that is a highly periodic suffix of and that has the same period as (a different period would contradict the periodicity lemma). Symmetrically, is a highly periodic prefix of and has the same period as . Let be the Lyndon root of , and observe that is also the Lyndon root of and . By Observation 8, we have
[TABLE]
as otherwise we would be able to find a Common Circular Factor of type that is longer by , thus contradicting our choice of and .
If , then the first occurrence of in is also the first or second occurrence of Lyndon root in . This is due to the synchronization property of Lyndon strings. Moreover, the first in is also the first occurrence of in . On the other hand, if , then the last occurrence of in is the last or the second to last occurrence of in , whereas the last occurrence of in is also the last occurrence of in . In either case, and contain a pair of corresponding occurrences of , which belong to and , respectively; see Fig. 6.
As the same reasoning can be applied to and , there exists a pair and a pair , which corresponds to our choice of occurrences of Lyndon roots. These pairs agree and , thus Fragment-Families-Problem will find a solution at least that good.
The converse direction is identical to the one from the proof of Lemma 5. ∎
We proceed with an efficient implementation. A run in string is a maximal weakly periodic fragment with a given period . We use 2-period queries which, given a weakly periodic fragment of a string, compute its shortest period and the run of the same period it belongs to. Such queries can be answered in time after -time preprocessing [24] (for a simplified solution, see [5]). Let us also recall that the Lyndon representation of a run can be computed in constant time after linear-time preprocessing [12].
Lemma 10 (Complexity for Periodic-Periodic Case).
In the periodic-periodic case the LCCF problem can be reduced in time to instances of Fragment-Families-Problem with .
Proof.
For any integers , we can in time answer which of the fragments , , , contain highly periodic prefixes/suffixes and find the Lyndon representation of the longest such prefixes/suffixes using the following claim.
Claim 11.
After -time preprocessing, and for a type- fragment of or as well as its Lyndon representation can be computed in time.
Proof.
To compute , it suffices to ask a 2-period query for (see [24, 12]), compute the Lyndon representation of the resulting run (see [12]), and then trim its Lyndon representation to . A symmetric solution works for . ∎
This allows us to compute the sets and . Finally, . ∎
5 Nonperiodic-Periodic Case
Finally we consider the case that such that , , is of type , is of type , and either or is highly periodic. This case can be reduced to Fragment-Families-Problem directly by combining the techniques of the previous two sections.
Lemma 12 (Correctness for Nonperiodic-Periodic Case).
The problem in the nonperiodic-periodic case can be reduced to Fragment-Families-Problem .
Proof.
In the proofs of Lemmas 5 and 9 the and parts of the factors were considered separately. Hence it is enough to define as
[TABLE]
and depending on which fragment is highly periodic consider one part for and the other for . ∎
Lemma 13 (Complexity for Nonperiodic-Periodic Case).
In the nonperiodic-periodic case the LCCF problem can be reduced in time to instances of Fragment-Families-Problem with .
Proof.
The families can be computed combining the methods from Lemmas 6 and 10, obtaining the desired complexities and sizes. ∎
6 Solution to Fragment-Families-Problem
In this section we show how to solve the Fragment-Families-Problem for a string of length by a reduction to intersecting special 4-dimensional rectangles.
First we give a geometric interpretation of two predicates:
- factor has an occurrence in starting at position (is a prefix of the suffix starting at position )
- and has an occurrence ending at position (is a suffix of a prefix ending at position )
to the membership of in a corresponding subinterval of .
Let us recall that the suffix array of string , is a permutation of such that for every . By let us denote the set of starting positions of occurrences of in . Our geometric interpretation is possible due to the following well known fact (see [11]).
Observation 14.
 is a set of consecutive elements in .
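Observation 14 can be seen in action with the sketch below, which builds a suffix array naively and locates the block of ranks whose suffixes start with a given factor by two binary searches. The quadratic construction, the 0-based positions, and the function names are assumptions made for illustration; the paper relies on linear-time constructions instead.

```python
def suffix_array(s: str) -> list[int]:
    """Starting positions (0-based) of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def occurrence_interval(s: str, sa: list[int], u: str) -> tuple[int, int]:
    """Half-open range [lo, hi) of ranks in sa whose suffixes have u as a prefix."""
    m = len(u)

    # First rank whose length-m prefix is >= u.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < u:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First rank whose length-m prefix is > u.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= u:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrences(s: str, u: str) -> list[int]:
    """Sorted starting positions of u in s, read off the suffix-array interval."""
    sa = suffix_array(s)
    lo, hi = occurrence_interval(s, sa, u)
    return sorted(sa[lo:hi])
```

The binary searches are valid because truncating the sorted suffixes to their first `len(u)` characters keeps them in non-decreasing order.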
Let be the set of ending positions of occurrences of in . We also use the , notation for fragments which means operations on corresponding factors.
Observation 15.
1. A fragment is a prefix/suffix of the suffix starting (prefix ending) at position iff , respectively.
2. and .
We define a -rectangle () as a Cartesian product of closed intervals, such that at least of them are singletons. E.g., is a 4-rectangle. In other words, a -rectangle is an isothetic hyperrectangle of dimension at most 2.
By and we denote the subintervals of that correspond to the intervals of in the suffix array and of in the (analogously defined) prefix array of , , respectively, as stated in Observation 14. ( is a permutation of such that for every .) For pairs and of consecutive fragments we denote:
[TABLE]
Now Observation 15.2 implies the following.
Observation 16.
Two pairs of consecutive fragments , agree iff .
Two -rectangles and are called compatible if, for each , or is a singleton. Let us note that the 4-rectangles in the above observation are compatible.
6.1 Intersecting 4D Rectangles
We consider two families of 4-rectangles with weights and wish to find a pair of intersecting rectangles, one per family, with maximum total weight. The general problem of finding such an intersection of two families of weighted hyperrectangles in dimensions can be solved in time by an adaptation of a classic approach [13]. Below we consider a special variant of the problem that has a much more efficient solution.
Max-Weight Intersection of Compatible Rectangles in 4D
Input: Two families and of 4-rectangles in with integer weights containing rectangles in total, such that each and are compatible
Output: Check if there is an intersecting pair of 4-rectangles and and, if so, compute the maximum total weight of such a pair
A very similar problem was considered as Problem 3 in [18] for an arbitrary . The sole difference is that the weight of an intersection of two -rectangles and in that problem was the maximum -norm of a point in . A solution to Problem 3 for in the case that the 4-rectangles are compatible working in time and space was given as [18, Lemma 5.8]. The algorithm presented in that lemma actually solves the Max-Weight Intersection of Compatible Rectangles in 4D problem and applies it for specific weight assignment of the 4-rectangles on the input. It uses hyperplane sweep and a variant of an interval stabbing problem. Henceforth we will use the following result.
Fact 17 ([18, see Lemma 5.8]).
Max-Weight Intersection of Compatible Rectangles in 4D can be solved in time and space.
6.2 Algorithm for Fragment-Families-Problem
Let us recall that the suffix tree of string , , is a compacted trie of all the suffixes of . It can be computed in time (see [14]) and reading the suffixes of in its preorder traversal yields the suffix array of . An efficient implementation of Observation 14 is known; see [1].
Lemma 18.
After -time preprocessing, for a given fragment of one can compute in time the sets and .
Proof.
It suffices to show how to compute . For every explicit node of we can compute the interval of elements of that are located in its subtree. This can be done in a bottom-up order in time.
A weighted ancestor query in , given a terminal node and positive integer , returns the ancestor of located at depth (being an explicit or implicit node). Such queries (for any tree of nodes with positive integer weights of edges) can be answered in time after -time preprocessing; see [1].
A weighted ancestor query can be used to, given a fragment of , compute the corresponding (explicit or implicit) node of . The interval stored in the nearest explicit descendant of equals . ∎
We are now ready to show a solution to Fragment-Families-Problem.
Lemma 19.
Fragment-Families-Problem can be solved in time and space.
Proof.
We construct families and of weighted 4-rectangles. For every , we add to with weight . For every , we add to with weight . By Observation 16, the solution to Max-Weight Intersection of Compatible Rectangles in 4D for and is the solution to .
Note that we have and . Using Lemma 18 and a linear-time algorithm for constructing and (and and ) [14], computation of 4-rectangles , can be done in time after -time preprocessing. This gives time in total. Finally, Max-Weight Intersection of Compatible Rectangles in 4D can be solved in time and space. ∎
As a consequence of all the previous Correctness and Complexity lemmas and the above lemma, we obtain the main result (Theorem 1).
7 Conclusions
We have presented an O(n log^5 n)-time algorithm for computing the Longest Common Circular Factor (LCCF) of two strings of length n. Let us recall that the Longest Common Factor (LCF) of two strings can be computed in linear time. We leave open the question of whether the LCCF problem can also be solved in linear time.
References
- [1] Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2):19, 2007. doi:10.1145/1240233.1240242.
- [2] Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, and S. Muthukrishnan. 40 years of suffix trees. Communications of the ACM, 59(4):66–73, 2016. doi:10.1145/2810036.
- [3] Tanver Athar, Carl Barton, Widmer Bland, Jia Gao, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Fast circular dictionary-matching algorithm. Mathematical Structures in Computer Science, 27(2):143–156, 2017. doi:10.1017/S0960129515000134.
- [4] Md. Aashikur Rahman Azim, Costas S. Iliopoulos, Mohammad Sohel Rahman, and M. Samiruzzaman. A fast and lightweight filter-based algorithm for circular pattern matching. In Pierre Baldi and Wei Wang, editors, 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2014, pages 621–622. ACM, 2014. doi:10.1145/2649387.2660804.
- [5] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The “runs” theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
- [6] Carl Barton, Costas S. Iliopoulos, and Solon P. Pissis. Fast algorithms for approximate circular string matching. Algorithms for Molecular Biology, 9:9, 2014. doi:10.1186/1748-7188-9-9.
- [7] Carl Barton, Costas S. Iliopoulos, and Solon P. Pissis. Average-case optimal approximate circular string matching. In Adrian-Horia Dediu, Enrico Formenti, Carlos Martín-Vide, and Bianca Truthe, editors, Language and Automata Theory and Applications, LATA 2015, volume 8977 of LNCS, pages 85–96. Springer, 2015. doi:10.1007/978-3-319-15579-1_6.
- [8] Domenico Cantone, Simone Faro, and Arianna Pavone. Sequence searching allowing for non-overlapping adjacent unbalanced translocations. arXiv preprint, 2018. arXiv:1812.00421.
