Note on the Infiniteness and Equivalence Problems for Word-MIX Languages
Ryoma Sin'ya

Abstract
In this note we provide a (decidable) graph-structural characterisation of the infiniteness of Word-MIX languages, where the Word-MIX language of given parameter words is the set of all words that contain the same number of subword occurrences of each parameter word. We also provide a decidable characterisation of the equivalence of these languages. Although both decidability results also follow from more general known decidability results on unambiguous constrained automata, this note gives a self-contained proof of decidability that requires no knowledge of constrained automata.
Akita University
Email: [email protected]
1 Introduction
Counting occurrences of letters in words is a major topic in formal language theory, and much ink has been spent on it. Measuring the counting ability of a language class belongs to this topic. For example, Joshi et al. [1] suggested that the language MIX should not be in the class of so-called mildly context-sensitive languages since it allows too much freedom in word order, and so the relations between MIX and several language classes have been investigated (e.g., indexed languages [2], range concatenation languages [3], tree-adjoining languages [4], multiple context-free languages [5][6], etc.). The Parikh map is another rich example of this topic [7].
In the recent work [8] by Colbourn et al., the counting feature of MIX is generalised from counting letter occurrences to counting word occurrences. They considered several problems for languages of the form which we call Word-MIX languages (WMIX for short) in this note. It is interesting that the situation changes drastically under this generalisation. The decidability of infiniteness and equivalence turns out to be non-trivial: for example, and are finite but is infinite over (example from [8]), and is infinite but is finite over (these two examples appear again in Section 4.1). In addition, while is always deterministic context-free (DCFL), it can also be regular ( is regular, for example) [8]. This kind of generalisation (from letter occurrences to word occurrences) is also considered in the context of the Parikh map [9]. Colbourn et al. [8] provided a necessary and sufficient condition for and for these languages to be regular, and gave a polynomial time algorithm for testing that condition. The finiteness of is also considered in [8], where it is proved that, for any non-empty words , is finite if and only if the alphabet consists of a single letter () and . For the more general case, allowing more than two parameter words , they give a sufficient condition for the infiniteness of (Theorem 8 in [8]): if all of have the same length, then is infinite.
For the fully general case, the decidability of both regularity and infiniteness for WMIX languages can be derived from known results on constrained automata (CA for short), since every WMIX language is recognised by a deterministic CA, and the regularity and Parikh image of such a language are effectively computable [10]. In this note, we provide a self-contained description (requiring no knowledge of constrained automata) of a decidable, necessary and sufficient condition for the infiniteness of a WMIX language, and give some open problems about the infiniteness. Our proof is based on the combinatorics of walks in the de Bruijn graph. The (-dimensional) de Bruijn graph [11] can track all information of subword occurrences (of length at most ), hence it is a very useful tool for counting subword occurrences and related problems. The de Bruijn graph also played a key role in the proof of Theorem 8 in [8].
The rest of this note is organised as follows. In Section 2, we give some preliminary definitions and propositions about words, orders, graphs and walks. Section 3 investigates a simple decomposition method which decomposes a walk into a path and a sequence of cycles. This decomposition is useful for the proof of our main theorem. The main result of this note (Theorem 4.1), which gives a decidable characterisation of the infiniteness of a WMIX language, is stated and proved in Section 4; the decidability is explained with two examples in Section 4.1. The decidability of the equivalence of two WMIX languages is explained in Section 5. We end this note with a list of open problems in Section 6.
2 Preliminaries
For a set , we denote by the cardinality of . We write if is an infinite set, and write otherwise. We denote by the set of natural numbers including 0. We call a mapping a multiset over .
2.1 Words and Orders
For an alphabet , we denote the set of all (resp. non-empty) words over by (resp. ). We write (resp. ) for the set of all words of length (resp. less than ). For a pair of words , denotes the number of subword occurrences of in
[TABLE]
For words , we define
[TABLE]
and call it the Word-MIX language of parameter words ((k-)WMIX for short). For a word , we denote the set of prefixes and suffixes of by
[TABLE]
and denote the length- () prefix and suffix of by and , respectively.
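The counting function and the membership condition defined above can be illustrated with a short Python sketch. It assumes that a subword occurrence means a contiguous (possibly overlapping) occurrence, which is the reading consistent with the use of the de Bruijn graph later in this note. The function names `count_occurrences` and `in_wmix` are hypothetical and do not come from the note.

```python
from typing import Sequence


def count_occurrences(w: str, v: str) -> int:
    """Count the (possibly overlapping) contiguous occurrences of v in w."""
    if not v:
        raise ValueError("the parameter word must be non-empty")
    # Try every start position at which v could fit inside w.
    return sum(1 for i in range(len(w) - len(v) + 1) if w.startswith(v, i))


def in_wmix(w: str, params: Sequence[str]) -> bool:
    """Check whether w contains every parameter word equally often."""
    counts = {count_occurrences(w, v) for v in params}
    return len(counts) <= 1


print(count_occurrences("aaa", "aa"))
print(in_wmix("aba", ["ab", "ba"]))
```

Overlapping occurrences are counted separately, so the word aaa contains two occurrences of aa.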
A quasi order on a set is called a well-quasi-order (wqo for short) if any infinite sequence contains an increasing pair with . Let be a quasi order on a set and be a quasi order on a set . The product order is a quasi order on defined by
[TABLE]
Lemma 1 (cf. Proposition 6.1.1 in [12])
Let be a wqo on a set and be a wqo on a set . The product order is again a wqo on .
We list some examples of wqos below:
- (1) The identity relation on any finite set is a wqo (the pigeonhole principle).
- (2) The usual order on the natural numbers is a wqo.
- (3) For any $k \geq 1$, the product order on $k$-tuples of natural numbers is a wqo (Dickson’s lemma), which is a direct corollary of Lemma 1.
- (4) The point-wise order on the multisets over a finite set (one multiset is below another if the multiplicity of every element in the first is at most its multiplicity in the second) is a wqo (just a paraphrase of Dickson’s lemma).
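To make the notion of an increasing pair concrete, the following Python sketch searches a finite list of vectors of natural numbers for a pair that is increasing in the product order. Dickson's lemma guarantees that such a pair exists in every infinite sequence; a finite prefix may of course contain none. The name `find_increasing_pair` is a hypothetical helper introduced only for illustration.

```python
from typing import Optional, Sequence, Tuple


def find_increasing_pair(
    vectors: Sequence[Sequence[int]],
) -> Optional[Tuple[int, int]]:
    """Return indices (i, j) with i < j and vectors[i] <= vectors[j]
    component-wise, or None if the finite list has no such pair."""
    for j in range(len(vectors)):
        for i in range(j):
            if all(a <= b for a, b in zip(vectors[i], vectors[j])):
                return (i, j)
    return None


print(find_increasing_pair([(3, 0), (2, 1), (1, 2), (2, 2)]))
print(find_increasing_pair([(2, 0), (1, 1), (0, 2)]))
```

The second call illustrates that a finite antichain can exist, whereas an infinite one cannot.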
2.2 Graphs and Walks
Let be a (directed) graph. We call a sequence of vertices a walk (from into in ) if for each , and define the length of as and denote it by . We denote by and the source and the target of . is called an empty walk if . If two walks are connectable (i.e., ), we write for the connecting walk . A non-empty walk is called a loop (on ) if . A walk is called a path if for every with . A loop is called a cycle if is a path. We use the metavariable for a path, and the metavariable for a cycle. For a cycle and , we write for the loop which is the -fold repetition of . We denote by and by the sets of all walks, paths and cycles in , respectively. Note that is infinite in general, but and are both finite if is finite (i.e., ).
The -dimensional de Bruijn graph over is a graph whose vertex set is the set of words of length and the edge set is defined by
[TABLE]
The case is depicted in Fig. 1.
Let be a vertex of . A word induces the walk (where ) in , and we denote it by . Conversely, a walk in induces the word , and we denote it by (see Fig. 1). For words and a walk , we define the following vectors in :
[TABLE]
We notice that the range of the summation in the above definition of does not contain [math], hence if is an empty walk . The next proposition states a basic property of .
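The correspondence between words and walks in the de Bruijn graph can be sketched in Python as follows. This is only an illustration under stated assumptions: vertices are taken to be the words of length n, and there is an edge from av to vb for letters a, b and a word v of length n - 1; the exact indexing convention used in this note may differ. The names `de_bruijn_graph`, `walk_from` and `word_of` are hypothetical.

```python
from itertools import product
from typing import Dict, List


def de_bruijn_graph(alphabet: str, n: int) -> Dict[str, List[str]]:
    """Adjacency lists: each length-n word av has an edge to vb for every letter b."""
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    return {v: [v[1:] + c for c in alphabet] for v in vertices}


def walk_from(start: str, word: str) -> List[str]:
    """Walk induced by reading `word` letter by letter from vertex `start`."""
    walk = [start]
    for c in word:
        # Shift the window: drop the first letter and append the new one.
        walk.append(walk[-1][1:] + c)
    return walk


def word_of(walk: List[str]) -> str:
    """Word spelled by a walk: the first vertex, then the last letter of each later vertex."""
    return walk[0] + "".join(v[-1] for v in walk[1:])


graph = de_bruijn_graph("ab", 2)
walk = walk_from("ab", "ba")
print(walk, word_of(walk))
```

Every window of length n of the spelled word appears as a vertex of the walk, which is why walks in this graph record all subword occurrences up to that length.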
Proposition 1
Let and . For any pair of words such that , we have
[TABLE]
where .
Proof
Straightforward induction on the length of .
3 Path-Cycle Decomposition of Walks
In this section, we provide a simple method which decomposes, in a left-to-right manner, a walk into a (possibly empty) path and a sequence of cycles (Fig. 2). This decomposition is probably folklore but is useful for our main proof in the next section. We also introduce in this section the notion of multi-traces and traces of walks, which play a crucial role in the characterisation of the infiniteness and equivalence for WMIX languages.
Let be a graph. For a pair of sequences of cycles , we write for the concatenation . When we simply write for . We write for the empty sequence of cycles. For , we denote by the -th component of , and by the number of occurrences of in . For a walk , we denote by the set of all vertices appearing in : .
We then define a decomposition function inductively as follows:
[TABLE]
Conversely, we define a composition (partial) function inductively as follows:
[TABLE]
We list some important properties of and .
Proposition 2
Let be a graph and . Then the following hold for :
- (1) is a path in .
- (2) is a sequence of cycles in .
- (3)
- (4) , i.e., is the identity function on .
Proof
(1)–(2) are obvious by the definition. (3)–(4) can be shown by an easy induction on the length of .
Proposition 3
Let and . For any in ,
[TABLE]
holds where .
Proof
Straightforward induction on the length of .
Example 1
Consider the complete graph of order and a walk . The result of decomposition is . All intermediate computation steps of and are drawn in Fig. 2 (in the figure we denote by a pair for visibility). It is clear that all the conditions in Proposition 2 are satisfied (, in particular).
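A left-to-right decomposition of a walk into a path and a sequence of cycles can be sketched in Python as follows. This is one standard way to carry out such a decomposition and is offered as an assumption about the intended procedure, not as the exact inductive definition of this section. The function name `decompose` and the sample walk are hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple


def decompose(walk: Sequence[Hashable]) -> Tuple[List[Hashable], List[List[Hashable]]]:
    """Split a walk (a non-empty vertex sequence) into a path and a list of cycles.

    Vertices are read left to right. Whenever the next vertex already lies
    on the current path, the closed part is cut out and recorded as a cycle.
    """
    path: List[Hashable] = [walk[0]]
    cycles: List[List[Hashable]] = []
    for x in walk[1:]:
        if x in path:
            i = path.index(x)
            # The cycle runs from the earlier visit of x back to x.
            cycles.append(path[i:] + [x])
            path = path[: i + 1]
        else:
            path.append(x)
    return path, cycles


path, cycles = decompose([1, 2, 3, 2, 4, 1, 5])
print(path, cycles)
```

In this sketch the number of edges of the walk equals the number of edges of the remaining path plus the total number of edges of the extracted cycles.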
3.1 Multi-Traces and Traces
For a walk in a graph , we define the multi-trace of a walk as the following multiset over paths and cycles:
[TABLE]
We define the trace of a walk in as the following set of paths and cycles:
[TABLE]
Intuitively, the multi-trace of in is obtained by forgetting the ordering of the decomposition result of , and the trace of is obtained by forgetting the multiplicity from the original multi-trace (see Fig. 3 for the relation).
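The relation between a decomposition, its multi-trace and its trace can be sketched in Python with `collections.Counter`. The sketch assumes that the decomposition result is already given as a path and a list of cycles, and it tags each entry as a path or a cycle so that the two kinds cannot be confused. The names `multi_trace` and `trace` are hypothetical helpers.

```python
from collections import Counter
from typing import Hashable, Sequence, Set, Tuple

Entry = Tuple[str, Tuple[Hashable, ...]]


def multi_trace(path: Sequence[Hashable],
                cycles: Sequence[Sequence[Hashable]]) -> Counter:
    """Multiset of the path and the cycles: the order of the cycles is forgotten."""
    result: Counter = Counter()
    result[("path", tuple(path))] += 1
    for cycle in cycles:
        result[("cycle", tuple(cycle))] += 1
    return result


def trace(path: Sequence[Hashable],
          cycles: Sequence[Sequence[Hashable]]) -> Set[Entry]:
    """Set of the path and the cycles: multiplicities are forgotten as well."""
    return set(multi_trace(path, cycles))


mt = multi_trace([1, 5], [[2, 3, 2], [1, 2, 4, 1], [2, 3, 2]])
print(mt[("cycle", (2, 3, 2))], len(trace([1, 5], [[2, 3, 2], [1, 2, 4, 1], [2, 3, 2]])))
```

Two decompositions that differ only in the order of their cycles give the same multi-trace in this representation.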
Since Item 3 in Proposition 2 and Proposition 3 do not depend on the order of a sequence , one can easily observe that the following proposition holds by the definition of .
Proposition 4
Let and . For any in , we have
[TABLE]
and
[TABLE]
For a set , the following lemma states that we can effectively test whether is a trace or not (see Fig. 4 for the intuition).
Lemma 2
Let be a set of paths and cycles in a graph . The following are equivalent:
- (1) is a trace of some walk in .
- (2) can be written as such that (i) and (ii) for every , where
4 Characterisation of the Finiteness
For a vector , we define
[TABLE]
Observe that if and only if .
We are now ready to state and prove the main result.
Theorem 4.1
Let and . Then the following are equivalent:
- (1) is infinite.
- (2) There exists a trace that satisfies the following two conditions. By Lemma 2, we can assume without loss of generality that is of the form that satisfies Condition (2) in Lemma 2.
(balance condition) there exist positive coefficients , for each , such that
[TABLE]
(pumping condition) there exist coefficients , not all zero, such that
[TABLE]
Proof
The direction (2) ⇒ (1) is relatively easy. Intuitively, the balance condition ensures the existence of a word such that , , and the pumping condition further ensures is “pumpable” in some sense, which implies the infiniteness of . We now make this intuition precise. Assume a trace satisfies the balance and pumping conditions and let . Since is a trace, is defined by Lemma 2. Let be positive coefficients that satisfy the balance condition, and be coefficients, not all zero, that satisfy the pumping condition. For each , we define and . Note that, since every cycle has length at least one and by Item 3 in Proposition 2, and hence holds for every . Combining Proposition 1, Proposition 3 and the balance condition, we obtain
[TABLE]
Moreover, by Proposition 1, Proposition 3 and the pumping condition we have
[TABLE]
for any . This means that every distinct word is in , hence is infinite.
We then prove the opposite direction (1) ⇒ (2). Assume . Since is infinite and hence contains arbitrarily long words, we can take an infinite sequence of words from that satisfies , and . Now consider an infinite sequence of multi-traces
[TABLE]
Since the point-wise order on the multisets over any finite set is a wqo (thanks to Dickson’s lemma) and is finite, contains an increasing pair with . Define . We notice that, since and every multi-trace contains exactly one path as its element, and for some path . Since holds for every by definition, we can deduce by Proposition 4 and thus we have
[TABLE]
Let be the number of cycles in and write . Define and for each . Clearly and hold for every by the definition. By Proposition 1 and Proposition 4, we have
[TABLE]
that is, the balance condition is satisfied. In addition, from the above equation we obtain
[TABLE]
because for any such that , if and only if . By Condition ( ‣ 4), coefficients , …, are not all zero, the pumping condition is satisfied. This ends the proof. ∎
4.1 Decidability and Examples
The decision problem of whether both the balance and pumping conditions are satisfied for a given trace in can be reduced to a -formula (existential formula) of Presburger arithmetic (see the examples below). In addition, the set of traces in is clearly finite and effectively enumerable (due to Lemma 2). Thus we obtain the following corollary.
Corollary 1
For all words , it is decidable whether is infinite or not.
Proof
Enumerate all possible traces in and check whether there is a trace that satisfies both the balance and pumping conditions.
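As a sanity check that is independent of the decision procedure above, one can enumerate by brute force the members of a WMIX language up to a bounded length. The Python sketch below does only that: it is not the algorithm of Corollary 1, and a bounded enumeration can suggest but never prove infiniteness or finiteness. The names `count_occurrences` and `wmix_members` and the sample parameter words are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def count_occurrences(w: str, v: str) -> int:
    """Count the (possibly overlapping) contiguous occurrences of v in w."""
    return sum(1 for i in range(len(w) - len(v) + 1) if w.startswith(v, i))


def wmix_members(alphabet: str, params: Sequence[str], max_len: int) -> List[str]:
    """All words of length at most max_len containing every parameter word equally often."""
    members = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if len({count_occurrences(w, v) for v in params}) <= 1:
                members.append(w)
    return members


print(wmix_members("a", ["a", "aa"], 5))
print(wmix_members("ab", ["ab", "ba"], 2))
```

The running time grows exponentially with the length bound, so this is only practical for small experiments.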
Example 2
Consider the language over , and the 2-dimensional de Bruijn graph shown in Fig. 1. We claim that a trace satisfies both the balance and pumping conditions. One can easily observe that
[TABLE]
and hence the coefficient simultaneously satisfies the two conditions stated in (2) of Theorem 4.1. For each , by Proposition 1 the word is in . Hence and .
Example 3
Next consider another language over , and again the 2-dimensional de Bruijn graph shown in Fig. 1. In contrast with Example 2, the trace no longer satisfies the balance condition (even though it still satisfies the pumping condition). We have
[TABLE]
We can formally prove that there is no positive coefficient that satisfies the balance condition, since the existence of such coefficients can be expressed in the following -formula of Presburger arithmetic
[TABLE]
where is a subexpression defined by
[TABLE]
This formula can be algorithmically verified to be not valid, since the validity of a first-order formula of Presburger arithmetic is decidable (cf. Section 6.2 of [13]). Using the same reduction to -formulae of Presburger arithmetic, we can algorithmically verify that no trace in satisfies both the balance and pumping conditions. Thus by Theorem 4.1.
5 Characterisation of the Equivalence
In the previous section, multi-traces and traces played a crucial role in the characterisation of the finiteness. Multi-traces are also important for the characterisation of the equivalence of WMIX languages, which is given here. Before stating the main statement, we lift the notion of traces of walks to that of languages. For a language , we define the multi-trace of a language (of order ) as
[TABLE]
The following theorem states that any WMIX language is completely determined by its multi-trace (excluding the shorter part ).
Theorem 5.1
Let and . Then if and only if
[TABLE]
and
[TABLE]
Proof
The “only-if”-part is trivial. We prove the “if”-part by contraposition. Assume . Then, without loss of generality, we can assume that there is some word such that but . If it is clear that
[TABLE]
and the “if”-part holds. Thus we consider the case . Let such that and . We now prove that does not contain any word that has the same multi-trace as (i.e., ; holds in this case). By Proposition 1 and Proposition 4, the numbers of subword occurrences in a word are completely determined by its multi-trace. Thus if there is a word in such that , then
[TABLE]
from which we obtain ; this contradicts the assumption. Therefore we can conclude that
[TABLE]
5.1 Decidability
By using Theorem 5.1, we can obtain an algorithm for deciding the equivalence of two WMIX languages. Like the previous algorithm for the infiniteness, this algorithm also uses the decidability of Presburger arithmetic, but in contrast to the case of infiniteness, the problem is reduced to a -formula of Presburger arithmetic.
Corollary 2
For any words , it is decidable whether or not.
Proof
Let . We can effectively check whether holds or not, since is finite. If then the two languages are not equivalent. Otherwise, enumerate all possible traces in , and for each trace check whether every multi-trace with satisfies
[TABLE]
or not. If there is some multi-trace that does not satisfy Condition ( ‣ 5.1) then , otherwise holds. Since every multi-trace can be represented by a corresponding trace and its multiplicity (positive coefficients), for a trace , the statement “every multi-trace with satisfies Condition ( ‣ 5.1)” can be represented by the following -formula of Presburger arithmetic :
[TABLE]
where is a subexpression defined by
[TABLE]
6 Open Problems
We would like to introduce the following open problem, which asks about the existence of a non-trivial finite WMIX language.
Problem 1 ([14])
Are there such that is finite but for some ?
Note that none of the examples of finite WMIX languages in this note is of this type. The complexity issue is also interesting.
Problem 2
What is the complexity of the infiniteness problem (resp. the equivalence problem) for WMIX languages?
Acknowledgment. The author would like to thank Thomas Finn Lidbetter (University of Waterloo) for introducing the author to this topic. The author also thanks an anonymous reviewer for pointing out some known results on unambiguous CA [10] and that the decidability results presented in this note also follow from those results.
References
- [1] Joshi, A., Vijay-Shanker, K., Weir, D.: The convergence of mildly context-sensitive grammar formalisms. In: Foundational Issues in Natural Language Processing (1991) 31–82
- [2] Marsh, W.: Some conjectures on indexed languages. Abstract appears in Journal of Symbolic Logic 51 (3) (1985) 849. Paper presented to the Association for Symbolic Logic Meeting, Stanford University, July 15–19, 1985.
- [3] Boullier, P.: Chinese numbers, MIX, scrambling, and range concatenation grammars. In: Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics. EACL ’99, Stroudsburg, PA, USA, Association for Computational Linguistics (1999) 53–60
- [4] Kanazawa, M., Salvati, S.: MIX is not a tree-adjoining language. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. ACL ’12, Stroudsburg, PA, USA, Association for Computational Linguistics (2012) 666–674
- [5] Salvati, S.: MIX is a 2-MCFL and the word problem in Z^2 is captured by the IO and the OI hierarchies. Journal of Computer and System Sciences 81 (7) (2015) 1252–1277
- [6] Sorokin, A.: Ogden property for linear displacement context-free grammars. In Artemov, S., Nerode, A., eds.: Logical Foundations of Computer Science, Cham, Springer International Publishing (2016) 376–391
- [7] Parikh, R.J.: On context-free languages. J. ACM 13 (4) (October 1966) 570–581
- [8] Colbourn, C.J., Dougherty, R.E., Lidbetter, T.F., Shallit, J.: Counting subwords and regular languages. In: Developments in Language Theory - 22nd International Conference, DLT 2018, Tokyo, Japan, September 10-14, 2018, Proceedings. (2018) 231–242
