Succinct Representation for (Non)Deterministic Finite Automata
Sankardeep Chakraborty, Roberto Grossi, Kunihiko Sadakane, Srinivasa Rao Satti

Sankardeep Chakraborty
RIKEN Center for Advanced Intelligence Project, Japan
Roberto Grossi
Dipartimento di Informatica, Università di Pisa, Italy
Kunihiko Sadakane
The University of Tokyo, Japan
Srinivasa Rao Satti
Seoul National University, South Korea
Abstract.
Deterministic finite automata are one of the simplest and most practical models of computation studied in automata theory. Their conceptual extension is the non-deterministic finite automata, which also have plenty of applications. In this article, we study these models through the lens of succinct data structures, where our ultimate goal is to encode these mathematical objects using an information-theoretically optimal number of bits while supporting queries on them efficiently. Towards this goal, we first design a succinct data structure for representing any deterministic finite automaton $\mathcal{D}$ having $n$ states over a $\sigma$-letter alphabet $\Sigma$ using $(\sigma-1)n\log n + O(n\log\sigma)$ bits of space, which can determine, given an input string $x$ over $\Sigma$, whether $\mathcal{D}$ accepts $x$ in $O(|x|\log\sigma)$ time, using constant words of working space. When the input deterministic finite automaton is acyclic, not only can we improve the above space bound significantly to $(\sigma-1)(n-1)\log n + 3n + O(\log^2\sigma) + o(n)$ bits, but we also obtain optimal query time for string acceptance checking. More specifically, using our succinct representation, we can check if a given input string $x$ can be accepted by the acyclic deterministic finite automaton using time proportional to the length of $x$, hence the optimal query time. We also exhibit a succinct data structure for representing a non-deterministic finite automaton $\mathcal{N}$ having $n$ states over a $\sigma$-letter alphabet $\Sigma$ using $\sigma n^2 + n$ bits of space, such that given an input string $x$, we can decide whether $\mathcal{N}$ accepts $x$ efficiently in $O(n^2|x|)$ time. Finally, we also provide time- and space-efficient algorithms for performing several standard operations such as union, intersection and complement on the languages accepted by deterministic finite automata.
Key words and phrases:
Succinct Data Structures, Encoding Schemes, Finite Automata
1. Introduction
Automata theory is a branch of theoretical computer science that deals with the definitions, properties and applications of different mathematical models of computation. These models play a major role in multiple applied areas of computer science. One of the most basic and fundamental models, studied in automata theory for a long time, is the finite automaton. It primarily comes in two types, deterministic finite automata (henceforth DFA) and non-deterministic finite automata (henceforth NFA), among others. There exist more complex and sophisticated models as well, for example, context-free grammars and Turing machines. In what follows, we formally define DFA and NFA, as these are our primary subjects of study in this article. A DFA $\mathcal{D}$ is a quintuple $(\Sigma, Q, q_0, \delta, F)$ where:
- $\Sigma$ is an alphabet, i.e., a finite set of letters,
- $Q$ is the finite set of states,
- $q_0 \in Q$ is the initial state,
- $\delta : Q \times \Sigma \to Q$ is the transition function, and
- $F \subseteq Q$ is the set of final states.
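We often extend the transition function $\delta$ to $\delta^* : Q \times \Sigma^* \to Q$, which is defined recursively as follows: $\delta^*(q, \epsilon) = q$ for all $q \in Q$, where $\epsilon$ is the empty string; and $\delta^*(q, xa) = \delta(\delta^*(q, x), a)$ for all $q \in Q$, $x \in \Sigma^*$, and $a \in \Sigma$. Given the above definition, we say that the DFA $\mathcal{D}$ accepts a string $x$ over the alphabet $\Sigma$ if and only if $\delta^*(q_0, x) \in F$. The language accepted by a DFA $\mathcal{D}$ is defined as the set of all strings accepted by $\mathcal{D}$, and is denoted by $L(\mathcal{D})$. See Figure 1 for a simple example. In the rest of this paper, we assume that the alphabet is , and the state set is .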
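The acceptance condition above can be illustrated with a minimal Python sketch. The dictionary-based transition table, the function names `delta_star` and `dfa_accepts`, and the example automaton `EVEN_A` are assumptions made for illustration; they are not the succinct representation developed later in the paper.

```python
def delta_star(delta, q, x):
    """Extended transition function: the state reached from q after reading x."""
    for a in x:                 # reading the empty string leaves q unchanged
        q = delta[(q, a)]
    return q


def dfa_accepts(delta, q0, finals, x):
    """A DFA accepts x iff the state reached from q0 on x is a final state."""
    return delta_star(delta, q0, x) in finals


# Hypothetical two-state DFA over {a, b}: accepts strings with an even number of 'a'.
EVEN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

print(dfa_accepts(EVEN_A, 0, {0}, "abba"))  # True: two occurrences of 'a'
print(dfa_accepts(EVEN_A, 0, {0}, "ab"))    # False: one occurrence of 'a'
```

Each input letter costs one table lookup, which is the behaviour the succinct encodings aim to preserve while using far less space than an explicit table.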
A deterministic automaton is called acyclic [16] if it has a unique recurrent state, where a state $q$ is defined as recurrent if there exists a non-empty string $x$ over $\Sigma$ such that $\delta^*(q, x) = q$. Non-recurrent states are typically called transient, and the unique recurrent state (denoted by ) is classically called the dead state as for all .
An NFA is a conceptual extension of a DFA in which the definition of the transition function is extended. More specifically, for a DFA, the transition function is defined as $\delta : Q \times \Sigma \to Q$, whereas for an NFA, it is defined as $\delta : Q \times \Sigma \to 2^Q$, where $2^Q$ denotes the power set of $Q$. Another extension, which is sometimes used in the literature, is to simply allow more than one initial state in an NFA, and in this case, the third item in the tuple becomes denoting the set of initial states, instead of singleton . The rest of the above quintuple definition remains as it is for NFA. Thus, in the case of NFA , the language is defined as . We refer the readers to the classic texts of [14, 23] for a thorough discussion of these mathematical models and automata theory in general.
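To make the difference from the deterministic case concrete, here is a small Python sketch of NFA acceptance that tracks the set of currently reachable states. The representation (a dictionary mapping `(state, letter)` to a set of states), the name `nfa_accepts`, and the example automaton are hypothetical and chosen only for illustration.

```python
def nfa_accepts(delta, initial, finals, x):
    """Accept x iff some run from an initial state ends in a final state."""
    current = set(initial)
    for a in x:
        # Follow every possible transition from every currently reachable state.
        current = {r for q in current for r in delta.get((q, a), ())}
    return bool(current & set(finals))


# Hypothetical NFA over {0, 1}: accepts strings whose second-to-last letter is 1.
SECOND_LAST_ONE = {
    (0, "0"): {0},
    (0, "1"): {0, 1},
    (1, "0"): {2},
    (1, "1"): {2},
}

print(nfa_accepts(SECOND_LAST_ONE, {0}, {2}, "110"))  # True
print(nfa_accepts(SECOND_LAST_ONE, {0}, {2}, "100"))  # False
```

Passing a set of initial states covers both variants mentioned above: a single initial state is simply a singleton set.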
Even though a DFA is defined as an abstract mathematical concept, it has a myriad of practical applications. More specifically, it is used in text processing, compilers, and hardware design [23]. Quite often it is implemented in small hardware and software tools for solving various specific tasks. For example, a DFA can model software that decides whether online user input, such as an email address, is valid. DFAs/NFAs are also used for network packet filtering. In some of these applications, the alphabet is large and there is a failure/exit state so that only a subset of transitions go to non-failure states; we call the latter non-failure transitions.
Despite these many practically motivated applications, to the best of our knowledge there is no study of DFAs and NFAs from the point of view of succinct data structures, where the goal is to store an arbitrary element from a set of objects using the information-theoretic minimum number of bits while still supporting the relevant set of queries efficiently. This is what we focus on in this paper. We also assume the usual model of computation, namely a $\Theta(\log n)$-bit word RAM model where $n$ is the size of the input.
1.1. Related Work
The field of succinct data structures originally started with the work of Jacobson [15], and by now it is a relatively mature field in terms of breadth of problems considered. To illustrate this further, there already exists a large body of work on representing various combinatorial objects succinctly. A partial list of such combinatorial objects would be trees [18, 21], various special graph classes like planar graphs [2], chordal graphs [19], partial $k$-trees [11], interval graphs [1] along with arbitrary general graphs [12], permutations [17], functions [17], bitvectors [22] among many others. We refer the reader to the recent book by Navarro [20] for a comprehensive treatment of this field. The study of succinct data structures is motivated both by theoretical curiosity and by practical needs, as these combinatorial structures arise quite often in various applications.
For DFA and NFA, other than the basic structure mentioned in the introduction, there exist many extensions/variations in the literature, for example, two-way finite automata, Büchi automata and many more. Researchers generally study the properties, limitations and applications of these mathematical structures. One such line of study that is particularly relevant to this paper is the research on counting DFAs and NFAs. Since the fifties there have been plenty of attempts at exactly counting the number of DFAs and NFAs with $n$ states over the alphabet $\Sigma$, and the state-of-the-art results are due to [3] for DFAs and [10] for NFAs. We refer the readers to the survey of Domaratzki [9] (and the references therein) for more details. From these results, we can deduce the information-theoretic lower bounds on the number of bits required to represent any DFA or NFA. We then complement these lower bounds by designing data structures whose size matches them, hence consuming optimal space, and which are capable of executing algorithms efficiently on this succinct representation. This is the main contribution of this paper.
1.2. DFA and NFA Enumeration
After a number of efforts by several authors, Bassino and Nicaud [3] finally found matching upper and lower bounds on the number of non-isomorphic initially-connected (i.e., all the states are reachable from the initial state; note that this assumption always implies that the language accepted by the DFA is non-empty) DFAs with (including a fixed initial and one or possibly more final) states over an alphabet (where ): this number is where denotes the Stirling numbers of the second kind (defined recursively as , for all and for all , .). Using the approximation of the Stirling numbers of the second kind [13], which states that , we obtain that the information-theoretic lower bound for representing any DFA having states and a -sized alphabet is bits. On the other hand, Domaratzki et al. [10] showed that there are asymptotically initially-connected NFAs on states over a -letter alphabet with a fixed initial state and one or more final states. Thus, information theoretically, we need at least bits to represent any NFA. In what follows, we show that we can represent any given DFA/NFA using the asymptotically optimal number of bits mentioned here. Throughout this paper, we assume that the input DFAs/NFAs that we want to encode succinctly are initially connected.
1.3. Our Main Results and Paper Organization
The classical representation of DFAs/NFAs consists of explicitly writing the transition function in a two dimensional array having rows corresponding to the states of the DFA/NFA and (where ) columns corresponding to the alphabet such that where . For DFA, the entry in is a singleton set whereas for NFA it could possibly contain a set having more than one state. Thus, the space requirement for representing any given DFA (NFA respectively) is given by ( respectively) bits. These space bounds are clearly not optimal – for the DFAs, it is off by an additive term from the information theoretic minimum, while for the NFAs, it is off by a multiplicative factor of from the optimal bound. We alleviate this discrepancy in the space bounds by designing optimal succinct data structures for these objects.
Towards this goal, we start by listing all the preliminary data structures and graph theoretic terminologies that will be required in our paper in Section 2. Then, in Section 3.1 we first discuss the relevant prior work from [3], and show that, by using suitable data structures, their work already gives a succinct encoding of DFA. But the major drawback of this encoding is that it is not capable of handling the problem of checking whether a string is accepted by the DFA extremely efficiently. In Section 3.2, we overcome this problem by designing a succinct data structure for DFA, which can also check the string acceptance almost optimally. We summarize our main result in the following theorem.
Theorem 1.1.
Given an initially-connected deterministic finite automaton $\mathcal{D}$ having $n$ states and working over an alphabet $\Sigma$ of size $\sigma$, there exists a succinct encoding for $\mathcal{D}$ taking $(\sigma-1)n\log n + O(n\log\sigma)$ bits of space, which can determine, given an input string $x$ over $\Sigma$, whether $\mathcal{D}$ accepts $x$ in $O(|x|\log\sigma)$ time, using constant words of working space. If the DFA has only non-failure transitions, then the space can be further reduced to bits.
The upper bounds in Theorem 1.1 save roughly bits with respect to the immediate representation of the DFA. The former upper bound is optimal as it matches the information-theoretical lower bound in Section 1.2, up to lower order terms. As for the latter upper bound, we do not know its optimality but it is smaller than the information-theoretical lower bound of bits derived for edge-labeled deterministic directed graphs [12]. Indeed, DFAs can be seen as a special case of these graphs where is the number of nodes, is the number of arcs, and is the maximum node degree. (A directed graph with labels on its arcs is deterministic if no two out-neighbor arcs have the same label. Since there are directed graphs [12] with nodes and arcs, each deterministic graph can have label assignments for its arcs, where is the out-degree of node and . Note that when labels are from and thus .)
We can improve the above space bound significantly if the given DFA is acyclic along with obtaining optimal query time for string acceptance checking. More specifically, in Section 3.3, we obtain the following result in this case.
Theorem 1.2.
Given an initially-connected acyclic deterministic finite automaton having transient states, a unique dead state and working over an alphabet of size , there exists a succinct encoding for taking bits of space, which can optimally determine, given an input string over , whether accepts in time proportional to the length of , using constant words of working space.
This is followed by the succinct data structure for NFA in Section 3.4 where we prove the following result.
Theorem 1.3.
Given an initially-connected non-deterministic finite automaton $\mathcal{N}$ having $n$ states and working over an alphabet $\Sigma$ of size $\sigma$, there exists a succinct encoding for $\mathcal{N}$ taking $\sigma n^2 + n$ bits of space, which can determine, given an input string $x$ over $\Sigma$, whether $\mathcal{N}$ accepts $x$ in $O(n^2|x|)$ time, using bits of working space.
Next we discuss how one can support several standard operations, such as union and intersection, on the languages accepted by two deterministic finite automata. Classically this is done via the product automaton construction [14, 23], and here we provide a time- and space-efficient algorithm for performing this construction. More specifically, we show the following theorem (proof and other details are provided in Appendix A.1).
Theorem 1.4.
Suppose we are given the succinct representations for two DFAs (having states) and (having states) respectively such that both are working over the same alphabet . Also suppose that the product automaton (denoted by ) has states where . Then, using expected time and bits of working space, we can directly construct a succinct representation for . Moreover, can be represented optimally using bits overall, and by suitably defining the final states of , we can make accept either or . Finally, given an input string over , we can decide whether in time using constant words of working space.
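For readers unfamiliar with the classical product construction that Theorem 1.4 builds on, the following Python sketch constructs only the reachable part of the product automaton, using a hash map to number state pairs. It works on plain transition dictionaries rather than on the succinct representations, so it illustrates the textbook construction and not the paper's algorithm; the names `product_dfa`, `run`, `EVEN_A` and `ENDS_B` are hypothetical.

```python
from collections import deque


def product_dfa(d1, s1, f1, d2, s2, f2, alphabet, mode="intersection"):
    """Build the reachable product of two DFAs given as (state, letter) -> state maps."""
    start = (s1, s2)
    ids = {start: 0}                     # hash map: state pair -> new state number
    delta, finals = {}, set()
    queue = deque([start])
    while queue:
        pair = queue.popleft()
        p, q = pair
        in1, in2 = p in f1, q in f2
        is_final = (in1 and in2) if mode == "intersection" else (in1 or in2)
        if is_final:
            finals.add(ids[pair])
        for a in alphabet:
            nxt = (d1[(p, a)], d2[(q, a)])
            if nxt not in ids:           # first time this pair is reached
                ids[nxt] = len(ids)
                queue.append(nxt)
            delta[(ids[pair], a)] = ids[nxt]
    return delta, 0, finals, len(ids)


def run(delta, start, finals, x):
    q = start
    for a in x:
        q = delta[(q, a)]
    return q in finals


# Hypothetical inputs: even number of 'a', and strings ending with 'b'.
EVEN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
ENDS_B = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}

inter = product_dfa(EVEN_A, 0, {0}, ENDS_B, 0, {1}, "ab", "intersection")
union = product_dfa(EVEN_A, 0, {0}, ENDS_B, 0, {1}, "ab", "union")
print(run(*inter[:3], "aab"), run(*union[:3], "ab"))
```

Only the choice of final states differs between the union and intersection variants, which matches the statement that the final states of the product can be defined to obtain either language.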
Finally, we conclude in Section 4 with some concluding remarks.
2. Preliminaries
In this section we collect all the previous theorems and definitions that will be used throughout this paper.
2.1. Graph Terminology and Graph Algorithms
We assume knowledge of basic graph-theoretic terminology (like trees, paths, etc.) as given in [6] and of basic graph algorithms (mostly the depth-first search (henceforth DFS) traversal of a graph and its related concepts) as given in [5]. It may seem slightly unusual at this point that we are talking about graphs when the focus of this paper is DFA/NFA and their succinct representations. Essentially, in this paper we view a DFA/NFA, more specifically its graphical representation, i.e., the state transition diagram, as a special case of an edge-labeled directed graph having $n$ nodes corresponding to the states of the DFA/NFA, $m$ edges where $m = \sigma n$ as each node has exactly $\sigma$ outgoing edges, and each edge is labeled with some element from $\Sigma$. It is with this point of view that we design our succinct data structures for DFA/NFA in this paper.
2.2. Succinct Data Structures
Rank-Select. For a bit vector and any , the rank and select operations are defined as follows:
- = the number of occurrences of in , for ;
- = if , and otherwise; and
- = the position in of the -th occurrence of , for .
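As a reference point for the operations listed above, the following naive Python sketch implements rank and select by linear scanning. The indexing convention (0-based positions, inclusive prefix for rank, 1-based occurrence count for select) and the function names are assumptions made for this sketch; the succinct structures cited below answer the same queries in constant time using auxiliary tables.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in bits[0..i] (inclusive, 0-based)."""
    return sum(1 for v in bits[: i + 1] if v == b)


def select(bits, b, j):
    """0-based position of the j-th occurrence (j >= 1) of bit b, or -1 if absent."""
    seen = 0
    for pos, v in enumerate(bits):
        if v == b:
            seen += 1
            if seen == j:
                return pos
    return -1


B = [1, 0, 1, 1, 0, 0, 1]
print(rank(B, 1, 3))    # 3 ones among positions 0..3
print(select(B, 0, 2))  # the second 0 is at position 4
```

Under this convention, `rank(B, b, select(B, b, j)) == j` whenever the j-th occurrence exists, which is the identity the later constructions rely on.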
We make use of the following theorems:
Theorem 2.1 [4].
We can store a bitstring of length with additional bits such that rank and select operations can be supported in time. Such a structure can also be constructed from the given bitstring in time and space.
Theorem 2.2 [22].
We can store a bitstring of length with ones using bits such that operations can be supported in time. Such a structure can also be constructed from the given bitstring in time and space.
Succinct tree representation. We use following result from [18].
Theorem 2.3 [18].
Given a rooted ordered tree on nodes, it can be succinctly represented as a sequence of balanced parentheses of length bits, such that given a node , we can support subtree size and various navigational queries (such as parent and -th child) on in time using an additional bits. Such a structure can also be constructed in time and space.
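The balanced parentheses encoding mentioned in Theorem 2.3 can be sketched in a few lines of Python: a preorder traversal writes an opening parenthesis on entering a node and a closing one on leaving it, giving two symbols per node. The helper names and the child-list input format are assumptions, and the subtree-size query below scans linearly, whereas the cited structure answers it in constant time with auxiliary data.

```python
def to_parentheses(children, root=0):
    """Encode an ordered tree (node -> list of children) as balanced parentheses."""
    out = []

    def visit(v):
        out.append("(")                  # entering node v
        for c in children.get(v, []):
            visit(c)
        out.append(")")                  # leaving node v

    visit(root)
    return "".join(out)


def subtree_size(bp, open_pos):
    """Number of nodes in the subtree whose opening parenthesis is at open_pos."""
    depth = 0
    for pos in range(open_pos, len(bp)):
        depth += 1 if bp[pos] == "(" else -1
        if depth == 0:                   # matching closing parenthesis found
            return (pos - open_pos + 1) // 2
    raise ValueError("unbalanced sequence")


# Hypothetical tree: node 0 has children 1 and 3, node 1 has child 2.
tree = {0: [1, 3], 1: [2]}
bp = to_parentheses(tree)
print(bp)                    # ((())())
print(subtree_size(bp, 1))   # 2: nodes 1 and 2
```

The k-th opening parenthesis corresponds to the k-th node in preorder, which is why the later sections relabel states by preorder number.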
Compact representation of increasing sequence. We use the following theorem from [24].
Theorem 2.4 [24].
Given an increasing integer sequence of length such that , there exists a data structure to represent in compressed form using bits of space, where is any parameter, such that any entry and the value can be retrieved in time.
We denote the above data structure by . If denotes the characteristic vector for the sequence , then computing and correspond to computing select and rank on .
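The correspondence between an increasing sequence and rank/select on its characteristic vector can be checked with a short Python sketch. The function names, the 0-based universe, and the 1-based entry index are assumptions for illustration; no compression is attempted here.

```python
def characteristic_vector(seq, universe):
    """Bit vector with a 1 at every position that occurs in the increasing sequence."""
    bits = [0] * universe
    for v in seq:
        bits[v] = 1
    return bits


def entry(bits, i):
    """i-th smallest element (1-based): the position of the i-th 1, i.e. a select query."""
    seen = 0
    for pos, b in enumerate(bits):
        seen += b
        if b and seen == i:
            return pos
    raise IndexError(i)


def count_up_to(bits, v):
    """Number of sequence elements that are <= v, i.e. a rank query at position v."""
    return sum(bits[: v + 1])


X = [1, 4, 5, 9]
bits = characteristic_vector(X, 10)
print(entry(bits, 3), count_up_to(bits, 5))  # 5 3
```

Reading an entry of the sequence is a select on the ones, and counting the entries up to a value is a rank, as stated above.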
Representation of a vector. We also make use of the following theorem from [8].
Theorem 2.5 [8].
There exists a data structure that can represent a vector of elements from a finite alphabet using bits, such that any element of the vector can be read or written in constant time.
3. Succinct Representations for DFA and NFA
In this section, we provide all the upper bound results of our paper dealing with DFA/NFA. Throughout this section, whenever we mention a DFA (NFA resp.), it refers to an initially-connected deterministic (non-deterministic resp.) finite automaton having $n$ states and working over an alphabet $\Sigma$ of size $\sigma$. With this notation in mind, we start with the succinct encoding of DFA.
3.1. Succinct Encoding of DFA
Bassino and Nicaud [3] proved a beautiful bijection between the state transition diagram of any DFA and pairs of integer sequences which can be represented by boxed diagrams (defined shortly), along with providing an efficient algorithm to perform this construction. We refer the readers to [3] for complete details regarding the bijection, the counting, and many other details that we choose not to repeat here. However, we still need to provide some details/definitions (which basically follow their exposition) that are relevant to our own work and that also help in understanding the results from their paper. Following [3], a diagram of width and height is defined as a sequence of non-decreasing non-negative integers such that , represented as a diagram of boxes. See Figure 2 for a visual description. A boxed diagram can be defined as a pair of sequences where is a diagram and for all (such that ), the -th box of the column of the diagram is marked. Note that . Thus, a diagram can lead to boxed diagrams. A k-Dyck diagram of size is defined as a diagram of width and height such that for all . Finally, a k-Dyck boxed diagram of size is a boxed diagram where the first coordinate is a k-Dyck diagram of size . Given these definitions, Bassino and Nicaud [3] proved the following theorem.
Theorem 3.1 [3].
The set containing DFAs having states and working over a -letter alphabet is in bijection with the set of -Dyck boxed diagrams of size . Moreover, the construction involving going from transition diagram of the DFA to -Dyck boxed diagram and vice versa runs in linear time and space.
Thus, by applying the above theorem, from any given DFA with states and -letter alphabet, [3] produces a -Dyck boxed diagram of size , which can in turn be represented by two integer arrays and of length each. Furthermore, from these two arrays, it is possible to entirely reconstruct the DFA using the algorithm of Theorem 3.1. Thus, it is sufficient to store just these two arrays in order to encode any given DFA. For more details, readers are referred to [3]. For an example, see Figure 3, which will also serve as the working example for this part of our paper. In particular, the DFA of Figure 3 can be entirely encoded by the and arrays of length , and these can be computed using the algorithms of [3].
First, we observe that, by construction, the arrays satisfy and for each . This happens precisely because the translation is obtained by following a DFS on the DFA using the lexicographic order of words, and on each backtracking edge adding to the first vector the number of states scanned so far, and to the second vector the state reached. This also explains why each entry of these two arrays is upper bounded by $n$, the number of states of the given DFA. Now we consider the number of bits needed to encode the array . As it is an increasing integer sequence of length and the range of the values is , by using the data structure of Theorem 2.4, this array can be represented using bits of space. By letting , the size is bits if . If , the space is obviously bits. Next we consider the number of bits required for array . Because each entry of this array is an integer from to , we can use Theorem 2.5 to represent the array using (recall ) bits. Thus, in total, the size of the representation using two integer arrays is bits. Because the information theoretic lower bound is bits for the representation of DFA, this representation is succinct.
We consider a special case when there is a failure/exit state labeled $0$ and only transitions among all the transitions go to non-failure states. Note that has non-zero values. In this case we can reduce the space for by using a new bitvector which has ones. We use a new array which stores the non-zero values of . Then is computed as follows. If , (transition to the failure state). If , . If we use the data structure of Theorem 2.1, is represented in bits, which is asymptotically smaller than the space lower bound of . But, by using the data structure of Theorem 2.2, the bitvector can be represented in bits to support queries in time. The space for is bits. Therefore the total space for representing a DFA with non-failure transitions is bits.
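The failure-state compression described above can be sketched as follows in Python: a bitvector marks the entries that are non-zero, a shorter array keeps only those entries, and a rank query maps a position in the original array to a position in the shorter one. The names `compress` and `lookup`, the 0-based indexing, and the sample array are assumptions made for illustration; rank is computed by a linear scan instead of a succinct structure.

```python
def compress(targets):
    """Split an array into a marker bitvector and the list of its non-zero entries."""
    marks = [1 if v != 0 else 0 for v in targets]
    nonzero = [v for v in targets if v != 0]
    return marks, nonzero


def lookup(marks, nonzero, i):
    """Recover targets[i]: 0 means a transition to the failure state."""
    if marks[i] == 0:
        return 0
    r = sum(marks[: i + 1])      # rank of ones up to and including position i
    return nonzero[r - 1]


# Hypothetical array of transition targets in which 0 denotes the failure state.
targets = [0, 3, 0, 0, 2, 5, 0]
marks, nonzero = compress(targets)
print(marks, nonzero)            # [0, 1, 0, 0, 1, 1, 0] [3, 2, 5]
```

The space saving comes from storing one bit per transition plus a full-width entry only for the non-failure transitions.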
Even though this representation is optimal from the point of view of space occupancy, one major drawback is that, given a string $x$ over $\Sigma$, it takes linear time (in the size of the DFA, i.e., $O(n+m)$ time where $n$ is the number of states of the DFA and $m$ is the total number of transitions or edges in the state transition diagram of the DFA) to decide whether the DFA accepts the string $x$, which is clearly not optimal, as ideally this should be performed in time $O(|x|)$. This happens because the algorithm of Theorem 3.1 actually unravels the DFA from these two arrays and , and then checks whether the input string can be accepted or not. Thus, from the point of view of string acceptance, this encoding of DFA is not optimal, whereas from the point of view of space requirement, it is optimal. This motivates the need for a succinct encoding of a given DFA in which string acceptance can be decided in almost optimal time (i.e., in time almost proportional to the string length). In what follows, we provide such an encoding.
3.2. Succinct Data Structure for DFA
Data structure: To design a succinct data structure for DFA, we need the following three bitvectors , and in addition to an integer array (that can be obtained from the array of the previous section, as described later), which are defined as follows.
is a balanced parentheses sequence of length obtained from the lexicographic depth-first search (DFS) tree of the given input automaton . More specifically, given any DFA , we first perform the lexicographic DFS on to generate the lexicographic DFS tree of , i.e., while looking for a new edge to traverse during DFS, the algorithm always searches in lexicographic order of edge labels. For example, in Figure 3, from any vertex, lexicographic DFS first tries to traverse the edge labeled , followed by and finally . The tree is represented as a balanced parenthesis sequence together with auxiliary structures to support the navigational queries on , as mentioned in Theorem 2.3, using bits. The bitvector is used to mark all the final states of the input DFA, hence it takes bits.
Before explaining the other bitvector, , required for our succinct encoding, we want to explain the contents of Figure 4. The tree depicted in the figure is what we call an extended lexicographic DFS tree or extended lex-DFS tree (denoted by ) in short. If we delete the squared nodes and their incident edges (originating from the circled nodes), we obtain the lexicographic DFS tree of the automaton . Actually these edges represent the back edges/cross edges/forward edges [5] (i.e., non-tree edges) in the DFS tree of the automaton . Traditionally the vertices in the square are not drawn (as in our case of Figure 4), rather the edges point to the nodes in the circle only (hence all the nodes appear only once). We have chosen to draw and define the extended lex-DFS tree this way as it helps us to design and explain our succinct data structure well. Also note that edges originating from a circled node and going to another circled node represent tree edges, whereas edges from circled to squared nodes represent non-tree edges.
Now given the extended lex-DFS tree , we visit the nodes of in DFS order and append a bit string of length $\sigma$ for each vertex of marking which of its children are attached to via tree edges (marked with $1$) and which are attached to via non-tree edges (marked with $0$) in the lexicographic order of the edge labels. The string obtained this way is referred to as . Thus, is a bit-vector of length $\sigma n$ which captures the information about the tree and non-tree edges of . More specifically, it has exactly $n-1$ ones, which are in one-to-one correspondence with the tree edges of the lexicographic DFS tree of DFA , and exactly $(\sigma-1)n+1$ zeros, which correspond to the non-tree edges of the lexicographic DFS tree of DFA . See Figure 4 for an example. We relabel all the states of such that the -th vertex (state) in in preorder has label , and also modify the transition function accordingly. Now it is easy to see that, for the state with label (), the corresponding node in the lexicographic DFS tree has exactly outgoing edges, and we encode the tree edges among them using the bits in the range . More specifically, if and only if the outgoing edge labeled is a tree edge (). Similarly, we can also find the -th outgoing tree edge from the state by . Finally, we compress by observing that the positions of $1$s in the array form an increasing sequence, hence by using the data structure of Theorem 2.4, , and operations can be supported in constant time. By setting , can be encoded in bits.
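The construction of this bit-vector can be sketched in Python as follows. The sketch runs a lexicographic DFS from the initial state, assigns preorder labels, and emits one bit per (state, letter) pair: 1 for a tree edge and 0 for a non-tree edge. The input format (`delta[q][a]` with letters `0..sigma-1`), the name `lex_dfs_bits`, and the example automaton are assumptions; the resulting list is uncompressed.

```python
def lex_dfs_bits(delta, q0):
    """Return (preorder labels, tree/non-tree bit-vector) for an initially-connected DFA."""
    sigma = len(delta[0])
    order = {}                   # original state -> preorder label
    tree = set()                 # (original state, letter) pairs that are tree edges

    def visit(q):
        order[q] = len(order)
        for a in range(sigma):   # letters are tried in lexicographic order
            r = delta[q][a]
            if r not in order:
                tree.add((q, a))
                visit(r)

    visit(q0)
    by_label = sorted(order, key=order.get)
    bits = [1 if (q, a) in tree else 0 for q in by_label for a in range(sigma)]
    return order, bits


# Hypothetical 3-state DFA over {0, 1}; delta[q][a] is the target of letter a from q.
delta = [[2, 1], [1, 2], [1, 0]]
order, bits = lex_dfs_bits(delta, 0)
print(order)   # {0: 0, 2: 1, 1: 2}
print(bits)    # [1, 0, 1, 0, 0, 0]
```

In this example the bit-vector has length 6, with 2 ones (one per tree edge) and 4 zeros (one per non-tree edge), matching the counts stated above for 3 states and 2 letters.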
Now let us define the new integer array . First, observe that elements of the array are nothing but the leaves (i.e., node labels in the squared nodes) of the extended lex-DFS tree in the left to right order. More specifically, they are the node labels of the destinations of the non-tree edges emanating from the nodes of the lexicographic DFS tree of the automaton in their preorder. Instead of this specific ordering (followed in the array), lists the same node labels in the order of their appearance in the bitvector (from left to right). Note that, as mentioned previously, these nodes are marked by $0$s in and they are in one-to-one correspondence with all the non-tree edges of the lexicographic DFS tree of the automaton . Thus, the array contains the same node labels as the array, but in a different order. See Figure 4 for an example. This completes the description of our succinct data structure for DFA. Note that is no longer used in our data structure.
We now analyze the space complexity of our data structure. The array takes bits (by similar analysis as before for the array). As mentioned previously, we store using Theorem 2.4, hence it takes bits. The bitvector consumes bits. Finally, the bitvector is stored using Theorem 2.3, hence it occupies bits in total. Thus, overall our data structure uses bits. Hence, the data structure is succinct. It is easy to further reduce the size if the DFA has only non-failure transitions. Using the bitvector for indicating non-failure transitions, the array is compressed to non-zero values, and the total space is bits. In what follows, we describe the string acceptance query algorithm using our data structures.
Query algorithm. Suppose we are given an input string of length over , and we need to decide if the DFA accepts or not. We start the following procedure from the initial state (stored explicitly using bits) and repeat until the end of the input string . At any generic step, to figure out the transition function where are the states, we first look at the bit . If it is $1$, the outgoing edge labeled from state is a tree edge. Let . Then the outgoing edge is the -th tree edge of node in the lex DFS tree. Therefore (supported using Theorem 2.3). If the bit is $0$, the outgoing edge labeled from state is a non-tree edge. Let . Then the edge is the -th non-tree edge in the DFA, and is obtained by . Hence, when we reach the end of , if we are at an accepting/final state (which can be determined from the bitvector ), we say that the DFA accepts . The operations on take time while all other operations, at each step, take time. Thus the overall run time for checking the membership of an input string is . This completes the proof of Theorem 1.1.
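The query procedure can be sketched end to end in Python. The sketch below builds uncompressed stand-ins for the components described above (the tree/non-tree bit-vector, the array of non-tree edge targets in bit-vector order, the final-state bits) and answers each transition with one rank computation. All names are hypothetical; a per-node child list replaces the balanced parentheses structure, and ranks are computed by scanning, so only the logic, not the space or time bounds, is reproduced.

```python
def build(delta, q0, finals):
    """Uncompressed stand-in for the DFA structure; delta[q][a], letters 0..sigma-1."""
    n, sigma = len(delta), len(delta[0])
    order, tree = {}, set()

    def visit(q):                          # lexicographic DFS with preorder labels
        order[q] = len(order)
        for a in range(sigma):
            if delta[q][a] not in order:
                tree.add((q, a))
                visit(delta[q][a])

    visit(q0)
    states = sorted(order, key=order.get)  # original names listed in preorder
    bits, nontree, children = [], [], [[] for _ in range(n)]
    for q in states:
        for a in range(sigma):
            target = order[delta[q][a]]
            if (q, a) in tree:
                bits.append(1)
                children[order[q]].append(target)  # stand-in for "j-th child"
            else:
                bits.append(0)
                nontree.append(target)
    final_bits = [1 if q in finals else 0 for q in states]
    return sigma, bits, nontree, children, final_bits


def accepts(ds, x):
    sigma, bits, nontree, children, final_bits = ds
    q = 0                                   # the initial state has preorder label 0
    for a in x:
        pos = q * sigma + a
        if bits[pos] == 1:                  # tree edge: go to the j-th child of q
            j = sum(bits[q * sigma: pos + 1])
            q = children[q][j - 1]
        else:                               # non-tree edge: k-th zero overall
            k = (pos + 1) - sum(bits[: pos + 1])
            q = nontree[k - 1]
    return final_bits[q] == 1


def simulate(delta, q0, finals, x):         # plain table lookup, used as a reference
    q = q0
    for a in x:
        q = delta[q][a]
    return q in finals


delta = [[2, 1], [1, 2], [1, 0]]            # hypothetical 3-state DFA over {0, 1}
ds = build(delta, 0, {1})
print(accepts(ds, [1]), accepts(ds, [0, 1]))  # True False
```

A tree edge is resolved by counting ones within the block of the current state, and a non-tree edge by counting zeros in the whole prefix, which mirrors the two cases of the query algorithm.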
Remark: In light of the above discussion, consider the following. Suppose we are given as input a succinct representation for a DFA whose language is , and our goal is to construct the succinct representation for the DFA (say ) which accepts the complement of , i.e., . In order to construct the succinct representation for , we start with the succinct representation for (that is given in terms of three bit vectors and the integer array ), and simply convert (in the array) each final state in into a non-final state in and each non-final state in into a final state in , without changing any other data structures. As a consequence, it is easy to see that we end up with the desired representation.
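The complement operation in the remark amounts to flipping the final-state marks. A minimal Python sketch, assuming the final states are stored as a list of bits indexed by state and using a plain transition table for checking (both hypothetical choices), is:

```python
def complement(final_bits):
    """Swap final and non-final states; nothing else in the structure changes."""
    return [1 - bit for bit in final_bits]


def run(delta, final_bits, x, q=0):
    for a in x:
        q = delta[q][a]
    return final_bits[q] == 1


# Hypothetical DFA over {0, 1}: accepts strings with an even number of 0s.
delta = [[1, 0], [0, 1]]
F = [1, 0]
co_F = complement(F)
print(run(delta, F, [0, 0]), run(delta, co_F, [0, 0]))  # True False
```

Because the automaton is deterministic and complete, every string ends in exactly one state, so the flipped marks accept exactly the strings the original rejects.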
3.3. Succinct Data Structures for Acyclic DFA
As mentioned previously, an acyclic DFA with total states always has a unique dead state and transient (i.e., non-dead) states. Another way to visualize this is to observe that the state transition diagram of does not have any cycles except at the unique dead state. Given such a setting, one can always use the succinct encoding (of the previous section) of an arbitrary DFA to represent them. In that case, we end up using bits of space. In what follows, we show that by exploiting the acyclic property, one can obtain an improved space bound for representing .
We basically view the state transition diagram of as a directed acyclic graph with a single source (i.e., the initial state), and a single sink i.e., the dead state (call it ). Given this, we first construct a spanning tree of where (i.e., the set of states of ) and by making the dead state the root of this tree. It is easy to see that such a spanning tree can always be constructed. By applying Theorem 2.3, we encode the structure of using bits to support the navigational queries on (in particular, the parent query) in time. As done previously in Section 3.2 while constructing the succinct data structures for DFA, here also we relabel all the states of such that the -th vertex (state) in in preorder has label , and modify the transition function accordingly. Note that the dead state is labeled with label 0 in this ordering, and we do not need to store the transition function for the dead state. We also mark in a bitvector of size all the final states of , and we store the label of the start state. We then store a two-dimensional array such that using the data structure of Theorem 2.5. Thus, the overall space usage is bits.
In what follows, we explain how to check if accepts any given string over . At any generic step, to compute , we simply output if ; otherwise (i.e., if ) the value of is given by the parent of in i.e., . Thus can be computed in constant time, and hence we can optimally decide if accepts in time proportional to the length of . This completes the proof of Theorem 1.2.
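The lookup rule just described can be sketched as follows. The exact contents of the two-dimensional array are not recoverable from the text here, so this sketch makes its own assumptions: a hypothetical sentinel `PARENT = -1` marks entries whose target is the parent in the spanning tree, the parent of a transient state is chosen as its target on the first letter (which always leads towards the dead state in an acyclic DFA), the dead state is labelled 0, and an ordinary list replaces the succinct tree of Theorem 2.3.

```python
PARENT = -1  # hypothetical sentinel: "the target is my parent in the spanning tree"


def encode_acyclic(delta, dead=0):
    """delta[q][c] is the transition table of an acyclic DFA with dead state `dead`."""
    n = len(delta)
    parent = [dead] * n
    table = [None] * n                  # no row is stored for the dead state
    for q in range(n):
        if q == dead:
            continue
        parent[q] = delta[q][0]         # assumption: first-letter target is the tree parent
        table[q] = [PARENT if t == parent[q] else t for t in delta[q]]
    return parent, table


def acyclic_accepts(parent, table, start, finals, word, dead=0):
    q = start
    for c in word:
        if q == dead:                   # the dead state only loops to itself
            return False
        t = table[q][c]
        q = parent[q] if t == PARENT else t
    return q in finals
```

For instance, the acyclic DFA with `delta = [[0, 0], [0, 3], [1, 3], [0, 0]]`, start state 2 and final set `{3}` accepts exactly the words `[0, 1]` and `[1]`, and every transition is resolved with a single table or parent lookup.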
3.4. Succinct Encoding for NFA
As mentioned previously in Section 1.2, to encode an initially connected NFA on states over a -letter alphabet with a fixed initial state and one or more final states, we need at least bits. In what follows, we show a very simple scheme achieving this bound.
We store a table having rows (corresponding to the states of the input NFA) and columns (corresponding to each letter of the alphabet ). The entry (where and ) basically stores the corresponding transition function of the NFA i.e., where and . Now for an NFA, is a subset of . If we store this subset explicitly, it might take bits in the worst case per transition of the NFA, leading to overall bits which is a multiplicative factor off from the optimal space requirement. Instead we simply store the characteristic vector of the subset (of length , marking the corresponding states from the subset as , and rest of the bits in are 0) where the state labeled of the NFA moves to after reading the letter . Thus, the overall size of is exactly bits. Finally, we also mark in a separate bitvector (of length ) all the final states of the input NFA. Thus, in total the size of our encoding is given by bits, which matches the lower bound. Hence, our encoding is succinct and optimal.
Now using our encoding, we can simply implement the classical algorithm (given in the texts of [14, 23]) for checking if the NFA accepts a given input string or not, and this runs in time where is the input string and denotes its length. Note that we also need two bitvectors of length each (hence overall bits) as working space to mark two sets of intermediate states between successive transitions while executing the string acceptance checking algorithm. Hence, we obtain the result mentioned in Theorem 1.3.
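A minimal sketch of this encoding and of the classical subset-tracking acceptance check is given below. It assumes an NFA without epsilon-transitions and uses Python integers as bitsets in place of length-n bitvectors; the function names and the triple format for transitions are illustrative choices, not taken from the paper.

```python
def encode_nfa(n, sigma, transitions, finals):
    """transitions is an iterable of (state, letter, target) triples.
    table[q][c] is the characteristic vector (as an int bitset) of delta(q, c)."""
    table = [[0] * sigma for _ in range(n)]
    for q, c, t in transitions:
        table[q][c] |= 1 << t
    final_mask = 0
    for f in finals:
        final_mask |= 1 << f
    return table, final_mask


def nfa_accepts(table, final_mask, word, start=0):
    current = 1 << start                 # first working bitvector: states reached so far
    for c in word:
        nxt = 0                          # second working bitvector: states after reading c
        for q in range(len(table)):
            if (current >> q) & 1:
                nxt |= table[q][c]       # union of the stored characteristic vectors
        current = nxt
    return (current & final_mask) != 0
```

The two integers `current` and `nxt` play the role of the two working bitvectors mentioned above; everything else is read directly from the stored table.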
4. Concluding Remarks
We considered the problem of succinctly encoding any given DFA , acyclic DFA or NFA so as to check efficiently if they accept a given input string. To this end, we designed succinct data structures for DFAs, acyclic DFAs, and NFAs that support the string acceptance query efficiently. To the best of our knowledge, our work is the first attempt to encode mathematical models from the world of automata theory using the lens of succinct data structures, and we believe that our work will spur further interest in other similar problems in the future.
Appendix A
A.1. Supporting More Operations (Union and Intersection)
In what follows we show how to support some standard operations on DFAs space efficiently. We start with the classical example of product automaton construction. More specifically, given the succinct representation of two DFAs, we want to construct a succinct representation of the product automaton accepting the language which is the union/intersection of the two input DFAs’ languages. Before providing our construction, let us formally define the product automaton construction. Suppose we are given two DFAs and represented succinctly (as described in Section 3.2) and both working over the same alphabet . Then a product automaton (denoted by ) of and is defined as follows, where , and . Moreover, for any and , . The start state of is the pair whereas the set of final states can be defined in multiple ways. More specifically, if we set , then . Similarly, if we set , then . Now we show how one can directly construct a succinct representation of given the succinct representations of and as input, and note that, to do so we just need to describe how one can create the three bitvectors and the integer array corresponding to from the succinct representations of and directly. See Figure 5 and Figure 6 for a visual description of our product automaton construction algorithm.
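To make the definition concrete, here is a short sketch of the textbook product construction on plain transition tables. It is not the succinct construction of this appendix: it only builds the reachable pairs, assigns them identifiers in discovery (breadth-first) order rather than lexicographic DFS order, and assumes both DFAs are complete, share an alphabet, and have start state 0. The function name and the `mode` parameter are hypothetical.

```python
def product_dfa(d1, f1, d2, f2, mode="intersection"):
    """d1, d2: transition tables; f1, f2: sets of final states.
    Returns (delta, finals) of the reachable product automaton, start state 0."""
    sigma = len(d1[0])
    ids = {(0, 0): 0}                  # hash table: pair of states -> product state id
    pairs = [(0, 0)]                   # product state id -> pair of states
    delta = []
    i = 0
    while i < len(pairs):
        p, q = pairs[i]
        row = []
        for c in range(sigma):
            t = (d1[p][c], d2[q][c])   # componentwise transition
            if t not in ids:
                ids[t] = len(pairs)
                pairs.append(t)
            row.append(ids[t])
        delta.append(row)
        i += 1
    if mode == "union":
        finals = {k for k, (p, q) in enumerate(pairs) if p in f1 or q in f2}
    else:
        finals = {k for k, (p, q) in enumerate(pairs) if p in f1 and q in f2}
    return delta, finals
```

Only the choice of final states differs between the union and intersection variants; the states and transitions are identical in both cases.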
For constructing the product automaton , our high level idea is to create the states and transitions of by generating the states of in the lexicographic DFS order using two passes. In the first pass, we generate the and arrays (both initialized with empty string), and this is followed by the construction of the array in the second pass. More specifically, we start by creating the initial state i.e., as the first circled node i.e., root in the extended lex-DFS tree corresponding to , store an entry corresponding to this node in the hash table along with storing its preorder number (which is in the case of ) as a satellite data in the hash table. Also we append zero bits to corresponding to the root. In general, at any point of time during the execution of this algorithm, the hash table stores an entry corresponding to each of the circled nodes generated up to that point along with storing its preorder number and its parent node as satellite data. Note that for the root, we don’t need to store any parent information. Now to figure out the transitions out of any state, note that, if we use the method described in the query algorithm for DFA (as described in Section 3.2) we need to pay time per symbol of the alphabet . Instead, in what follows, we show how one can find each transition in time per symbol out of any state using all the information that is already stored in the input i.e., succinct representations for and . Assume for now that we can do so and also suppose that at some point of the algorithm, we created a new circled node . Then we proceed as follows. First we append zero bits to the bit string corresponding to the node . This is followed by the expansion of the state by generating the transitions in the lexicographic ordering of the alphabet characters , as follows. Let and , then we check in the hash table if the state has already been created before (by checking membership in the hash table).
If yes, we create a squared node as a child node of (which is a circled node) and don’t make any changes to the array, mark the -th bit corresponding to the node in as zero; and continue with the expansion of with the next character in . If not, we create a circled node as a child of , append an open parenthesis to the array constructed so far, mark the -th bit corresponding to the node in as one, and finally insert into the hash table along with inserting as its parent and its preorder number as its satellite data; and continue with the expansion of . Finally, when we exhaust checking all the characters out of , we backtrack to the parent of in the extended lex-DFS tree (using the parent information stored as a satellite data with the entry for the node ), and in this case, we simply append a close parenthesis to the array constructed so far. It is clear that using this procedure repeatedly we can successfully create and arrays corresponding to the product automaton . Finally, we create all the auxiliary structures (mentioned in Section 2.2) on top of the arrays and (similar to the succinct data structure for DFA as described in Section 3.2) for supporting various navigational queries on the extended lex-DFS tree. Intuitively the array stores the topology of the extended lex-DFS tree of the state transition diagram of the product automaton and the array stores the parent-child relationship between the nodes of the extended lex-DFS tree in a compact manner. Now let us discuss how to find the transitions efficiently. Note that it suffices to describe how one can find in ( in can be found similarly). We consider the two cases: when the edge is a (i) non-tree edge, or a (ii) tree edge. In case (i), . In case (ii), (can be supported using Theorem 2.3 on the array) where .
In what follows, we describe how one can fill up the integer array with (we discuss fixing later) entries which are all initialized to one. Note that, similar to the succinct DFA construction, this array should contain the preorder number of the node labels in the squared nodes of the extended lex-DFS tree in the order of their appearance in the bitvector (from left to right). Moreover, these nodes are marked by 0s in and they are in one-to-one correspondence with all the non-tree edges of the extended lex-DFS tree of the product automaton . To fill up array, we follow essentially the same lexicographic DFS traversal procedure as we did in the first pass except the following. More specifically, we start the second pass of the extended lex-DFS tree and whenever we encounter a non-tree edge, we retrieve the preorder number corresponding to the node label in the squared node (i.e., the other end point of that non-tree edge) from the hash table, and insert this number at the suitable position in the array. In detail, suppose we are at a circled node (with preorder number, say, ) and currently exploring the transition with the letter out of . Also assume that and is a squared node (i.e., is a non-tree edge) such that the preorder number associated with the node label is in the hash table. Then, we assign where . Finally, depending on union or intersection operation, we also mark in another bitvector (according to the definition given above) all the final states of the product automaton . Observe that once we have all the constituent data structures (including all the auxiliary data structures that we build on top of arrays and the integer array ) for the succinct representation for ready, we can essentially use the same query algorithm for string acceptance checking as we described for DFA in Section 3.2.
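The two passes can be sketched as follows, under several simplifying assumptions. The inputs are plain transition tables rather than succinct representations, so the fast transition lookup is not modelled. A Python dict plays the role of the hash table, and recursion replaces the stored parent pointers used for backtracking. The root also receives a pair of parentheses, the second pass scans the nodes in preorder instead of repeating the DFS (this visits the non-tree edges in the same left-to-right order as the 0 bits), and preorder numbers start at 0. The names `P`, `B` and `N` are hypothetical labels for the parenthesis sequence, the tree-edge bitvector and the integer array.

```python
def succinct_product(d1, d2):
    """d1, d2: complete DFA transition tables over the same alphabet, start state 0.
    Returns (P, B, N, order) for the reachable product automaton."""
    sigma = len(d1[0])
    pre = {}            # hash table: pair of states -> preorder number
    order = []          # preorder number -> pair of states
    P, B = [], []       # parenthesis sequence and tree-edge bits

    def expand(pair):   # first pass: lexicographic DFS over product states
        p = pre[pair] = len(order)
        order.append(pair)
        P.append("(")
        B.extend([0] * sigma)              # sigma zero bits for the new circled node
        for c in range(sigma):
            t = (d1[pair[0]][c], d2[pair[1]][c])
            if t not in pre:               # new state: circled child, tree edge
                B[p * sigma + c] = 1
                expand(t)
            # otherwise: squared node, the bit stays 0
        P.append(")")                      # backtrack to the parent

    expand((0, 0))
    # Second pass: preorder numbers of non-tree edge targets, in the order of 0 bits in B.
    N = [pre[(d1[a][c], d2[b][c])]
         for p, (a, b) in enumerate(order)
         for c in range(sigma)
         if B[p * sigma + c] == 0]
    return "".join(P), B, N, order
```

With m reachable product states, the output has 2m parentheses, m·σ bits in `B` of which m − 1 are set, and one entry of `N` for each remaining 0 bit.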
Let us analyze the resource requirements for our algorithm. Suppose and , then the product automaton can have states in the worst case, but in general it could have far fewer. Let us suppose that has states, then , and in what follows, we write our space requirement as a function of . If we implement the hash table using the data structure of [7], then it consumes bits in total. Also note that this is the dominating term for the working space bound as other auxiliary data structures consume negligible space with respect to the space consumption for the hash table. Moreover, our algorithm runs in linear (in ) expected time overall. The randomized nature of our algorithm is due to the use of the hashing data structure of [7], whereas all the other parts of our algorithm are deterministic. As a result of our algorithm, we generate a representation for and this is given by the following arrays. The bitvectors and consume , bits respectively. For the array, we compress it by observing that the positions of s in the array form an increasing sequence, hence by using the data structure of Theorem 2.4, , and operations can be supported in constant time, and by setting , can also be encoded in bits. Finally, the array has entries where and each entry could be up to . Thus, using the data structure of Theorem 2.5, can be encoded using bits. Thus, our algorithm produces a representation of the product automaton using bits overall, and this is succinct. This completes the description of the product automaton construction algorithm as stated in Theorem 1.4.
References
- [1] H. Acan, S. Chakraborty, S. Jo, and S. R. Satti. Succinct data structures for families of interval graphs. In WADS, 2019.
- [2] L. C. Aleardi, O. Devillers, and G. Schaeffer. Succinct representations of planar maps. Theor. Comput. Sci., 408(2-3):174–187, 2008.
- [3] F. Bassino and C. Nicaud. Enumeration and random generation of accessible automata. Theor. Comput. Sci., 381(1-3):86–104, 2007.
- [4] D. R. Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Canada, 1996.
- [5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
- [6] R. Diestel. Graph Theory, 4th Edition, volume 173 of Graduate Texts in Mathematics. Springer, 2012.
- [7] M. Dietzfelbinger, A. R. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994.
- [8] Y. Dodis, M. Patrascu, and M. Thorup. Changing base without losing space. In STOC, pages 593–602, 2010.
