Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces
Takamasa Okudono, Masaki Waga, Taro Sekiyama, Ichiro Hasuo

TL;DR
This paper introduces a novel method for extracting weighted finite automata from RNNs using regression techniques, enhancing the interpretability and analysis of neural network internal states.
Contribution
It extends existing automaton extraction methods by incorporating regression for equivalence queries, enabling weighted automaton extraction from RNNs.
Findings
High accuracy in automaton extraction
Improved expressivity over previous DFA-based methods
Efficient extraction process demonstrated
Abstract
We present a method to extract a weighted finite automaton (WFA) from a recurrent neural network (RNN). Our algorithm is based on the WFA learning algorithm by Balle and Mohri, which is in turn an extension of Angluin's classic $L^*$ algorithm. Our technical novelty is in the use of regression methods for the so-called equivalence queries, thus exploiting the internal state space of an RNN to prioritize counterexample candidates. This way we achieve a quantitative/weighted extension of the recent work by Weiss, Goldberg and Yahav that extracts DFAs. We experimentally evaluate the accuracy, expressivity and efficiency of the extracted WFAs.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 2.17 / 286 | 2.39 / 338 | 26.8 / 165 | 9.77 / 279 | 4.36 / 545 | 4.07 / 716 | 2.33 / 1390 |
| | 2.45 / 1787 | 2.54 / 1302 | 6.99 / 386 | 4.48 / 641 | 4.08 / 1218 | 3.15 / 1410 | 2.28 / 2480 |
| | 4.68 / 7462 | 4.46 / 5311 | 22.5 / 928 | 11.9 / 1562 | 5.90 / 3521 | 4.55 / 3638 | 3.55 / 5571 |
| | 5.62 / 8941 | 5.78 / 8564 | 21.2 / 2155 | 10.6 / 4750 | 7.87 / 5692 | 5.71 / 7344 | 5.27 / 7612 |
| | 3.70 / 7610 | 3.79 / 7799 | 6.24 / 2465 | 10.1 / 2188 | 6.13 / 3106 | 3.70 / 5729 | 3.63 / 7473 |
| | 7.34 / 9569 | 5.52 / 10000 | 13.5 / 3227 | 8.01 / 6765 | 6.07 / 7916 | 5.98 / 8911 | 6.17 / 8979 |
| | 8.44 / 10000 | 5.58 / 9981 | 16.3 / 2675 | 9.24 / 4850 | 7.28 / 5135 | 9.88 / 7204 | 6.44 / 8425 |
| | 9.16 / 7344 | 5.15 / 7857 | 13.7 / 2224 | 7.26 / 3823 | 6.60 / 5744 | 4.96 / 5674 | 4.01 / 9464 |
| Total | 5.45 / 6625 | 4.40 / 6394 | 15.9 / 1778 | 8.92 / 3107 | 6.04 / 4110 | 5.25 / 5078 | 4.21 / 6549 |
Weighted Automata Extraction from Recurrent Neural Networks
via Regression on State Spaces
Takamasa Okudono, Masaki Waga, Taro Sekiyama, Ichiro Hasuo
National Institute of Informatics & The Graduate University for Advanced Studies
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430
{tokudono,mwaga,sekiyama,hasuo}@nii.ac.jp
Abstract
We present a method to extract a weighted finite automaton (WFA) from a recurrent neural network (RNN). Our method is based on the WFA learning algorithm by Balle and Mohri, which is in turn an extension of Angluin’s classic $L^*$ algorithm. Our technical novelty is in the use of regression methods for the so-called equivalence queries, thus exploiting the internal state space of an RNN to prioritize counterexample candidates. This way we achieve a quantitative/weighted extension of the recent work by Weiss, Goldberg and Yahav that extracts DFAs. We experimentally evaluate the accuracy, expressivity and efficiency of the extracted WFAs.
1 Introduction
Background
Deep neural networks (DNNs) have been successfully applied to domains such as text, speech, and image processing. Recurrent neural networks (RNNs) (?; ?) are a class of DNNs equipped with the capability of processing sequential data of variable length. The great success of RNNs has been seen in, e.g., machine translation (?), speech recognition (?), and anomaly detection (?; ?).
While it has been experimentally shown that RNNs are a powerful tool to process, predict, and model sequential data, RNNs have known drawbacks such as poor interpretability and costly inference. A research line that attacks this challenge is automata extraction (?; ?). Focusing on RNNs’ use as acceptors (i.e., receiving an input sequence and producing a single Boolean output), these works extract a finite-state automaton from an RNN as a succinct and interpretable surrogate. Automata extraction exposes the internal transitions between the states of an RNN in the form of an automaton, which is then amenable to algorithmic analyses such as reachability and model checking (?). Automata extraction can also be seen as model compression: finite-state automata are usually more compact, and cheaper to run, than neural networks.
Extracting WFAs from RNNs
Most of the existing automata extraction techniques target Boolean-output RNNs, which excludes many applications. In sentiment analysis, it is desirable to know the quantitative strength of a sentiment, besides its (Boolean) existence (?). RNNs with real values as their output are also useful in classification tasks. For example, predicting class probabilities is key to some approaches to semi-supervised learning (?) and ensemble (?).
This motivates extraction of quantitative finite-state machines as abstraction of RNNs. We find the formalism of weighted finite automata (WFAs) suited for this purpose. A WFA is a finite-state machine—much like a deterministic finite automaton (DFA)—but its transitions as well as acceptance values are real numbers (instead of Booleans).
Contribution: Regression-Based WFA Extraction from RNNs
Our main contribution is a procedure that takes a (real-output) RNN $R$ and returns a WFA $A$ that abstracts $R$. The procedure is based on the WFA learning algorithm in (?), which is in turn based on the famous $L^*$ algorithm for learning DFAs (?). These algorithms learn automata by a series of so-called membership queries and equivalence queries. In our procedure, a membership query is implemented by an inference of the given RNN $R$. We iterate membership queries and use their results to construct a WFA $A$ and to grow it.
The role of equivalence queries is to say when to stop this iteration: it asks if $A$ and $R$ are “equivalent,” that is, if the WFA $A$ obtained so far is comprehensive enough to cover all the possible behaviors of $R$. This is not possible in general (RNNs are more expressive than WFAs), so we inevitably resort to an approximate method. Our technical novelty lies in the method for answering equivalence queries; notably it uses a regression method (e.g., the Gaussian process regression (GPR) or the kernel ridge regression (KRR)) for abstraction of the state space of $R$.
We conducted experiments to evaluate the effectiveness of our approach. In particular, we are concerned with the following questions:
1) how similar the behavior of the extracted WFA $A$ and that of the original RNN $R$ are;
2) how applicable our method is to an RNN that is a more expressive model than WFAs; and
3) how efficient the inference of the WFA $A$ is, when compared with the inference of $R$.
The experiments we designed for the questions 1) and 3) are with RNNs trained using randomly generated WFAs. The results show, on the question 1), that the WFAs extracted by our method approximate the original RNNs accurately. This is especially so when compared with a baseline algorithm (a straightforward adaptation of ? (?)). On the question 3), the inference of the extracted WFAs is about 1300 times faster than that of the original RNNs. On the question 2), we devised an RNN that models a weighted variant of a (non-regular) language of balanced parentheses. Although this weighted language is beyond WFAs’ expressivity, we found that our method extracts WFAs that successfully approximate the RNN up to a certain depth bound.
The paper is organized as follows. Angluin’s $L^*$ algorithm and its weighted adaptation are recalled in §2. Our WFA extraction procedure is described in §3, focusing on our novelty, namely the regression-based procedure for answering equivalence queries. Comparison with the DFA extraction by ? (?) is given there, too. In §4 we discuss our experimental results.
Potential Applications
A major potential application of WFA extraction is to analyze an RNN $R$ via its interpretable surrogate $A$. The theory of WFAs offers a number of analysis methods, such as the bisimulation metric (?), a distance notion between WFAs, that will allow us to tell how far apart two RNNs are.
Another major potential application is as “a poor man’s RNN $R$.” While RNN inference is recognized to be rather expensive (especially for edge devices), simpler models by WFAs should be cheaper to run. Indeed, our experimental results show (§4) that WFA inference is about 1300 times faster than inference of the original RNNs.
Since WFAs are defined over a finite alphabet, we restrict to RNNs that take sequences over a finite alphabet. This restriction should not severely limit the applicability of our method. Indeed, such RNNs (over a finite alphabet) have been successfully applied in many domains, including intrusion prediction (?), malware detection (?), and DNA-protein binding prediction (?). Moreover, even if inputs are real numbers, quantization is commonly employed without losing much precision. See, e.g., recent (?).
Related Work
The relationship between RNNs and automata has been studied in both non-quantitative (?; ?; ?; ?) and quantitative (?; ?) settings. Some of these works feature automata extraction from RNNs; we shall now discuss recent ones among them.
The work by ? (?) is a pioneer in automata extraction from RNNs. They extract DFAs from RNNs, using a variation of the $L^*$ algorithm, much like this work. We provide a systematic comparison in §3.3, identifying some notable similarities and differences.
? (?) extract a WFA from a black-box sequence acceptor, an example of which is an RNN. Their method does not use equivalence queries; in contrast, we exploit the internal state space of an RNN to approximately answer equivalence queries.
DeepStellar (?) extracts Markov chains from RNNs, and uses them for coverage-based testing and adversarial sample detection. Their extraction, unlike our $L^*$-like method, uses profiling from the training data and discrete abstraction of the state space.
? (?) propose a new RNN architecture that makes it easier to extract a DFA from a trained model. To apply their method, one has to modify the structure of an RNN before training, while our method does not require any special structure of RNNs and can be applied to already trained RNNs.
? (?) introduce a neural network architecture that can represent (restricted forms of) CNNs and RNNs. WFAs could also be expressed by their architecture, but extraction of automata is out of their interest.
A major approach to optimizing neural networks is by compression: model pruning (?), quantization (?), and distillation (?). Combination and comparison with these techniques is interesting future work.
2 Preliminaries
We fix a finite alphabet $\Sigma$. The set of (finite-length) words over $\Sigma$ is $\Sigma^*$. The empty word (of length $0$) is denoted by $\varepsilon$. The length of a word $w\in\Sigma^*$ is denoted by $|w|$.
We recall basic notions on WFAs. See (?) for details.
Definition 1 (WFA).
A weighted finite automaton (WFA) over $\Sigma$ is a quadruple $A=\bigl(Q_{A},\alpha_{A},\beta_{A},(A_{\sigma})_{\sigma\in\Sigma}\bigr)$. Here $Q_A$ is a finite set of states; $\alpha_A,\beta_A\in\mathbb{R}^{Q_A}$ are vectors of size $|Q_A|$ called the initial and final vectors (treated as column vectors, so that $\alpha_A^{\top}$ is a row vector); and $A_\sigma$ is a transition matrix of $A$, given for each $\sigma\in\Sigma$. For each $\sigma\in\Sigma$, $A_\sigma$ is a matrix of size $|Q_A|\times|Q_A|$.
Definition 2 (configuration of a WFA).
Let $A$ be the WFA in Def. 1. A configuration of $A$ is a row vector $x\in\mathbb{R}^{Q_A}$. For a word $w=\sigma_1\sigma_2\cdots\sigma_n$ (where $\sigma_i\in\Sigma$), the configuration $\delta_A(w)$ of $A$ at $w$ is defined by $\delta_{A}(w)=\alpha_{A}^{\top}\cdot\bigl(\prod_{i=1}^{n}A_{\sigma_{i}}\bigr)$.
Obviously $\delta_A(w)$ is a row vector of size $|Q_A|$; it records the weight at each state after reading $w$.
Definition 3 (weight of a word in a WFA).
Let a WFA $A$ and a word $w$ be as in Def. 2. The weight of $w$ in $A$ is given by $f_{A}(w)=\alpha_{A}^{\top}\cdot\bigl(\prod_{i=1}^{n}A_{\sigma_{i}}\bigr)\cdot\beta_{A}$, multiplying the configuration at $w$ by the final vector.
Example 4.
Let , , , , $A_{a}=\begin{pmatrix}1&2&-1\\ 3&0&0\\ 0&4&0\end{pmatrix}$, and $A_{b}=\begin{pmatrix}-1&1&0\\ 0&3&0\\ -2&4&0\end{pmatrix}$. For the WFA over , and , the configuration and the weight are as follows.
[TABLE]
[TABLE]
Fig. 1 illustrates the WFA $A$, where the transitions with weight $0$ are omitted.
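Definitions 2 and 3 amount to a chain of vector-matrix products, which the following minimal Python sketch makes concrete. It reuses the transition matrices $A_a$ and $A_b$ of Example 4; the initial and final vectors in the sketch are hypothetical placeholders chosen only for illustration (they are not the ones of the example), and the function names are likewise assumptions.

```python
import numpy as np

def wfa_configuration(alpha, transitions, word):
    """Row vector delta_A(w) = alpha^T . A_{s1} . ... . A_{sn} (Def. 2)."""
    config = np.asarray(alpha, dtype=float)
    for symbol in word:
        config = config @ transitions[symbol]
    return config

def wfa_weight(alpha, beta, transitions, word):
    """Weight f_A(w) = delta_A(w) . beta (Def. 3)."""
    return float(wfa_configuration(alpha, transitions, word) @ np.asarray(beta, dtype=float))

# Transition matrices taken from Example 4.
transitions = {
    "a": np.array([[1, 2, -1], [3, 0, 0], [0, 4, 0]], dtype=float),
    "b": np.array([[-1, 1, 0], [0, 3, 0], [-2, 4, 0]], dtype=float),
}
# Hypothetical initial and final vectors (assumptions, not from the example).
alpha = np.array([1.0, 0.0, 0.0])
beta = np.array([0.0, 1.0, 0.0])

print(wfa_configuration(alpha, transitions, "ab"))
print(wfa_weight(alpha, beta, transitions, "ab"))
```

Reading a word costs one vector-matrix product per symbol, which is the reason WFA inference can be much cheaper than RNN inference.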
Definition 5 (DFA).
A DFA is defined much like in Def. 1, except that
- the entries of matrices are the Boolean values true and false;
- we replace the use of $+$ and $\times$ with $\lor$ and $\land$, respectively; and
- we impose determinacy, that exactly one entry is true in each row of $A_\sigma$, and that only one entry is true in the initial vector $\alpha_A$.
The definitions of $\delta_A$ and $f_A$ in Def. 2–3 adapt to DFAs. For $w\in\Sigma^*$, the configuration vector $\delta_A(w)$ has exactly one true entry; the state whose entry is true is called the $w$-successor of $A$. $w$ is accepted by $A$ if $f_A(w)$ is true.
Recurrent Neural Networks
Our view of a recurrent neural network (RNN) is almost a black-box. We need only the following two operations: feeding an input word and observing its output (a real number); and additionally, observing the internal state (a vector) after feeding a word. This allows us to model RNNs in the following abstract way.
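Definition 6 (RNN).
Let $d$ be a natural number called a dimension. A (real-valued) RNN is a triple $R=(\alpha_R,\beta_R,g_R)$, where $\alpha_R\in\mathbb{R}^d$ is an initial state, $\beta_R\colon\mathbb{R}^d\to\mathbb{R}$ is an output function, and $g_R\colon\mathbb{R}^d\times\Sigma\to\mathbb{R}^d$ is called a transition function. The set $\mathbb{R}^d$ is called a state space.
Definition 7 (RNN configuration $\delta_R$, output $f_R$).
Let $R$ be the RNN in Def. 6. The transition function $g_R$ naturally extends to words as follows: $g^{*}_{R}\colon\mathbb{R}^d\times\Sigma^*\to\mathbb{R}^d$, defined inductively by $g^{*}_{R}(x,\varepsilon)=x$ and $g^{*}_{R}(x,w\sigma)=g_{R}\bigl(g^{*}_{R}(x,w),\sigma\bigr)$, where $w\in\Sigma^*$ and $\sigma\in\Sigma$.
The configuration $\delta_R(w)$ of the RNN $R$ at a word $w$ is defined by $\delta_R(w)=g^{*}_{R}(\alpha_R,w)$. The output $f_R(w)\in\mathbb{R}$, of $R$ for the input $w$, is defined by $f_{R}(w)=\beta_{R}\bigl(\delta_{R}(w)\bigr)$.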
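The abstract view of Def. 6 and 7 can be sketched in Python as a small wrapper that exposes exactly the two operations the extraction needs: the output for a word and the internal state after a word. The class name `AbstractRNN`, its field names, and the toy `tanh` cell below are hypothetical illustrations, not the LSTM networks used in the experiments.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class AbstractRNN:
    initial_state: np.ndarray                             # initial state in R^d
    transition: Callable[[np.ndarray, str], np.ndarray]   # g_R : R^d x Sigma -> R^d
    output: Callable[[np.ndarray], float]                 # beta_R : R^d -> R

    def extend(self, state, word):
        """g*_R(x, w): fold the transition function over the word."""
        for symbol in word:
            state = self.transition(state, symbol)
        return state

    def configuration(self, word):
        """delta_R(w): the state reached from the initial state after reading w."""
        return self.extend(self.initial_state, word)

    def __call__(self, word):
        """f_R(w): the output function applied to the configuration at w."""
        return float(self.output(self.configuration(word)))

# Toy instance over the alphabet {a, b}; purely illustrative.
EMBED = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
toy = AbstractRNN(
    initial_state=np.zeros(2),
    transition=lambda x, s: np.tanh(0.5 * x + EMBED[s]),
    output=lambda x: float(x.sum()),
)
print(toy.configuration("ab"), toy("ab"))
```

Any trained network can be wrapped this way as long as its hidden state can be read out after each input symbol.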
2.1 Angluin’s $L^*$ Algorithm
Angluin’s $L^*$ algorithm learns a given DFA $B$ by a series of membership and equivalence queries. We sketch the algorithm; see (?) for details. Its outline is in Fig. 2.
A membership query is a black-box observation of the DFA $B$: it feeds $B$ with a word $w\in\Sigma^*$ and obtains $f_B(w)\in\{\mathrm{true},\mathrm{false}\}$, i.e., whether $w$ is accepted by $B$.
The core of the algorithm is to construct the observation table $T$; see Fig. 3(a). The table has words as the row and column labels; its entries are either true or false. The row labels are called access words; the column labels are test words. We let $\mathcal{A},\mathcal{T}\subseteq\Sigma^*$ stand for the sets of access and test words. The entry of $T$ at row $u\in\mathcal{A}$ and column $v\in\mathcal{T}$ is given by $f_B(uv)$, a value that we can obtain from a suitable membership query.
Therefore we extend a table $T$ by a series of membership queries. We do so until $T$ becomes closed; this is the top loop in Fig. 2. A table $T$ is closed if, for any access word $u\in\mathcal{A}$ and $\sigma\in\Sigma$, there is an access word $u'\in\mathcal{A}$ such that
$$f_B(u\sigma v)=f_B(u'v)\quad\text{for each } v\in\mathcal{T}. \tag{1}$$
The closedness condition essentially says that the role of the extended word $u\sigma$ is already covered by some existing word $u'\in\mathcal{A}$. The notion of “role” here, formalized in (1), is a restriction of the well-known Myhill–Nerode relation, from all words to $\mathcal{T}$.
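The closedness loop described above can be sketched as follows. This is a minimal illustration rather than the full $L^*$ algorithm: it only grows the set of access words until the table is closed, and it omits equivalence queries and the handling of counterexamples. The names `row`, `find_unclosed`, `close_table` and the example language (words with an even number of `a`) are assumptions made for the sketch.

```python
from itertools import product

def row(member, u, tests):
    """Row of the observation table for the word u: (f_B(uv)) for each test word v."""
    return tuple(member(u + v) for v in tests)

def find_unclosed(member, access, tests, alphabet):
    """Return an extension u+sigma whose row matches no access word's row, or None."""
    rows = {row(member, u, tests) for u in access}
    for u, sigma in product(access, alphabet):
        if row(member, u + sigma, tests) not in rows:
            return u + sigma
    return None

def close_table(member, access, tests, alphabet):
    """Add access words (via membership queries) until the table is closed."""
    access = list(access)
    while (witness := find_unclosed(member, access, tests, alphabet)) is not None:
        access.append(witness)
    return access

# Membership oracle for a toy regular language: an even number of 'a'.
def even_a(word):
    return word.count("a") % 2 == 0

print(close_table(even_a, [""], [""], "ab"))
```

In this toy run the closed table has two distinct rows, matching the two states of the minimal DFA for the language.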
A closed table $T$ induces a DFA $A$ (Fig. 2), much like in the Myhill–Nerode theorem. We note that the resulting DFA $A$ is necessarily minimized. The DFA $A$ undergoes an equivalence query that asks if $A$ and $B$ are equivalent (written $A\cong B$); an equivalence query is answered with a counterexample (i.e., $w\in\Sigma^*$ such that $f_A(w)\neq f_B(w)$) if $A\not\cong B$.
The $L^*$ algorithm is a deterministic learning algorithm (at least in its original form), unlike many recent learning algorithms that are statistical. The greatest challenge in practical use of the $L^*$ algorithm is to answer equivalence queries. When $B$ is a finite automaton that generates a regular language, there is a complete algorithm for deciding the language equivalence $A\cong B$. However, if we use a more expressive model in place of a DFA $B$, checking $A\cong B$ becomes a nontrivial task.
2.2 $L^*$ Algorithm for WFA Learning
The classic $L^*$ algorithm for learning DFAs has seen a weighted extension (?): it learns a WFA $B$, again via a series of membership and equivalence queries. The overall structure of the WFA learning algorithm stays the same as in Fig. 2; here we highlight major differences.
Firstly, the entries of an observation table $T$ are now real numbers, reflecting the fact that the value $f_B(w)$ for a WFA $B$ is in $\mathbb{R}$ instead of in $\{\mathrm{true},\mathrm{false}\}$ (see Def. 3). An example of an observation table is given in Fig. 3(b).
Secondly, the notion of closedness is adapted to the weighted (i.e., linear-algebraic) setting, as follows. A table $T$ is closed if, for any access word $u\in\mathcal{A}$ and $\sigma\in\Sigma$, the vector $\bigl(f_{B}(u\sigma v)\bigr)_{v\in\mathcal{T}}\in\mathbb{R}^{|\mathcal{T}|}$ can be expressed as a linear combination of the vectors in $\bigl\{\bigl(f_{B}(u'v)\bigr)_{v\in\mathcal{T}}\bigm| u'\in\mathcal{A}\bigr\}$. Note that the vector $\bigl(f_{B}(u'v)\bigr)_{v\in\mathcal{T}}^{\top}$ in the latter set is precisely the row vector in $T$ at row $u'$.
For example, the table in Fig. 3(b) is obviously closed, since the three row vectors are linearly independent and thus span the whole $\mathbb{R}^3$. The above definition of closedness comes naturally in view of Def. 2. For a WFA, a configuration (during its execution) is not a single state, but a weighted superposition of states. The closedness condition asserts that the role of $u\sigma$ is covered by a suitable superposition of words $u'\in\mathcal{A}$. The construction of the WFA $A$ from a closed table $T$ (see Fig. 2) reflects this intuition. See (?). We note that the resulting WFA $A$ is minimal, much like in §2.1.
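The weighted closedness condition is a statement about linear spans, so it can be checked with a rank computation: the table is closed exactly when stacking the rows of all extensions $u\sigma$ under the rows of the access words does not increase the rank. The sketch below is one way to do this with NumPy; the helper names and the example function (the number of occurrences of `a`, which a two-state WFA can compute) are assumptions for illustration.

```python
import numpy as np

def table_rows(f, words, tests):
    """Matrix whose row for u is (f(uv)) for each test word v."""
    return np.array([[f(u + v) for v in tests] for u in words], dtype=float)

def is_closed(f, access, tests, alphabet, tol=1e-8):
    """Closed iff every extended row lies in the span of the access rows."""
    base = table_rows(f, access, tests)
    extended = table_rows(f, [u + s for u in access for s in alphabet], tests)
    base_rank = np.linalg.matrix_rank(base, tol=tol)
    full_rank = np.linalg.matrix_rank(np.vstack([base, extended]), tol=tol)
    return full_rank == base_rank

def count_a(word):
    """Toy weighted language: the number of occurrences of 'a'."""
    return float(word.count("a"))

print(is_closed(count_a, [""], ["", "a"], "ab"))
print(is_closed(count_a, ["", "a"], ["", "a"], "ab"))
```

With the single access word the row of `a` is not a multiple of the row of the empty word, so the table is not closed; adding `a` as an access word gives two independent rows that span every extension.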
In the literature (?), an observation table is presented as a so-called Hankel matrix. This opens the way to further extensions of the method, such as an approximate learning algorithm via the singular-value decomposition (SVD).
3 WFA Extraction from an RNN
We present our main contribution, namely a procedure that extracts a WFA from a given RNN. After briefly describing its outline (that is the same as Fig. 2), we focus on the greatest challenge of answering equivalence queries.
3.1 Procedure Outline
Our procedure uses the weighted $L^*$ algorithm sketched in §2.2. As we discussed in §2.1, the greatest challenge is how to answer equivalence queries; our novel approach is to use regression and synthesize what we call a configuration abstraction function, which maps the state space $\mathbb{R}^d$ of the RNN to the configuration space $\mathbb{R}^{Q_A}$ of the WFA. See Fig. 4.
The outline of our procedure thus stays the same as in Fig. 2, but we need to take care of noisy outputs from an RNN because they prevent the observation table from being precisely closed (in the sense of §2.2). To resolve this issue, we use a noise-tolerant algorithm (?) which approximately determines whether the observation table is closed. This approximate algorithm employs SVD and cuts off singular values that are smaller than a threshold called rank tolerance. In the choice of a rank tolerance, we face a trade-off between accuracy and regularity. If the rank tolerance is large, the WFA learning algorithm tends to ignore the counterexample given by an equivalence query and produces an inaccurate WFA. We use a heuristic to decrease the rank tolerance when two or more equivalence queries return the same counterexample. See Appendix A.1 for details.
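The effect of a rank tolerance can be illustrated with a few lines of NumPy. The sketch below counts the singular values of a table that exceed a threshold; treating the tolerance as an absolute cutoff is an assumption made here for simplicity, and the function name and the small noisy table are hypothetical.

```python
import numpy as np

def numerical_rank(table, rank_tolerance):
    """Number of singular values that are larger than the rank tolerance."""
    singular_values = np.linalg.svd(np.asarray(table, dtype=float), compute_uv=False)
    return int(np.sum(singular_values > rank_tolerance))

# A rank-1 table perturbed by a small amount of noise in one entry.
noisy_table = [[1.0, 2.0],
               [2.0, 4.001]]

print(numerical_rank(noisy_table, rank_tolerance=1e-2))  # noise is cut off
print(numerical_rank(noisy_table, rank_tolerance=1e-6))  # noise counts as a dimension
```

A large tolerance yields a smaller, more regular automaton but may discard genuine structure, while a small tolerance keeps more structure at the cost of fitting noise.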
3.2 Equivalence Queries for WFAs and RNNs
Algorithm 1 shows our procedure to answer an equivalence query. The procedure Ans-EqQ is the main procedure, and it returns either a verdict of equivalence or a counterexample word (as in Fig. 2). It calls the auxiliary procedure Consistent?, which decides if we refine the current configuration abstraction function in Fig. 4 (Line 12).
Best-First Search for a Counterexample
The procedure Ans-EqQ is essentially a best-first search for a counterexample, that is, a word such that the difference of the output values from the RNN $R$ and from the WFA $A$ is larger than the error tolerance. We first outline Ans-EqQ and then go into the technical detail.
We manage counterexample candidates by a priority queue, which gives a higher priority to a candidate more likely to be a counterexample. The already investigated words are in the set $\mathtt{visited}$. The queue initially contains only the empty word (Line 5).
We search for a counterexample in the main loop starting from Line 6. Let $h$ be a word popped from the queue, that is, the candidate most likely to be a proper counterexample among the words in the queue. If $h$ is a counterexample, Ans-EqQ returns it (Lines 8–9). Otherwise, after refining the configuration abstraction function (Lines 10–12), new candidates $h\sigma$, the extensions of $h$ with a character $\sigma\in\Sigma$, are pushed to the queue with their priorities only if the neighborhood of $h$ in the state space of the WFA does not contain sufficiently many already investigated words (i.e., only if it has not been investigated sufficiently; Lines 15–18). This is because, if the neighborhood of $h$ has been investigated sufficiently, we expect that the neighborhoods of the new candidates $h\sigma$ have also been investigated and, therefore, that the words $h\sigma$ do not have to be investigated further. We use the configuration abstraction function: 1) to decide if the neighborhood of $h$ has been investigated sufficiently (Line 15); and 2) to calculate the priorities of the new candidates (Line 17). Note that we add $h$ to $\mathtt{visited}$ in Line 13 since it has been investigated there. If none of the candidates is a counterexample, Ans-EqQ reports that the WFA and the RNN are equivalent (Line 19).
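The control flow of this best-first search can be sketched in Python as below. The sketch is a deliberate simplification and not Algorithm 1 itself: it omits the refinement of the configuration abstraction function, takes the priority as a caller-supplied function, and measures neighborhoods with a plain Euclidean ball around the WFA configuration. All parameter names (`priority`, `tolerance`, `threshold`, `radius`) and the toy functions are assumptions. As with the procedure in the text, the loop is not guaranteed to terminate when no counterexample exists.

```python
import heapq

import numpy as np

def answer_equivalence_query(f_rnn, f_wfa, wfa_config, alphabet,
                             priority, tolerance, threshold, radius):
    """Best-first search for a word on which the RNN and the WFA disagree."""
    queue = [(0.0, "")]   # min-heap of (negated priority, word); starts with the empty word
    visited = []          # words that have already been investigated
    while queue:
        _, word = heapq.heappop(queue)
        if abs(f_rnn(word) - f_wfa(word)) > tolerance:
            return word   # counterexample found
        visited.append(word)
        here = wfa_config(word)
        # Count investigated words whose configuration lies near the current one.
        crowd = sum(np.linalg.norm(wfa_config(h) - here) < radius for h in visited)
        if crowd <= threshold:   # neighborhood not yet investigated sufficiently
            for symbol in alphabet:
                candidate = word + symbol
                heapq.heappush(queue, (-priority(candidate), candidate))
    return None           # queue exhausted: deem the two models equivalent

# Toy models: the "WFA" counts occurrences of 'a'; the "RNN" saturates at 2.
f_wfa = lambda w: float(w.count("a"))
f_rnn = lambda w: float(min(w.count("a"), 2))
wfa_config = lambda w: np.array([1.0, float(w.count("a"))])

result = answer_equivalence_query(f_rnn, f_wfa, wfa_config, "ab",
                                  priority=len, tolerance=0.5,
                                  threshold=3, radius=0.5)
print(result)
```

The quality of the search depends almost entirely on the priority function, which is where the regression-based abstraction enters in the actual procedure.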
Configuration Abstraction Function
To serve the above purposes, the property we expect from the configuration abstraction function is as follows:
[TABLE]
See Def. 2 and 7 for $\delta_A$ and $\delta_R$, respectively. To synthesize such a function, we employ regression using the data $\bigl\{\bigl(\delta_{R}(h'),\delta_{A}(h')\bigr)\in\mathbb{R}^{d}\times\mathbb{R}^{Q_{A}}\bigm| h'\in\mathtt{visited}\bigr\}$. See Line 12. Note that we can use any regression method to learn it.
We refine the configuration abstraction function during the best-first counterexample search. Specifically, in Line 10, we use the procedure Consistent? to check if the current function, obtained by regression, is consistent with a counterexample candidate $h$. The consistency means that the image of $\delta_R(h)$ under the function and $\delta_A(h)$ are close to each other, which is formalized by the relation defined later. If the check fails (i.e., if Consistent? answers negatively), we refine the function by regression to make it consistent with $h$ (and the already investigated words in $\mathtt{visited}$). See Line 12.
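The regression step can be sketched as follows. The experiments use the Gaussian process regression from scikit-learn; to keep the example dependency-light, the sketch below instead implements kernel ridge regression (which the text names as an alternative) in closed form with NumPy. The RBF kernel, the hyperparameter values, the function names, and the three data points are all assumptions for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(Y ** 2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def fit_abstraction(rnn_states, wfa_configs, gamma=1.0, ridge=1e-6):
    """Fit a map from RNN states (R^d) to WFA configurations (R^{Q_A})."""
    X = np.asarray(rnn_states, dtype=float)
    Y = np.asarray(wfa_configs, dtype=float)
    K = rbf_kernel(X, X, gamma)
    coef = np.linalg.solve(K + ridge * np.eye(len(X)), Y)

    def abstraction(state):
        x = np.atleast_2d(np.asarray(state, dtype=float))
        return (rbf_kernel(x, X, gamma) @ coef)[0]

    return abstraction

# Hypothetical training pairs (delta_R(h'), delta_A(h')) for three visited words.
rnn_states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
wfa_configs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p = fit_abstraction(rnn_states, wfa_configs)
print(p([1.0, 0.0]))
```

Refining the function during the search then amounts to refitting with the enlarged set of visited words.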
Consistency Checking by Consistent?
The procedure Consistent? in Line 10 is defined as follows: it returns if there exists such that
[TABLE]
and returns otherwise. The basic idea of Consistent? is to return if because it means the violation of the desired property (2). However, to reduce the run-time cost of refining and to prevent learning from outliers, we adopt the alternative approach presented above, which is taken from ? (?).
The existence of satisfying the condition (3) approximates the violation of the property (2) in the following sense. If there is a word satisfying the first part of the condition (3), has to be refined because we find the property (2) violated with . The second part of the condition (3) means that seems to behave similarly to according to the configuration abstraction function . We expect the second part to prevent from being refined with outliers because the neighborhoods of the words used for refining must have been investigated twice or more.
Equivalence Relation
For a given WFA $A$, we define the relation in the configuration space $\mathbb{R}^{Q_A}$ by
[TABLE]
where $\beta_A$ is the final vector of the WFA $A$. It satisfies the following.
1. If the relation holds for configurations $x$ and $y$ of the WFA $A$, the difference of the output values for $x$ and $y$ is smaller than the error tolerance, that is, $|x\cdot\beta_A-y\cdot\beta_A|$ stays below that tolerance.
2. If the $i$-th element of the final vector $\beta_A$ becomes large, the neighborhood of a configuration shrinks in the direction of the $i$-th axis.
3. The neighborhood defined by the relation is an ellipsoid, a reasonable variant of an open ball as a neighborhood.
A Heuristic for Equivalence Checking of a WFA and an RNN
Although the best-first search above works well, we introduce an additional heuristic to further improve the run-time performance of our algorithm. The heuristic deems $A$ and $R$ to be equivalent if a word $h$ popped from the queue is so long that it cannot occur in the training of the RNN. This heuristic is based on the expectation that, when an impossible word is the most likely to be a counterexample, all possible words are unlikely to be counterexamples, and so $A$ and $R$ are considered to be equivalent. This heuristic is applied immediately after popping $h$ (Line 7), as follows: we suppose that the maximum length of possible words is given; if the length of $h$ exceeds it, Ans-EqQ reports equivalence. We confirm the usefulness of this heuristic empirically in §4.
Termination of the Procedure
Algorithm 1 does not always terminate in a finite amount of time. If the procedure does not find any counterexample at Line 9 and the points are so scattered in the configuration space that the value at Line 15 is always small, words are always pushed to the queue at Line 18. In that case, the condition to exit the main loop at Line 6 is never satisfied.
3.3 Comparison with Weiss et al., 2018
Our WFA extraction method can be seen as a weighted extension of the $L^*$-based procedure (?) to extract a DFA from an RNN. Note that a WFA defines a function of type $\Sigma^*\to\mathbb{R}$ and a DFA defines a function of type $\Sigma^*\to\{\mathrm{true},\mathrm{false}\}$.
The main technical novelty of the method in (?) is how to answer equivalence queries. It features the clustering of the state space $\mathbb{R}^d$ of an RNN into finitely many clusters, using support-vector machines (SVMs). Each cluster of $\mathbb{R}^d$ is associated with a state of the DFA.
Our theoretical observation is that such clustering amounts to giving a function from $\mathbb{R}^d$ to $Q_A$. Moreover, for a DFA, $Q_A$ is the configuration space (as well as the state space, see Def. 5). Therefore, our WFA extraction method can be seen as an extension of the DFA extraction procedure in (?).
4 Experiments
We conducted experiments to evaluate the utility of our regression-based WFA extraction method. Specifically, we pose the following questions.
RQ1
Do the WFAs extracted by our algorithm approximate the original RNNs accurately?
RQ2
Does our algorithm work well with RNNs whose expressivity is beyond WFAs?
RQ3
Do the extracted WFAs run more efficiently, specifically in inference time, than given RNNs?
For RQ1, we compared the performance with a baseline algorithm (a straightforward adaptation of ? (?)’s algorithm). Here we focused on “automata-like” RNNs, that is, those RNNs trained from an original WFA . For RQ2, we used an RNN that exhibits “context-free” behaviors.
Experimental Setting. We implemented our method in Python. We write for the algorithm where the concentration threshold in Ans-EqQ is set to ; other parameters are fixed as follows: error tolerance and heuristic parameter . We adopt the Gaussian process regression (GPR) provided by scikit-learn as a configuration abstraction function (we also tried the kernel ridge regression, but GPR worked better empirically). Throughout the experiments, our RNNs are 2-layer LSTM networks with dimension size 50, implemented in TensorFlow. The experiments were run on a g3s.xlarge instance on Amazon Web Services (May 2019), with an NVIDIA Tesla M60 GPU, 4 vCPUs of Xeon E5-2686 v4 (Broadwell), and 8 GiB of memory.
4.1 RQ1: Extraction from RNNs Modeling WFAs
This experiment examines how well our algorithm works for RNNs modeling WFAs. To do so, we first train RNNs using randomly generated WFAs; we call those WFAs the origins. Then, we evaluate our algorithm against a baseline on two points: accuracy of the extracted WFAs against the trained RNNs, and running times of the algorithms. We report the results after presenting the details of the baseline, how to train RNNs, and how to evaluate the two algorithms.
The Baseline Algorithm: Like our extraction algorithm, the baseline algorithm , which is parameterized over an integer , is a straightforward adaptation of ? (?)’s algorithm. The difference is that equivalence queries in the baseline are implemented by breadth-first search, as follows. Let $R$ be a given RNN and $A$ be a WFA being constructed. For each equivalence query, the baseline searches for a word such that (where ), in the breadth-first manner. If such a word is found, it is returned as a counterexample. The search is restricted to the first words, where is the index of the counterexample word found in the previous equivalence query. If no counterexample is found within this search space, the baseline algorithm deems $A$ to be equivalent to $R$. Obviously, if is larger, more counterexample candidates are investigated.
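The breadth-first equivalence query of the baseline can be sketched as below. The sketch assumes that a counterexample is a word on which the two outputs differ by more than a tolerance, and it simplifies the search budget to a fixed number of words counted from the empty word (the text ties the budget to the index of the previous counterexample). The function and parameter names and the toy models are hypothetical.

```python
from itertools import count, islice, product

def words_breadth_first(alphabet):
    """Enumerate all words over the alphabet, shortest first."""
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def bfs_counterexample(f_rnn, f_wfa, alphabet, tolerance, budget):
    """Check the first `budget` words in breadth-first order for a disagreement."""
    for word in islice(words_breadth_first(alphabet), budget):
        if abs(f_rnn(word) - f_wfa(word)) > tolerance:
            return word
    return None  # nothing found within the budget: deem the models equivalent

# Toy models: the "WFA" counts occurrences of 'a'; the "RNN" saturates at 2.
f_wfa = lambda w: float(w.count("a"))
f_rnn = lambda w: float(min(w.count("a"), 2))

print(list(islice(words_breadth_first("ab"), 7)))
print(bfs_counterexample(f_rnn, f_wfa, "ab", tolerance=0.5, budget=8))
```

Because the number of words grows exponentially with their length, a fixed budget only reaches short words, which is the limitation the best-first search is meant to address.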
Target RNNs trained from WFAs Table 1 reports the accuracy of the extracted WFAs , where the target RNNs are obtained from original WFAs in the following manner. Given an alphabet of a designated size (the leftmost column), we first generated a WFA such that 1) the state-space size is as designated (the leftmost column), and 2) the initial vector , the final vector , and the transition matrix are randomly chosen (with normalization so that its outputs are in ). Then we constructed a dataset by sampling 9000 words such that and ; this was used to train an RNN , on the set of input-output pairs of and for 10 epochs.
A simple way to sample words is by the uniform distribution. With covering the input space of the WFA uniformly, we expect the resulting RNN to inherit properties of well. The top table in Table 1 reports the results in this “more WFA-like” setting.
However, in many applications, the input domain of the data used for training RNNs is nonuniform, sometimes even excluding some specific patterns. To evaluate our method in such realistic settings, we conducted another set of experiments whose results are in the bottom of Table 1. Specifically, for training from , we used a dataset that only contains those words which satisfy the following condition: if (), then . For example, for , and may be in , but may not.
Evaluation In order to evaluate accuracy, we calculated the mean square error (MSE) of the extracted WFA against the RNN , using a dataset of words sampled from an appropriate distribution, namely the one used in training the RNN from . The dataset is sampled so that it does not overlap with the training dataset for .
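The accuracy measure is a plain mean square error over held-out words, which can be sketched as follows. The helper name, the toy models, and the word lists are assumptions; the filtering step only illustrates the requirement that the evaluation words do not overlap with the training words.

```python
def mean_squared_error(f_wfa, f_rnn, words):
    """MSE of the extracted WFA against the RNN over a collection of words."""
    words = list(words)
    return sum((f_wfa(w) - f_rnn(w)) ** 2 for w in words) / len(words)

# Toy models: the "WFA" counts occurrences of 'a'; the "RNN" saturates at 2.
f_wfa = lambda w: float(w.count("a"))
f_rnn = lambda w: float(min(w.count("a"), 2))

training_words = {"ab", "ba"}
candidates = ["", "a", "ab", "aaa", "aaaa"]
# Keep only words that were not used for training the RNN.
held_out = [w for w in candidates if w not in training_words]

print(mean_squared_error(f_wfa, f_rnn, held_out))
```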
Results and Discussions. In the experiments in Table 1, we considered 8 configurations for generating the original WFA (the leftmost column). The unit of MSEs is ; given also that the outputs of the original WFAs are normalized to , we can say that the MSEs are small enough.
In the top table in Table 1 (the “more WFA-like” setting), and achieved the first- and second-best performance in terms of accuracy, respectively (see the “Total” row). More generally, we can find the trend that, as an extraction runs longer, it performs better. We conjecture its reason as follows. Recall that all the RNNs are trained on words sampled from the uniform distribution. This means that all words would be somewhat informative to approximate the RNNs. As a result, the performance is influenced more by the amount of counterexamples (i.e., by how long extraction takes) than by their “qualities.”
The exception of this trend is , which took a longer time but performed worse than , , and . In particular, performed well for smaller alphabets () but not so when . The role of the parameter in (i.e., in Algorithm 1) is a threshold to control how many words configuration regions of a WFA are investigated with. Thus, we conjecture that the use of too small limits the input space to be investigated excessively, which is more critical as the input space is larger, eventually biasing the counterexamples (in Algorithm 1), though the RNNs are trained on the uniform distribution, and making refinement of WFAs less effective.
In the bottom table in Table 1 (the “realistic” setting), performs significantly better than the other (and the best among all the procedures) in terms of accuracy. This is the case even for a large alphabet (). This indicates that, when an RNN is trained on a nonuniform dataset, making the investigated input space larger by big could even degrade accuracy. A possible reason for this degradation is as follows. Some words (such as ) are prohibited in the sample set , and the behavior of the RNN on those prohibited words is unpredictable. Therefore, those prohibited words are unlikely to be useful for refining a WFA. The use of small could prevent such meaningless (or even harmful) counterexamples from being investigated. This discussion raises another question: how can we find an optimal ? We leave it as future work.
Let us briefly discuss the sizes of the extracted WFAs. The general trend is that the extracted WFAs have a few times more states than the original WFAs used in training . For example, in the setting of the top table in Table 1, for and , the average number of states of the extracted was 38.2.
4.2 RQ2: Expressivity beyond WFAs
We conducted experiments to examine how well our method works for RNNs modeling languages that cannot be expressed by any WFA. Specifically, we used an RNN that models the following function : , if all the parentheses in are balanced (here is the depth of the deepest balanced parentheses in ); and otherwise. This is a weighted variant of a (non-regular) language of balanced parentheses. For instance, , , and .
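To make the shape of this target function concrete, the sketch below computes the two ingredients the definition relies on: whether the parentheses of a word are balanced, and the depth of the deepest nesting. The concrete output values are not reproduced here, so the mapping from depth to weight (`weight_of_depth`) and the value for unbalanced words (`unbalanced_value`) are left as hypothetical parameters; all function names are our own.

```python
from typing import Callable, Optional


def max_depth_if_balanced(word: str) -> Optional[int]:
    """Return the deepest nesting level of `word`, or None if unbalanced.

    Characters other than "(" and ")" are ignored.
    """
    depth = 0
    deepest = 0
    for ch in word:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" without a matching "("
                return None
    return deepest if depth == 0 else None


def weighted_paren(
    word: str,
    weight_of_depth: Callable[[int], float],  # assumption: supplied by the caller
    unbalanced_value: float,                  # assumption: supplied by the caller
) -> float:
    """Weight of `word`: a function of the depth if balanced, a constant otherwise."""
    depth = max_depth_if_balanced(word)
    return unbalanced_value if depth is None else weight_of_depth(depth)
```

Because the required counter `depth` is unbounded, no finite-state device can compute this function exactly, which is why the language lies beyond WFAs.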
We trained an RNN as follows. We generated datasets and , and trained an RNN on the set of input-output pairs of and . The dataset consists of randomly generated words where all the parentheses are balanced; is constructed similarly, except that we apply suitable mutation to each word, which most likely makes the parentheses unbalanced. See Appendix B.1 for details.
Fig. 5 shows the WFAs extracted from . Remarkable observations here are as follows.
- The shapes of the WFAs, obtained by ignoring clearly negligible weights, give rise to NFAs that recognize balanced parentheses up to a certain depth.
- As the parameter in grows, the recognizable depth bound grows: depth one with ; and depth two with .
We believe these observations demonstrate important features, as well as limitations, of our method. Overall, the extracted WFAs expose interpretable structures hidden in an RNN: the NFA structures in Fig. 5 are human-interpretable (they are easily seen to encode bounded balancedness) and machine-processable (e.g., by determinization and minimization). It is also suggested that the parameter gives us flexibility in the trade-off between extraction cost and accuracy. At the same time, we can only obtain a truncated and regularized version of the RNN structure; this is an inevitable limitation as long as we use the formalism of WFAs.
We also note that, in each of the two extracted WFAs, the transition matrices are similar for all (the entries at the same position have the same order). This is as expected, too, since the function does not distinguish the characters .
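The step of "ignoring clearly negligible weights" can be sketched as a simple thresholding of the transition matrices. This is only an illustration under our own assumptions: the function name `nfa_shape`, the dictionary representation of the transition matrices, and the explicit `threshold` are hypothetical, and the text does not state how negligible weights were identified.

```python
from typing import Dict, Set, Tuple

import numpy as np


def nfa_shape(
    transitions: Dict[str, np.ndarray],
    threshold: float,
) -> Dict[str, Set[Tuple[int, int]]]:
    """Keep an edge q --a--> r whenever |A_a[q, r]| exceeds `threshold`."""
    shape = {}
    for symbol, matrix in transitions.items():
        rows, cols = np.nonzero(np.abs(matrix) > threshold)
        shape[symbol] = {(int(q), int(r)) for q, r in zip(rows, cols)}
    return shape
```

The resulting edge sets describe an NFA transition relation, which can then be passed to standard determinization and minimization procedures.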
4.3 RQ3: Accelerating Inference Time
We conducted experiments about inference time, comparing the original RNNs and the WFAs that we extracted from . We used the same RNNs and WFAs as in §4.1, where the latter are extracted using and . We note that the inference of RNNs utilizes GPUs while that of WFAs is solely done by CPUs.
We observed that inference with the extracted WFAs was about 1,300 times faster than with the target RNNs , taking the average over different settings (Appendix B.3). This demonstrates the potential use of the extracted WFAs as a computationally cheaper surrogate for RNNs. We attribute the acceleration to the following: 1) WFAs use only linear computation while RNNs involve nonlinear ones; and 2) overall, extracted WFAs are smaller in size. Provided that the accuracy of extracted WFAs can be high (as we observed in §4.1), we believe the replacement of RNNs by WFAs is a viable option in some application scenarios.
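To illustrate why WFA inference involves only linear computation, the sketch below evaluates a WFA under the standard semantics, in which the value of a word is the initial vector, multiplied by one transition matrix per symbol, multiplied by the final vector. The function name `wfa_value` and the toy two-state automaton in the usage example are our own and are not taken from the experiments.

```python
from typing import Dict, Sequence

import numpy as np


def wfa_value(
    initial: Sequence[float],
    transitions: Dict[str, np.ndarray],
    final: Sequence[float],
    word: str,
) -> float:
    """Compute initial^T * A_{w1} * ... * A_{wn} * final."""
    state = np.asarray(initial, dtype=float)
    for symbol in word:
        # One vector-matrix product per symbol; no nonlinearity is involved.
        state = state @ transitions[symbol]
    return float(state @ np.asarray(final, dtype=float))


# Toy WFA (hypothetical) that counts the occurrences of "a" in a word.
toy_transitions = {
    "a": np.array([[1.0, 1.0], [0.0, 1.0]]),
    "b": np.eye(2),
}
count_a = wfa_value([1.0, 0.0], toy_transitions, [0.0, 1.0], "aba")
```

Each input symbol costs a single product with a matrix whose size is the number of states, so a small state space translates directly into cheap inference.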
5 Conclusions and Future Work
We proposed a method that extracts a WFA from an RNN, focusing on RNNs that take a word and return a real value. We used regression to investigate and abstract the internal states of RNNs. We experimentally evaluated our method, comparing its performance with a baseline whose equivalence queries are based on simple breadth-first search.
One direction for future work is a detailed comparison with other methods for model compression. Another is to use machine learning methods to find a counterexample in the equivalence query, such as reinforcement learning (?), adversarial attacks (?), and acquisition functions of GPR. Finally, we need a means to optimize the parameter of our method for a specific problem. It may also be helpful to extend our method so that the investigated words can be restricted to a fixed language ; if identifies the input space of the training dataset for RNNs, we could avoid investigating the input space on which the RNNs are not trained, and could therefore seek only “meaningful” counterexamples even when using a large .
6 Acknowledgments
Thanks are due to Mahito Sugiyama and the anonymous reviewers of AAAI for a lot of useful comments. This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JSPS KAKENHI Grant Numbers JP15KT0012, JP18J22498, JP19K20247, JP19K22842, and JST-Mirai Program Grant Number JPMJMI18BA, Japan.
Appendix A Detail of Our WFA Extraction
A.1 On Rank Tolerance
The construction step calculates a minimal WFA that is compatible with the observed data . It relies on rank calculation, which is done by computing the SVD of the matrix and counting the number of non-zero singular values. The threshold used to decide whether a singular value is zero is called the rank tolerance. A small rank tolerance generally yields accurate learning but can cause overfitting to short words and huge errors for long words. A large rank tolerance yields coarser learning but prevents such overfitting. To balance the rank tolerance, we start from a large initial rank tolerance , and if it is too large, we decay it by multiplying . We know that the rank tolerance is too large if the equivalence query returns the same counterexample twice, because this means that the counterexample was ignored. Overall, we obtain the WFA extraction procedure (Algorithm 2).
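The rank computation and the decay rule described above can be sketched as follows. This is an illustration rather than the procedure of Algorithm 2: the function names, the `decay` factor, and the way counterexamples are compared are assumptions, and the initial tolerance and multiplier used in the paper are not reproduced here.

```python
import numpy as np


def numerical_rank(matrix: np.ndarray, rank_tolerance: float) -> int:
    """Count the singular values that exceed the rank tolerance."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(singular_values > rank_tolerance))


def next_rank_tolerance(
    tolerance: float,
    counterexample: str,
    previous_counterexample: str,
    decay: float,  # assumption: a factor in (0, 1) chosen by the user
) -> float:
    """Shrink the tolerance when the equivalence query repeats a counterexample."""
    if counterexample == previous_counterexample:
        # The counterexample was ignored, so the tolerance was too large.
        return tolerance * decay
    return tolerance
```

For example, a matrix with singular values 1 and 1e-6 has numerical rank 1 under a tolerance of 1e-3 and rank 2 under a tolerance of 1e-9, which shows how a smaller tolerance admits more states.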
Appendix B Detail of the Experiments
B.1 On Training Data Generation for
We made the training data for as follows.
1. We make 5000 words of random balanced parentheses made only of . There is a one-to-one correspondence between words of balanced parentheses of length and paths from the bottom-left to the top-right in the grid of size whose bottom-right half is removed, so we can obtain such random words by generating the paths randomly and converting them into the words. For example, “(())” or “(()())” can be made.
2. We insert random characters in into the words generated in Step 1. This generates 5000 random balanced words made of . For example, “(0(1))” or “((12340)())” can be made.
3. We run the same procedure as Step 1 and obtain 5000 words of random balanced parentheses.
4. We mutate the words from Step 3 and make them into 5000 random unbalanced parentheses made only of . The mutation rules are as follows: 1) duplicate a random character; 2) delete a random character; and 3) exchange a random pair of adjacent characters. These rules are applied repeatedly, throwing a fair coin each time, until the coin comes up heads. Note that the mutation can turn a balanced word into another balanced word. For example, “(()”, “((((”, or “()” can be made (only the last one is balanced).
5. We insert random characters in into the words generated in Step 4. This generates 5000 random unbalanced words made of .
6. We combine the results of Steps 2 and 5 and get 10000 words. Roughly half of the words are balanced and the rest are unbalanced. We pick 9000 random words from them and use them as the training data; the remaining 1000 are used as the test data.
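The generation steps above can be sketched as follows, under several assumptions that we state explicitly. The balanced-word sampler below is a simple stand-in and is not uniform, unlike the grid-path construction; a mutation rule is chosen uniformly at random and applied at least once before each coin toss; and the number of inserted characters (`count`) and the alphabet are parameters. All function names are hypothetical.

```python
import random


def is_balanced(word: str) -> bool:
    """Check balancedness, ignoring characters other than parentheses."""
    depth = 0
    for ch in word:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def random_balanced(pairs: int, rng: random.Random) -> str:
    """Random balanced word with `pairs` pairs (simple sampler, NOT uniform)."""
    word, open_count, remaining_open = [], 0, pairs
    while remaining_open > 0 or open_count > 0:
        can_open = remaining_open > 0
        can_close = open_count > 0
        if can_open and (not can_close or rng.random() < 0.5):
            word.append("(")
            open_count += 1
            remaining_open -= 1
        else:
            word.append(")")
            open_count -= 1
    return "".join(word)


def mutate(word: str, rng: random.Random) -> str:
    """Apply random mutation rules until a fair coin comes up heads."""
    chars = list(word)
    while True:
        rule = rng.choice(("duplicate", "delete", "swap"))
        if rule == "duplicate" and chars:
            i = rng.randrange(len(chars))
            chars.insert(i, chars[i])
        elif rule == "delete" and chars:
            del chars[rng.randrange(len(chars))]
        elif rule == "swap" and len(chars) >= 2:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        if rng.random() < 0.5:  # heads: stop mutating
            return "".join(chars)


def insert_characters(word: str, alphabet: str, count: int, rng: random.Random) -> str:
    """Insert `count` random characters from `alphabet` at random positions."""
    chars = list(word)
    for _ in range(count):
        chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
    return "".join(chars)
```

Since a mutated word can remain balanced, labels should be computed from the final word (for instance with `is_balanced`) rather than inferred from which step produced it.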
B.2 Detailed WFAs Extracted from
The WFA Extracted by
Fig. 6 illustrates the WFA extracted from the RNN trained by by . The initial and final vectors and the transition matrices are in Figs. 7 and 8.
The WFA Extracted by
Fig. 9 illustrates the WFA extracted from the RNN trained by by . The initial and final vectors and the transition matrices are in Figs. 10, 11, and 12.
B.3 Inference Time of the Target RNNs and the Extracted WFAs
On average, the inference time of the target RNNs was 29.97519233 milliseconds, while that of the extracted WFAs was 0.023052549 milliseconds. Therefore, on average, the inference of the extracted WFAs was about 1300.298397 times faster than that of the target RNNs.
References
- [Angluin 1987] Angluin, D. 1987. Learning regular sets from queries and counterexamples. Inf. Comput. 75(2):87–106.
- [Ayache, Eyraud, and Goudian 2018] Ayache, S.; Eyraud, R.; and Goudian, N. 2018. Explaining black boxes on sequential data using weighted automata. In Unold, O.; Dyrka, W.; and Wieczorek, W., eds., Proc. ICGI 2018, volume 93 of Proceedings of Machine Learning Research, 81–103. PMLR.
- [Baier and Katoen 2008] Baier, C., and Katoen, J.-P. 2008. Principles of Model Checking. The MIT Press.
- [Balle and Mohri 2015] Balle, B., and Mohri, M. 2015. Learning weighted automata. In Maletti, A., ed., Proc. CAI 2015, volume 9270 of Lecture Notes in Computer Science, 1–21. Springer.
- [Balle, Gourdeau, and Panangaden 2017] Balle, B.; Gourdeau, P.; and Panangaden, P. 2017. Bisimulation metrics for weighted automata. In Chatzigiannakis, I.; Indyk, P.; Kuhn, F.; and Muscholl, A., eds., Proc. ICALP 2017, volume 80 of LIPIcs, 103:1–103:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
- [Bramer 2013] Bramer, M. 2013. Ensemble Classification. London: Springer London. 209–220.
- [Bucila, Caruana, and Niculescu-Mizil 2006] Bucila, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proc. KDD 2006, 535–541.
- [Chaudhuri 2019] Chaudhuri, A. 2019. Visual and Text Sentiment Analysis through Hierarchical Deep Learning Networks. SpringerBriefs in Computer Science. Springer.
