Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces

Takamasa Okudono; Masaki Waga; Taro Sekiyama; Ichiro Hasuo

arXiv:1904.02931·cs.LG·November 21, 2019

TL;DR

This paper introduces a novel method for extracting weighted finite automata from RNNs using regression techniques, enhancing the interpretability and analysis of neural network internal states.

Contribution

It extends existing automaton extraction methods by incorporating regression for equivalence queries, enabling weighted automaton extraction from RNNs.

Findings

1. High accuracy in automaton extraction
2. Improved expressivity over previous DFA-based methods
3. Efficient extraction process demonstrated

Abstract

We present a method to extract a weighted finite automaton (WFA) from a recurrent neural network (RNN). Our algorithm is based on the WFA learning algorithm by Balle and Mohri, which is in turn an extension of Angluin's classic $\mathrm{L}^{*}$ algorithm. Our technical novelty is in the use of regression methods for the so-called equivalence queries, thus exploiting the internal state space of an RNN to prioritize counterexample candidates. This way we achieve a quantitative/weighted extension of the recent work by Weiss, Goldberg and Yahav that extracts DFAs. We experimentally evaluate the accuracy, expressivity and efficiency of the extracted WFAs.

Tables

Table 1: Experiment results, where we extracted a WFA $A$ from an RNN $R$ that is trained to mimic the original WFA $A^{\bullet}$. In each cell “n/m”, “n” denotes the average of MSEs between $A$ and $R$ (in units of $10^{-4}$), taken over five random WFAs $A^{\bullet}$ of the designated alphabet size $|\Sigma|$ and the state-space size $|Q_{A^{\bullet}}|$. “m” denotes the average running time (in seconds). The “Total” row describes the average over all the experiment settings. The highlighted cell designates the best performer in terms of errors. Timeout was set at 10,000 sec. $\mathbf{RGR}(2\text{--}5)$ are our regression-based methods; $\mathbf{BFS}(500\text{--}5000)$ are the baseline. $\mathbf{BFS}(5000)$ is added to compare the accuracy when the running time is much longer.

$(|\Sigma|, |Q_{A^{\bullet}}|)$	$\mathbf{RGR}(2)$	$\mathbf{RGR}(5)$	$\mathbf{BFS}(500)$	$\mathbf{BFS}(1000)$	$\mathbf{BFS}(2000)$	$\mathbf{BFS}(3000)$	$\mathbf{BFS}(5000)$
$(4, 10)$	2.17 / 286	2.39 / 338	26.8 / 165	9.77 / 279	4.36 / 545	4.07 / 716	2.33 / 1390
$(6, 10)$	2.45 / 1787	2.54 / 1302	6.99 / 386	4.48 / 641	4.08 / 1218	3.15 / 1410	2.28 / 2480
$(10, 10)$	4.68 / 7462	4.46 / 5311	22.5 / 928	11.9 / 1562	5.90 / 3521	4.55 / 3638	3.55 / 5571
$(10, 15)$	5.62 / 8941	5.78 / 8564	21.2 / 2155	10.6 / 4750	7.87 / 5692	5.71 / 7344	5.27 / 7612
$(10, 20)$	3.70 / 7610	3.79 / 7799	6.24 / 2465	10.1 / 2188	6.13 / 3106	3.70 / 5729	3.63 / 7473
$(15, 10)$	7.34 / 9569	5.52 / 10000	13.5 / 3227	8.01 / 6765	6.07 / 7916	5.98 / 8911	6.17 / 8979
$(15, 15)$	8.44 / 10000	5.58 / 9981	16.3 / 2675	9.24 / 4850	7.28 / 5135	9.88 / 7204	6.44 / 8425
$(15, 20)$	9.16 / 7344	5.15 / 7857	13.7 / 2224	7.26 / 3823	6.60 / 5744	4.96 / 5674	4.01 / 9464
Total	5.45 / 6625	4.40 / 6394	15.9 / 1778	8.92 / 3107	6.04 / 4110	5.25 / 5078	4.21 / 6549

Equations

$$\delta_{A}(w)=\alpha_{A}^{\top}A_{b}A_{a}=(1\ 2\ 3)\left(\begin{array}{c c c}-1&1&0\\ 0&3&0\\ -2&4&0\end{array}\right)\left(\begin{array}{c c c}1&2&-1\\ 3&0&0\\ 0&4&0\end{array}\right)=(50\ {-14}\ 7)$$

$$f_{A}(w)=\alpha_{A}^{\top}A_{b}A_{a}\beta_{A}=(1\ 2\ 3)\left(\begin{array}{c c c}-1&1&0\\ 0&3&0\\ -2&4&0\end{array}\right)\left(\begin{array}{c c c}1&2&-1\\ 3&0&0\\ 0&4&0\end{array}\right)(0\ {-1}\ 1)^{\top}=21$$

$$f_{B}(u\sigma v)=f_{B}(u^{\prime}v)\quad\text{for each test word }v\in\mathcal{T}.$$

$$p(\delta_{R}(h))\approx\delta_{A}(h)\quad\text{for as many }h\in\Sigma^{*}\text{ as possible.}$$

$$\delta_{A}(h^{\prime})\not\simeq_{A}p(\delta_{R}(h^{\prime}))\quad\text{and}\quad p(\delta_{R}(h^{\prime}))\simeq_{A}p(\delta_{R}(h)),$$

$$x\simeq_{A}y\iff\sum_{i=1}^{|Q_{A}|}\beta_{i}^{2}(x_{i}-y_{i})^{2}<\frac{e^{2}}{|Q_{A}|},$$

Symbols appearing in the WFA figures: $\alpha_{A}$, $\beta_{A}$, $A_{(}$, $A_{)}$, $A_{0},\dotsc,A_{9}$.

Full text

Weighted Automata Extraction from Recurrent Neural Networks

via Regression on State Spaces

Takamasa Okudono, Masaki Waga, Taro Sekiyama, Ichiro Hasuo

National Institute of Informatics & The Graduate University for Advanced Studies

2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430

{tokudono,mwaga,sekiyama,hasuo}@nii.ac.jp

Abstract

We present a method to extract a weighted finite automaton (WFA) from a recurrent neural network (RNN). Our method is based on the WFA learning algorithm by Balle and Mohri, which is in turn an extension of Angluin’s classic $\mathrm{L}^{*}$ algorithm. Our technical novelty is in the use of regression methods for the so-called equivalence queries, thus exploiting the internal state space of an RNN to prioritize counterexample candidates. This way we achieve a quantitative/weighted extension of the recent work by Weiss, Goldberg and Yahav that extracts DFAs. We experimentally evaluate the accuracy, expressivity and efficiency of the extracted WFAs.

1 Introduction

Background

Deep neural networks (DNNs) have been successfully applied to domains such as text, speech, and image processing. Recurrent neural networks (RNNs) (?; ?) are a class of DNNs equipped with the capability of processing sequential data of variable length. The great success of RNNs has been seen in, e.g., machine translation (?), speech recognition (?), and anomaly detection (?; ?).

While it has been experimentally shown that RNNs are a powerful tool to process, predict, and model sequential data, RNNs have known drawbacks such as poor interpretability and costly inference. A research line that attacks this challenge is automata extraction (?; ?). Focusing on RNNs’ use as acceptors (i.e., receiving an input sequence and producing a single Boolean output), these works extract a finite-state automaton from an RNN as a succinct and interpretable surrogate. Automata extraction exposes the internal transitions between the states of an RNN in the form of an automaton, which is then amenable to algorithmic analyses such as reachability and model checking (?). Automata extraction can also be seen as model compression: finite-state automata are usually more compact, and cheaper to run, than neural networks.

Extracting WFAs from RNNs

Most of the existing automata extraction techniques target Boolean-output RNNs, which however excludes many applications. In sentiment analysis, it is desired to know the quantitative strength of sentiment, besides its (Boolean) existence (?). RNNs with real values as their output are also useful in classification tasks. For example, predicting class probabilities is key in some approaches to semi-supervised learning (?) and ensemble methods (?).

This motivates extraction of quantitative finite-state machines as abstraction of RNNs. We find the formalism of weighted finite automata (WFAs) suited for this purpose. A WFA is a finite-state machine—much like a deterministic finite automaton (DFA)—but its transitions as well as acceptance values are real numbers (instead of Booleans).

Contribution: Regression-Based WFA Extraction from RNNs

Our main contribution is a procedure that takes a (real-output) RNN $R$ and returns a WFA $A_{R}$ that abstracts $R$. The procedure is based on the WFA learning algorithm in (?), which is in turn based on the famous $\mathrm{L}^{*}$ algorithm for learning DFAs (?). These algorithms learn automata by a series of so-called membership queries and equivalence queries. In our procedure, a membership query is implemented by an inference of the given RNN $R$. We iterate membership queries and use their results to construct a WFA $A$ and to grow it.

The role of equivalence queries is to say when to stop this iteration: it asks if $A$ and $R$ are “equivalent,” that is, if the WFA $A$ obtained so far is comprehensive enough to cover all the possible behaviors of $R$ . This is not possible in general—RNNs are more expressive than WFAs—therefore we inevitably resort to an approximate method. Our technical novelty lies in the method for answering equivalence queries; notably it uses a regression method—e.g., the Gaussian process regression (GPR) and the kernel ridge regression (KRR)—for abstraction of the state space of $R$ .

We conducted experiments to evaluate the effectiveness of our approach. In particular, we are concerned with the following questions:

1) how similar the behavior of the extracted WFA $A_{R}$ and that of the original RNN $R$ are;
2) how applicable our method is to an RNN that is a more expressive model than WFAs; and
3) how efficient the inference of the WFA $A_{R}$ is, when compared with the inference of $R$.

The experiments we designed for questions 1) and 3) use RNNs trained using randomly generated WFAs. The results show, on question 1), that the WFAs extracted by our method approximate the original RNNs accurately. This is especially so when compared with a baseline algorithm (a straightforward adaptation of ? (?)). On question 3), the inference of the extracted WFAs is about 1300 times faster than that of the original RNNs. On question 2), we devised an RNN that models a weighted variant of a (non-regular) language of balanced parentheses. Although this weighted language is beyond WFAs’ expressivity, we found that our method extracts WFAs that successfully approximate the RNN up to a certain depth bound.

The paper is organized as follows. Angluin’s $\mathrm{L}^{*}$ algorithm and its weighted adaptation are recalled in §2. Our WFA extraction procedure is described in §3 focusing on our novelty, namely the regression-based procedure for answering equivalence queries. Comparison with the DFA extraction by ? (?) is given there, too. In §4 we discuss our experiment results.

Potential Applications

A major potential application of WFA extraction is to analyze an RNN $R$ via its interpretable surrogate $A_{R}$. The theory of WFAs offers a number of analysis methods, such as the bisimulation metric (?), a distance notion between WFAs, that will allow us to tell how far apart two RNNs are.

Another major potential application is as “a poor man’s RNN $R$ .” While RNN inference is recognized to be rather expensive (especially for edge devices), simpler models by WFAs should be cheaper to run. Indeed, our experimental results show (§4) that WFA inference is about 1300 times faster than inference of original RNNs.

Since WFAs are defined over a finite alphabet, we restrict to RNNs $R$ that take sequences over a finite alphabet. This restriction should not severely limit the applicability of our method. Indeed, such RNNs (over a finite alphabet) have been successfully applied in many domains, including intrusion prediction (?), malware detection (?), and DNA-protein binding prediction (?). Moreover, even if inputs are real numbers, quantization is commonly employed without losing much precision. See, e.g., recent (?).

Related Work

The relationship between RNNs and automata has been studied in both non-quantitative (?; ?; ?; ?) and quantitative (?; ?) settings. Some of these works feature automata extraction from RNNs; we shall now discuss recent ones among them.

The work by ? (?) is a pioneer in automata extraction from RNNs. They extract DFAs from RNNs, using a variation of the $\mathrm{L}^{*}$ algorithm, much like this work. We provide a systematic comparison in §3.3, identifying some notable similarities and differences.

? (?) extract a WFA from a black-box sequence acceptor, an example of which is an RNN. Their method does not use equivalence queries; in contrast, we exploit the internal state space of an RNN to approximately answer equivalence queries.

DeepStellar (?) extracts Markov chains from RNNs, and uses them for coverage-based testing and adversarial sample detection. Their extraction, differently from our $\mathrm{L}^{*}$ -like method, uses profiling from the training data and discrete abstraction of the state space.

? (?) propose a new RNN architecture that makes it easier to extract a DFA from a trained model. To apply their method, one has to modify the structure of an RNN before training, while our method does not require any special structure in the RNN and can be applied to already trained RNNs.

? (?) introduce a neural network architecture that can represent (restricted forms of) CNNs and RNNs. WFAs could also be expressed by their architecture, but extraction of automata is out of their interest.

A major approach to optimizing neural networks is by compression: model pruning (?), quantization (?), and distillation (?). Combination and comparison with these techniques is interesting future work.

2 Preliminaries

We fix a finite alphabet $\Sigma$. The set of (finite-length) words over $\Sigma$ is $\Sigma^{*}$. The empty word (of length $0$) is denoted by $\varepsilon$. The length of a word $w\in\Sigma^{*}$ is denoted by $|w|$.

We recall basic notions on WFAs. See (?) for details.

Definition 1 (WFA).

A weighted finite automaton (WFA) over $\Sigma$ is a quadruple $A=\bigl(Q_{A},\alpha_{A},\beta_{A},(A_{\sigma})_{\sigma\in\Sigma}\bigr)$. Here $Q_{A}$ is a finite set of states; $\alpha_{A},\beta_{A}$ are column vectors of size $|Q_{A}|$ called the initial and final vectors; and $A_{\sigma}$ is a transition matrix of $\sigma$, given for each $\sigma\in\Sigma$. For each $\sigma\in\Sigma$, $A_{\sigma}$ is a matrix of size $|Q_{A}|\times|Q_{A}|$.

Definition 2 (configuration of a WFA).

Let $A$ be the WFA in Def. 1. A configuration of $A$ is a row vector $x\in\mathbb{R}^{Q_{A}}$ . For a word $w=\sigma_{1}\sigma_{2}\dotsc\sigma_{n}\in\Sigma^{*}$ (where $\sigma_{i}\in\Sigma$ ), the configuration of $A$ at $w$ is defined by $\textstyle\delta_{A}(w)=\alpha_{A}^{\top}\cdot\bigl{(}\prod_{i=1}^{n}A_{\sigma_{i}}\bigr{)}$ .

Obviously $\delta_{A}(w)\in\mathbb{R}^{Q_{A}}$ is a row vector of size $|Q_{A}|$ ; it records the weight at each state $q\in Q_{A}$ after reading $w$ .

Definition 3 (weight $f_{A}(w)$ of a word in a WFA).

Let a WFA $A$ and a word $w=\sigma_{1}\dotsc\sigma_{n}$ be as in Def. 2. The weight of $w$ in $A$ is given by $\textstyle f_{A}(w)=\alpha_{A}^{\top}\cdot\bigl(\prod_{i=1}^{n}A_{\sigma_{i}}\bigr)\cdot\beta_{A}$, that is, the configuration at $w$ multiplied by the final vector.

Example 4.

Let $\Sigma=\{a,b\}$ , $Q_{A}=\{q_{1},q_{2},q_{3}\}$ , $\alpha_{A}=(1\ 2\ 3)^{\top}$ , $\beta_{A}=(0\ {-1}\ 1)^{\top}$ , $A_{a}=\left(\begin{array}[c]{c c c}1&2&-1\\ 3&0&0\\ 0&4&0\\ \end{array}\right)$ , and $A_{b}=\left(\begin{array}[c]{c c c}-1&1&0\\ 0&3&0\\ -2&4&0\\ \end{array}\right)$ . For the WFA $A=(Q_{A},\alpha_{A},\beta_{A},(A_{\sigma})_{\sigma\in\Sigma})$ over $\Sigma$ , and $w=ba$ , the configuration $\delta_{A}(w)$ and the weight $f_{A}(w)$ are as follows.

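$$\delta_{A}(w)=\alpha_{A}^{\top}A_{b}A_{a}=(1\ 2\ 3)\left(\begin{array}{c c c}-1&1&0\\ 0&3&0\\ -2&4&0\end{array}\right)\left(\begin{array}{c c c}1&2&-1\\ 3&0&0\\ 0&4&0\end{array}\right)=(50\ {-14}\ 7)$$

$$f_{A}(w)=\alpha_{A}^{\top}A_{b}A_{a}\beta_{A}=(1\ 2\ 3)\left(\begin{array}{c c c}-1&1&0\\ 0&3&0\\ -2&4&0\end{array}\right)\left(\begin{array}{c c c}1&2&-1\\ 3&0&0\\ 0&4&0\end{array}\right)(0\ {-1}\ 1)^{\top}=21$$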

Fig. 1 illustrates the WFA $A=(Q_{A},\alpha_{A},\beta_{A},(A_{\sigma})_{\sigma\in\Sigma})$, where the transitions with weight $0$ are omitted.
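As a quick check of Example 4, the following minimal Python sketch evaluates $\delta_{A}(w)$ and $f_{A}(w)$ directly from Def. 2 and Def. 3. The function names (`vec_mat`, `configuration`, `weight`) and the dictionary-of-matrices representation are illustrative choices, not part of the paper.

```python
def vec_mat(x, M):
    """Row vector x times matrix M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def configuration(alpha, trans, word):
    """delta_A(w) = alpha^T A_{s1} ... A_{sn}, returned as a row vector."""
    x = list(alpha)
    for sym in word:
        x = vec_mat(x, trans[sym])
    return x

def weight(alpha, beta, trans, word):
    """f_A(w) = delta_A(w) . beta."""
    return sum(xi * bi for xi, bi in zip(configuration(alpha, trans, word), beta))

# The WFA of Example 4.
alpha = [1, 2, 3]
beta = [0, -1, 1]
trans = {
    "a": [[1, 2, -1], [3, 0, 0], [0, 4, 0]],
    "b": [[-1, 1, 0], [0, 3, 0], [-2, 4, 0]],
}

print(configuration(alpha, trans, "ba"))  # [50, -14, 7]
print(weight(alpha, beta, trans, "ba"))   # 21
```

Reading a word costs one vector-matrix product per symbol, which is what makes WFA inference cheap.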

Definition 5 (DFA).

A DFA is defined much like in Def. 1, except that

the entries of matrices are $\mathtt{tt}$ and $\mathtt{ff}$ ;
we replace the use of $+,\times$ with $\lor,\land$ , respectively; and
we impose determinacy, that exactly one entry is $\mathtt{tt}$ in each row of $A_{\sigma}$ , and that only one entry is $\mathtt{tt}$ in the initial vector $\alpha_{A}$ .

The definitions of $\delta_{A}$ and $f_{A}$ in Def. 2–3 adapt to DFAs. For $w\in\Sigma^{*}$ , the configuration vector $\delta_{A}(w)\in\{\mathtt{tt},\mathtt{ff}\}^{Q_{A}}$ has exactly one $\mathtt{tt}$ ; the state $q$ whose entry is $\mathtt{tt}$ is called the $w$ -successor of $A$ . $w$ is accepted by $A$ if $f_{A}(w)=\mathtt{tt}$ .
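Def. 5 can be illustrated by rerunning the WFA recipe with $\lor,\land$ in place of $+,\times$. The sketch below does this for a hypothetical two-state DFA that accepts words with an even number of `a`'s. The automaton and all names are assumptions made for illustration.

```python
def bool_vec_mat(x, M):
    """Row vector times matrix with (or, and) in place of (+, *)."""
    return [any(x[i] and M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def dfa_configuration(alpha, trans, word):
    """Boolean analogue of delta_A(w)."""
    x = list(alpha)
    for sym in word:
        x = bool_vec_mat(x, trans[sym])
    return x

def accepts(alpha, beta, trans, word):
    """Boolean analogue of f_A(w)."""
    return any(xi and bi for xi, bi in zip(dfa_configuration(alpha, trans, word), beta))

# Hypothetical DFA over {a, b}: state 0 = even number of a's, state 1 = odd.
tt, ff = True, False
alpha = [tt, ff]                       # exactly one tt: start in state 0
beta = [tt, ff]                        # state 0 is accepting
trans = {"a": [[ff, tt], [tt, ff]],    # 'a' swaps the two states
         "b": [[tt, ff], [ff, tt]]}    # 'b' leaves the state unchanged

print(dfa_configuration(alpha, trans, "ab"))  # [False, True]
print(accepts(alpha, beta, trans, "aba"))     # True
```

Because each row of each $A_{\sigma}$ has exactly one $\mathtt{tt}$, every configuration has exactly one $\mathtt{tt}$ entry, namely the $w$-successor.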

Recurrent Neural Networks

Our view of a recurrent neural network (RNN) is almost a black-box. We need only the following two operations: feeding an input word $w\in\Sigma^{*}$ and observing its output (a real number); and additionally, observing the internal state (a vector) after feeding a word. This allows us to model RNNs in the following abstract way.

Definition 6 (RNN).

Let $d\in\mathbb{N}$ be a natural number called a dimension. A (real-valued) RNN is a triple $R=(\alpha_{R},\beta_{R},g_{R})$ , where $\alpha_{R}\in\mathbb{R}^{d}$ is an initial state, $\beta_{R}\colon\mathbb{R}^{d}\to\mathbb{R}$ is an output function, and $g_{R}\colon\mathbb{R}^{d}\times\Sigma\to\mathbb{R}^{d}$ is called a transition function. The set $\mathbb{R}^{d}$ is called a state space.

Definition 7 (RNN configuration $\delta_{R}(w)$ , output $f_{R}(w)$ ).

Let $R$ be the RNN in Def. 6. The transition function $g_{R}$ naturally extends to words as follows: $g^{*}_{R}\colon\mathbb{R}^{d}\times\Sigma^{*}\to\mathbb{R}^{d}$ , defined inductively by $g^{*}_{R}(x,\varepsilon)=x$ and $g^{*}_{R}(x,w\sigma)=g_{R}\bigl{(}g^{*}_{R}(x,w),\sigma\bigr{)}$ , where $w\in\Sigma^{*}$ and $\sigma\in\Sigma$ .

The configuration $\delta_{R}(w)$ of the RNN $R$ at a word $w$ is defined by $\delta_{R}(w)=g^{*}_{R}(\alpha_{R},w)$ . The output $f_{R}(w)\in\mathbb{R}$ , of $R$ for the input $w$ , is defined by $f_{R}(w)=\beta_{R}\bigl{(}\delta_{R}(w)\bigr{)}$ .
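The abstract view of Def. 6 and Def. 7 can be written down in a few lines. The sketch below is only an interface illustration: the class name `AbstractRNN` and the one-dimensional toy transition function are assumptions, and a real use would wrap a trained network instead.

```python
import math

class AbstractRNN:
    """An RNN viewed as the triple (alpha_R, beta_R, g_R) of Def. 6."""

    def __init__(self, alpha, beta, g):
        self.alpha = alpha  # initial state, a list of d floats
        self.beta = beta    # output function: state -> float
        self.g = g          # transition function: (state, symbol) -> state

    def delta(self, word):
        """delta_R(w) = g*_R(alpha_R, w): fold g_R over the symbols of w."""
        x = list(self.alpha)
        for sym in word:
            x = self.g(x, sym)
        return x

    def f(self, word):
        """f_R(w) = beta_R(delta_R(w))."""
        return self.beta(self.delta(word))

# Toy one-dimensional "RNN" (hypothetical, not a trained network).
toy = AbstractRNN(
    alpha=[0.0],
    beta=lambda x: x[0],
    g=lambda x, s: [math.tanh(x[0] + (1.0 if s == "a" else -1.0))],
)

print(toy.delta(""))  # [0.0]
print(toy.f("ab"))
```

The extraction procedure only needs the two operations `f` (a membership query) and `delta` (observing the internal state).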

2.1 Angluin’s $\mathrm{L}^{*}$ Algorithm

Angluin’s $\mathrm{L}^{*}$ -algorithm learns a given DFA $B$ by a series of membership and equivalence queries. We sketch the algorithm; see (?) for details. Its outline is in Fig. 2.

A membership query is a black-box observation of the DFA $B$ : it feeds $B$ with a word $w\in\Sigma^{*}$ ; and obtains $f_{B}(w)\in\{\mathtt{tt},\mathtt{ff}\}$ , i.e., whether $w$ is accepted by $B$ .

The core of the algorithm is to construct the observation table $T$ ; see Fig. 3(a). The table has words as the row and column labels; its entries are either $\mathtt{tt}$ or $\mathtt{ff}$ . The row labels are called access words; the column labels are test words. We let $\mathcal{A},\mathcal{T}$ stand for the sets of access and test words. The entry of $T$ at row $u\in\mathcal{A}$ and column $v\in\mathcal{T}$ is given by $f_{B}(uv)$ —a value that we can obtain from a suitable membership query.

Therefore we extend a table $T$ by a series of membership queries. We do so until $T$ becomes closed; this is the top loop in Fig. 2. A table $T$ is closed if, for any access word $u\in\mathcal{A}$ and $\sigma\in\Sigma$ , there is an access word $u^{\prime}\in\mathcal{A}$ such that

$$f_{B}(u\sigma v)=f_{B}(u^{\prime}v)\quad\text{for each test word }v\in\mathcal{T}.\qquad(1)$$

The closedness condition essentially says that the role of the extended word $u\sigma$ is already covered by some existing word $u^{\prime}\in\mathcal{A}$ . The notion of “role” here, formalized in (1), is a restriction of the well-known Myhill–Nerode relation, from all words $v\in\Sigma^{*}$ to $v\in\mathcal{T}$ .
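The closedness check can be sketched as follows, with membership queries modeled as a plain Python function `f`. The helper names and the example language (words with an even number of `a`'s) are assumptions for illustration. The function returns a word $u\sigma$ that witnesses non-closedness, which is the natural candidate to add to $\mathcal{A}$.

```python
def row(f, u, tests):
    """Row of the observation table for the word u."""
    return tuple(f(u + v) for v in tests)

def unclosed_witness(f, access, tests, alphabet):
    """Return some u+sigma whose row matches no access word, or None if closed."""
    rows = {row(f, u, tests) for u in access}
    for u in access:
        for s in alphabet:
            if row(f, u + s, tests) not in rows:
                return u + s
    return None

# Hypothetical target language: words over {a, b} with an even number of a's.
even_a = lambda w: w.count("a") % 2 == 0

print(unclosed_witness(even_a, [""], [""], "ab"))       # a
print(unclosed_witness(even_a, ["", "a"], [""], "ab"))  # None
```

In the second call the table is closed: the two rows correspond to the two states of the minimal DFA for this language.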

A closed table $T$ induces a DFA $A_{T}$ (Fig. 2), much like in the Myhill–Nerode theorem. We note that the resulting DFA $A_{T}$ is necessarily minimized. The DFA $A_{T}$ undergoes an equivalence query that asks if $A_{T}\cong B$ ; an equivalence query is answered with a counterexample—i.e., $w\in\Sigma^{*}$ such that $f_{A_{T}}(w)\neq f_{B}(w)$ —if $A_{T}\not\cong B$ .

The $\mathrm{L}^{*}$ algorithm is a deterministic learning algorithm (at least in its original form), unlike many recent learning algorithms that are statistical. The greatest challenge in practical use of the $\mathrm{L}^{*}$ algorithm is to answer equivalence queries. When $B$ is a finite automaton that generates a regular language, there is a complete algorithm for deciding the language equivalence $A_{T}\cong B$ . However, if we use a more expressive model in place of a DFA $B$ , checking $A_{T}\cong B$ becomes a nontrivial task.

2.2 $\mathrm{L}^{*}$ Algorithm for WFA Learning

The classic $\mathrm{L}^{*}$ algorithm for learning DFAs has seen a weighted extension (?): it learns a WFA $B$ , again via a series of membership and equivalence queries. The overall structure of the WFA learning algorithm stays the same as in Fig. 2; here we highlight major differences.

Firstly, the entries of an observation table $T$ are now real numbers, reflecting the fact that the value $f_{B}(uv)$ for a WFA $B$ is in $\mathbb{R}$ instead of in $\{\mathtt{tt},\mathtt{ff}\}$ (see Def. 3). An example of an observation table is given in Fig. 3(b).

Secondly, the notion of closedness is adapted to the weighted (i.e., linear-algebraic) setting, as follows. A table $T$ is closed if, for any access word $u\in\mathcal{A}$ and $\sigma\in\Sigma$, the vector $\bigl(\,f_{B}(u\sigma\,v)\,\bigr)_{v\in\mathcal{T}}\in\mathbb{R}^{|\mathcal{T}|}$ can be expressed as a linear combination of the vectors in $\bigl\{\,\bigl(\,f_{B}(u^{\prime}v)\,\bigr)_{v\in\mathcal{T}}\,\big|\,u^{\prime}\in\mathcal{A}\bigr\}$. Note that the vector $\bigl(\,f_{B}(u^{\prime}v)\,\bigr)_{v\in\mathcal{T}}^{\top}$ in the latter set is precisely the row vector in $T$ at row $u^{\prime}$.

For example, the table $T$ in Fig. 3(b) is obviously closed, since the three row vectors are linearly independent and thus span the whole $\mathbb{R}^{3}$. The above definition of closedness comes naturally in view of Def. 2. For a WFA, a configuration (during its execution) is not a single state, but a weighted superposition $x\in\mathbb{R}^{Q_{A}}$ of states. The closedness condition asserts that the role of $u\sigma$ is covered by a suitable superposition of words $u^{\prime}\in\mathcal{A}$. The construction of the WFA $A_{T}$ from a closed table $T$ (see Fig. 2) reflects this intuition. See (?). We note that the resulting $A_{T}$ is minimal, much like in §2.1.
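Weighted closedness is a rank condition: the table is closed exactly when adding the row of any $u\sigma$ does not increase the rank of the access-word rows. A minimal sketch, assuming exact (noise-free) membership queries and a hand-rolled Gaussian elimination; the target function `count_a` and the tolerance `tol` are illustrative assumptions:

```python
def rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in r] for r in rows]
    rk = 0
    for c in range(len(m[0])):
        if rk == len(m):
            break
        piv = max(range(rk, len(m)), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) <= tol:
            continue  # no usable pivot in this column
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][c] / m[rk][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def table_row(f, u, tests):
    return [f(u + v) for v in tests]

def is_closed(f, access, tests, alphabet):
    """True if every row for u+sigma lies in the span of the access-word rows."""
    base = [table_row(f, u, tests) for u in access]
    r = rank(base)
    return all(rank(base + [table_row(f, u + s, tests)]) == r
               for u in access for s in alphabet)

# Hypothetical target: f(w) = number of a's in w.
count_a = lambda w: w.count("a")

print(is_closed(count_a, [""], ["", "a"], "ab"))       # False
print(is_closed(count_a, ["", "a"], ["", "a"], "ab"))  # True
```

In the second call the two rows $(0\ 1)$ and $(1\ 2)$ already span $\mathbb{R}^{2}$, mirroring the argument given for Fig. 3(b).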

In the literature (?), an observation table $T$ is presented as a so-called Hankel matrix. This opens the way to further extensions of the method, such as an approximate learning algorithm via the singular-value decomposition (SVD).

3 WFA Extraction from an RNN

We present our main contribution, namely a procedure that extracts a WFA from a given RNN. After briefly describing its outline (that is the same as Fig. 2), we focus on the greatest challenge of answering equivalence queries.

3.1 Procedure Outline

Our procedure uses the weighted $\mathrm{L}^{*}$ algorithm sketched in §2.2. As we discussed in §2.1, the greatest challenge is how to answer equivalence queries; our novel approach is to use regression and synthesize what we call a configuration abstraction function $p\colon\mathbb{R}^{d}\to\mathbb{R}^{Q_{A}}$ . See Fig. 4.

The outline of our procedure thus stays the same as in Fig. 2, but we need to take care of noisy outputs from an RNN because they prevent the observation table from being precisely closed (in the sense of §2.2). To resolve this issue, we use a noise-tolerant algorithm (?) which approximately determines whether the observation table is closed. This approximate algorithm employs SVD and cuts off singular values that are smaller than a threshold called rank tolerance. In the choice of a rank tolerance, we face a trade-off between accuracy and regularity. If the rank tolerance is large, the WFA learning algorithm tends to ignore the counterexample given by an equivalence query and, as a result, produces an inaccurate WFA. We use a heuristic to decrease the rank tolerance when two or more equivalence queries return the same counterexample. See Appendix A.1 for details.

3.2 Equivalence Queries for WFAs and RNNs

Algorithm 1 shows our procedure to answer an equivalence query. The procedure Ans-EqQ is the main procedure, and it returns either $\mathtt{Equivalent}$ or a counterexample word (as in Fig. 2). It calls the auxiliary procedure Consistent?, which decides if we refine the current configuration abstraction function $p\colon\mathbb{R}^{d}\to\mathbb{R}^{Q_{A}}$ in Fig. 4 (Line 12).

Best-First Search for a Counterexample

The procedure Ans-EqQ is essentially a best-first search for a counterexample, that is, a word $h\in\Sigma^{*}$ such that the difference between the output values $f_{R}(h)$ from the RNN and $f_{A}(h)$ from the WFA is larger than the error tolerance $e$ ($>0$). We first outline Ans-EqQ and then go into the technical detail.

We manage counterexample candidates by the priority queue $\mathtt{queue}$ , which gives a higher priority to a candidate more likely to be a counterexample. The already investigated words are in the set $\mathtt{visited}$ . The queue $\mathtt{queue}$ initially contains only the empty word $\varepsilon$ (Line 5).

We search for a counterexample in the main loop starting from Line 6. Let $h$ be a word popped from $\mathtt{queue}$, that is, the candidate most likely to be a proper counterexample among the words in the queue. If $h$ is a counterexample, Ans-EqQ returns it (Lines 8–9). Otherwise, after refining the configuration abstraction function $p$ (Lines 10–12), new candidates $h\sigma$, the extensions of $h$ with a character $\sigma$, are pushed to $\mathtt{queue}$ with their priorities only if the neighborhood of $h$ in the state space of the WFA $A$ does not contain sufficiently many already investigated words, i.e., only if it has not been investigated sufficiently (Lines 15–18). This is because, if the neighborhood of $h$ has been investigated sufficiently, we expect that the neighborhoods of the new candidates $h\sigma$ have also been investigated and, therefore, that the words $h\sigma$ do not have to be investigated any further. We use $p$: 1) to decide if the neighborhood of $h$ has been investigated sufficiently (Line 15); and 2) to calculate the priorities of the new candidates (Line 17). Note that we add $h$ to $\mathtt{visited}$ in Line 13 since it has been investigated there. If none of the candidates is a counterexample, Ans-EqQ returns $\mathtt{Equivalent}$ (Line 19).

Configuration Abstraction Function $p$

To use $p$ for the above purposes, the property we expect from the configuration abstraction function $p\colon\mathbb{R}^{d}\to\mathbb{R}^{Q_{A}}$ is as follows:

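$$p(\delta_{R}(h))\approx\delta_{A}(h)\quad\text{for as many }h\in\Sigma^{*}\text{ as possible.}\qquad(2)$$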

See Def. 2 and 7 for $\delta_{A}\colon\Sigma^{*}\to\mathbb{R}^{Q_{A}}$ and $\delta_{R}\colon\Sigma^{*}\to\mathbb{R}^{d}$, respectively. To synthesize such a function $p$, we employ regression using the data $\bigl\{\,\bigl(\delta_{R}(h^{\prime}),\delta_{A}(h^{\prime})\bigr)\in\mathbb{R}^{d}\times\mathbb{R}^{Q_{A}}\,\big|\,h^{\prime}\in\mathtt{visited}\,\bigr\}$. See Line 12. Note that we can use any regression method to learn $p$.

We refine $p$ during the best-first counterexample search. Specifically, in Line 10, we use the procedure Consistent? to check if the current $p$ —obtained by regression—is consistent with a counterexample candidate $h$ . The consistency means that $p(\delta_{R}(h))$ and $\delta_{A}(h)$ are close to each other, which is formalized by the relation $\simeq_{A}$ defined later. If the check fails (i.e., if Consistent? returns $\mathtt{NG}$ ), we refine $p$ by regression to make $p$ consistent with $h$ (and the already investigated words in $\mathtt{visited}$ ). See Line 12.
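To make the regression step concrete, here is a dependency-free sketch that fits $p$ by kernel ridge regression with a Gaussian kernel. This is a stand-in rather than the authors' implementation: the kernel width `gamma`, the regularizer `lam`, the hand-written linear solver, and the training data are all assumptions.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def fit_p(states, configs, lam=1e-8, gamma=1.0):
    """Kernel ridge regression from RNN states to WFA configurations."""
    n = len(states)
    K = [[rbf(states[i], states[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    coef = solve(K, configs)  # n x |Q_A| dual coefficients
    def p(x):
        k = [rbf(x, s, gamma) for s in states]
        return [sum(k[i] * coef[i][j] for i in range(n))
                for j in range(len(configs[0]))]
    return p

# Hypothetical training data: three RNN states (d = 2) and their WFA configurations.
states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
configs = [[1.0, 0.0], [0.5, 2.0], [0.25, -1.0]]
p = fit_p(states, configs)
print(p([1.0, 0.0]))  # close to [0.5, 2.0]
```

With a small `lam` the fitted $p$ nearly interpolates the training pairs, which is the behavior property (2) asks for on visited words. Any regressor with a fit/predict interface could be substituted.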

Consistency Checking by Consistent?

The procedure Consistent? in Line 10 is defined as follows: it returns $\mathtt{NG}$ if there exists $h^{\prime}\in\mathtt{visited}$ such that

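$$\delta_{A}(h^{\prime})\not\simeq_{A}p(\delta_{R}(h^{\prime}))\quad\text{and}\quad p(\delta_{R}(h^{\prime}))\simeq_{A}p(\delta_{R}(h)),\qquad(3)$$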

and returns $\mathtt{OK}$ otherwise. The basic idea of Consistent? is to return $\mathtt{NG}$ if $p(\delta_{R}(h))\not\simeq_{A}\delta_{A}(h)$ because it means the violation of the desired property (2). However, to reduce the run-time cost of refining $p$ and to prevent learning from outliers, we adopt the alternative approach presented above, which is taken from ? (?).

The existence of $h^{\prime}$ satisfying the condition (3) approximates the violation of the property (2) in the following sense. If there is a word $h^{\prime}$ satisfying the first part of the condition (3), $p$ has to be refined because we find the property (2) violated with $h^{\prime}$ . The second part of the condition (3) means that $h^{\prime}$ seems to behave similarly to $h$ according to the configuration abstraction function $p$ . We expect the second part to prevent $p$ from being refined with outliers because the neighborhoods of the words used for refining $p$ must have been investigated twice or more.
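The consistency check can be sketched as follows. The two-argument callable `similar` stands for the relation $\simeq_{A}$, and the toy data (one-dimensional configurations, a stale constant abstraction `stale_p`) are assumptions made for illustration.

```python
def consistent(h, visited, p, delta_R, delta_A, similar):
    """Return "NG" if some visited word h2 satisfies condition (3), else "OK"."""
    ph = p(delta_R(h))
    for h2 in visited:
        ph2 = p(delta_R(h2))
        # First part: p is wrong at h2.  Second part: p places h2 near h.
        if not similar(delta_A(h2), ph2) and similar(ph2, ph):
            return "NG"
    return "OK"

# Toy data (hypothetical): the state halves on every 'a'.
delta_R = lambda w: [0.5 ** w.count("a")]
delta_A = delta_R                        # the WFA tracks the RNN state exactly
similar = lambda x, y: abs(x[0] - y[0]) < 0.05
stale_p = lambda x: [1.0]                # outdated abstraction that ignores its input
exact_p = lambda x: list(x)

print(consistent("aa", ["", "a"], stale_p, delta_R, delta_A, similar))  # NG
print(consistent("aa", ["", "a"], exact_p, delta_R, delta_A, similar))  # OK
```

In the first call the visited word `a` triggers the refinement: `stale_p` maps it to a configuration far from $\delta_{A}(a)$ but close to where it maps `aa`.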

Equivalence Relation $\simeq_{A}$

For a given WFA $A$, we define the relation $\simeq_{A}$ in the configuration space $\mathbb{R}^{Q_{A}}$ by

$$x\simeq_{A}y\iff\sum_{i=1}^{|Q_{A}|}\beta_{i}^{2}(x_{i}-y_{i})^{2}<\frac{e^{2}}{|Q_{A}|},$$

where $\beta$ is the final vector of the WFA $A$ . It satisfies the following.

1. If $x\simeq_{A}y$ holds, the difference of the output values for the configurations $x$ and $y$ of the WFA $A$ is smaller than the error tolerance $e$, that is, $\left|(x-y)\cdot\beta\right|<e$.

2. If the $i$-th element of the final vector $\beta$ becomes large, the neighborhood $\{y\in\mathbb{R}^{Q_{A}}\mid x\simeq_{A}y\}$ of $x$ shrinks in the direction of the $i$-th axis.

3. The neighborhood defined by $\simeq_{A}$ is an ellipsoid, a reasonable variant of an open ball as a neighborhood.
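The relation $\simeq_{A}$ is a one-liner in code. The sketch below evaluates it for the final vector of Example 4 with $e=0.05$; the function name `similar` and the sample configurations are illustrative assumptions.

```python
def similar(x, y, beta, e):
    """x ~_A y  iff  sum_i beta_i^2 (x_i - y_i)^2 < e^2 / |Q_A|."""
    n = len(beta)
    return sum((b * (xi - yi)) ** 2 for b, xi, yi in zip(beta, x, y)) < e ** 2 / n

beta = [0, -1, 1]            # final vector of Example 4
x = [50, -14, 7]
y = [51, -14.01, 7.01]       # differs a lot in coordinate 1, where beta is 0
z = [50, -14.1, 7]           # differs by 0.1 in a coordinate that beta weights

print(similar(x, y, beta, 0.05))  # True
print(similar(x, z, beta, 0.05))  # False

# Property 1: similar configurations give outputs that differ by less than e.
output_gap = abs(sum(b * (xi - yi) for b, xi, yi in zip(beta, x, y)))
print(output_gap < 0.05)          # True
```

The first property follows from the Cauchy–Schwarz inequality: $\left|(x-y)\cdot\beta\right|\leq\sqrt{|Q_{A}|}\,\sqrt{\sum_{i}\beta_{i}^{2}(x_{i}-y_{i})^{2}}<\sqrt{|Q_{A}|}\cdot e/\sqrt{|Q_{A}|}=e$.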

A Heuristic for Equivalence Checking of a WFA and an RNN

Although the best-first search above works well, we introduce an additional heuristic to further improve the run-time performance of our algorithm. The heuristic deems $R$ and $A$ to be equivalent if the word $h$ popped from $\mathtt{queue}$ is so long that it could not occur in the training of the RNN. This heuristic is based on the expectation that, when an impossible word is the most likely to be a counterexample, all possible words are unlikely to be a counterexample, and so $R$ and $A$ are considered to be equivalent. This heuristic is applied immediately after popping $h$ (Line 7), as follows: we suppose that the maximum length $L$ of possible words is given; and, if the length of $h$ is larger than $L$, Ans-EqQ returns $\mathtt{Equivalent}$. We empirically confirm the usefulness of this heuristic in §4.
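Putting the pieces of this section together, the search can be sketched as below. This is a simplified reading of the prose description, not a transcription of Algorithm 1. In particular, the `priority` function, fitting the initial $p$ on the empty word alone, the nearest-neighbour regressor, and the toy models are all assumptions.

```python
import heapq
from itertools import count

def ans_eq_query(f_R, f_A, delta_R, delta_A, alphabet,
                 fit_p, similar, priority, e=0.05, M=2, L=20):
    """Best-first search for a word h with |f_R(h) - f_A(h)| > e.

    Returns the counterexample, or None when A is deemed equivalent to R.
    """
    def pairs(words):
        return [(delta_R(w), delta_A(w)) for w in words]

    def is_consistent(h, visited, p):
        ph = p(delta_R(h))
        return not any(
            not similar(delta_A(w), p(delta_R(w))) and similar(p(delta_R(w)), ph)
            for w in visited)

    tie = count()                            # FIFO tie-breaker for equal priorities
    queue = [(0.0, next(tie), "")]           # min-heap, so priorities are negated
    visited = []
    p = fit_p(pairs([""]))                   # assumption: initial p from the empty word
    while queue:
        _, _, h = heapq.heappop(queue)
        if len(h) > L:                       # length heuristic
            return None
        if abs(f_R(h) - f_A(h)) > e:         # h is a counterexample
            return h
        if not is_consistent(h, visited, p):
            p = fit_p(pairs(visited + [h]))  # refine p by regression
        visited.append(h)
        ph = p(delta_R(h))
        vn = sum(1 for w in visited if similar(p(delta_R(w)), ph))
        if vn <= M:                          # neighborhood of h not yet crowded
            for s in alphabet:
                heapq.heappush(queue, (-priority(h + s, p), next(tie), h + s))
    return None

# Toy instantiation (all hypothetical): the "RNN" halves its state on every 'a'.
delta_R = lambda w: [0.5 ** w.count("a")]
f_R = lambda w: delta_R(w)[0]
dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
similar = lambda x, y: dist(x, y) < 0.05

def nearest_neighbour(data):
    """Stand-in regressor: return the target of the closest training input."""
    return lambda x: min(data, key=lambda xy: dist(xy[0], x))[1]

def discrepancy(delta_A):
    """Assumed priority: how far p's prediction is from the WFA configuration."""
    return lambda w, p: dist(p(delta_R(w)), delta_A(w))

# Candidate 1: constant configuration, so it disagrees with R on "a".
const = lambda w: [1.0]
print(ans_eq_query(f_R, lambda w: 1.0, delta_R, const, "ab",
                   nearest_neighbour, similar, discrepancy(const)))         # a
# Candidate 2: agrees with R everywhere, so no counterexample is found.
print(ans_eq_query(f_R, f_R, delta_R, delta_R, "ab",
                   nearest_neighbour, similar, discrepancy(delta_R), L=3))  # None
```

In the second call the search stops through the length heuristic once a word longer than $L$ reaches the front of the queue.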

Termination of the Procedure

Algorithm 1 does not always terminate in a finite amount of time. If the procedure does not find any counterexample at Line 9 and the points $p(\delta_{R}(\mathtt{visited}))$ are so scattered in the configuration space that the value $\#vn$ at Line 15 is always small, words are always pushed to $\mathtt{queue}$ at Line 18. In that case, the condition to exit the main loop at Line 6 is never satisfied.

3.3 Comparison with Weiss et al., 2018

Our WFA extraction method can be seen as a weighted extension of the $\mathrm{L}^{*}$-based procedure (?) to extract a DFA from an RNN. Note that a WFA defines a function of type $\Sigma^{*}\to\mathbb{R}$ and a DFA defines a function of type $\Sigma^{*}\to\{\mathtt{tt},\mathtt{ff}\}$.

The main technical novelty of the method in (?) is how to answer equivalence queries. It features the clustering of the state space $\mathbb{R}^{d}$ of an RNN into finitely many clusters, using support-vector machines (SVMs). Each cluster of $\mathbb{R}^{d}$ is associated with a state $q\in Q_{A}$ of the DFA.

Our theoretical observation is that such clustering amounts to giving a function $p\colon\mathbb{R}^{d}\to Q_{A}$ . Moreover, for a DFA, $Q_{A}$ is the configuration space (as well as the state space, see Def. 5). Therefore, our WFA extraction method can be seen as an extension of the DFA extraction procedure in (?).

4 Experiments

We conducted experiments to evaluate the utility of our regression-based WFA extraction method. Specifically, we pose the following questions.

RQ1: Do the WFAs extracted by our algorithm approximate the original RNNs accurately?

RQ2: Does our algorithm work well with RNNs whose expressivity is beyond WFAs?

RQ3: Do the extracted WFAs run more efficiently, specifically in inference time, than given RNNs?

For RQ1, we compared the performance with a baseline algorithm (a straightforward adaptation of Balle and Mohri (2015)'s $\mathrm{L}^{*}$ algorithm). Here we focused on “automata-like” RNNs, that is, those RNNs trained from an original WFA $A^{\bullet}$. For RQ2, we used an RNN that exhibits “context-free” behaviors.

Experimental Setting. We implemented our method in Python. We write $\mathbf{RGR}(n)$ for the algorithm where the concentration threshold $M$ in Ans-EqQ is set to $n$; other parameters are fixed as follows: error tolerance $e=0.05$ and heuristic parameter $L=20$. We adopt the Gaussian process regression (GPR) provided by scikit-learn as the configuration abstraction function $p$ (we also tried kernel ridge regression, but GPR worked better empirically). Throughout the experiments, our RNNs are 2-layer LSTM networks with dimension size 50, implemented in TensorFlow. The experiments were run on a g3s.xlarge instance on Amazon Web Services (May 2019), with an NVIDIA Tesla M60 GPU, 4 vCPUs of Xeon E5-2686 v4 (Broadwell), and 8 GiB memory.

4.1 RQ1: Extraction from RNNs Modeling WFAs

This experiment examines how well our algorithm works for RNNs modeling WFAs. To do so, we first train RNNs using randomly generated WFAs; we call those WFAs the origins. Then, we compare our algorithm with a baseline in two respects: the accuracy of the extracted WFAs against the trained RNNs, and the running times of the algorithms. We report the results after presenting the details of the baseline, how the RNNs are trained, and how the two algorithms are evaluated.

The Baseline Algorithm: $\mathbf{BFS}(n)$. Like our extraction algorithm, the baseline algorithm $\mathbf{BFS}(n)$, which is parameterized over an integer $n$, is a straightforward adaptation of Balle and Mohri (2015)'s $\mathrm{L}^{*}$ algorithm. The difference is that equivalence queries in the baseline are implemented by breadth-first search, as follows. Let $R$ be a given RNN and $A$ be a WFA being constructed. For each equivalence query, the baseline searches for a word $w$ such that $|f_{A}(w)-f_{R}(w)|>e$ (where $e=0.05$), in the breadth-first manner. If such a word $w$ is found, it is returned as a counterexample. The search is restricted to the first $i+n$ words, where $i$ is the index of the counterexample word found in the previous equivalence query. If no counterexample $w$ is found within this search space, the baseline algorithm deems $A$ to be equivalent to $R$. Obviously, the larger $n$ is, the more counterexample candidates are investigated.
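The breadth-first equivalence query described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `words_in_bfs_order` and `bfs_equivalence_query` are hypothetical, $f_A$ and $f_R$ are passed in as plain Python callables, words are enumerated by length and then in alphabet order, and the search is assumed to restart from the empty word on every query.

```python
from itertools import count, islice, product


def words_in_bfs_order(alphabet):
    """Yield all words over `alphabet`, shortest first (breadth-first order)."""
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def bfs_equivalence_query(f_A, f_R, alphabet, prev_index, n, e=0.05):
    """Search the first prev_index + n words for a counterexample.

    Returns (word, index) if |f_A(word) - f_R(word)| > e for some word in
    the search space, and (None, prev_index) otherwise, in which case the
    caller deems A equivalent to R.
    """
    limit = prev_index + n
    candidates = islice(words_in_bfs_order(alphabet), limit)
    for index, word in enumerate(candidates):
        if abs(f_A(word) - f_R(word)) > e:
            return word, index
    return None, prev_index
```

For instance, with `alphabet = "ab"` the enumeration starts with the empty word, then `a`, `b`, `aa`, `ab`, `ba`, so a disagreement on `ba` is found only if the budget `prev_index + n` is at least 6.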

Target RNNs $R$ trained from WFAs $A^{\bullet}$. Table 1 reports the accuracy of the extracted WFAs $A$, where the target RNNs $R$ are obtained from original WFAs $A^{\bullet}$ in the following manner. Given an alphabet $\Sigma$ of a designated size (the leftmost column), we first generated a WFA $A^{\bullet}$ such that 1) the state-space size $|Q_{A^{\bullet}}|$ is as designated (the leftmost column), and 2) the initial vector $\alpha_{A^{\bullet}}$, the final vector $\beta_{A^{\bullet}}$, and the transition matrices $A^{\bullet}_{\sigma}$ are randomly chosen (with normalization so that the outputs are in $[0,1]$). Then we constructed a dataset $T$ by sampling 9000 words $w$ such that $w\in\Sigma^{*}$ and $|w|\leq 20$; this $T$ was used to train an RNN $R$ for 10 epochs on the input-output pairs $(w, f_{A^{\bullet}}(w))$ for $w\in T$.

A simple way to sample words $w\in T$ is by the uniform distribution. With $T$ covering the input space of the WFA $A^{\bullet}$ uniformly, we expect the resulting RNN $R$ to inherit properties of $A^{\bullet}$ well. The top table in Table 1 reports the results in this “more WFA-like” setting.

However, in many applications, the input domain of the data used for training RNNs is nonuniform, sometimes even excluding some specific patterns. To evaluate our method in such realistic settings, we conducted another set of experiments, whose results are in the bottom of Table 1. Specifically, for training $R$ from $A^{\bullet}$, we used a dataset $T$ that only contains those words $\sigma_{1}\sigma_{2}\dotsc\sigma_{n}$ which satisfy the following condition: if $\sigma_{i}=\sigma_{j}$ ($i<j$), then $\sigma_{i}=\sigma_{i+1}=\dotsc=\sigma_{j-1}=\sigma_{j}$. In other words, all occurrences of a character must be contiguous. For example, for $\Sigma=\{a,b,c\}$, $aabccc$ and $baaccc$ may be in $T$, but $aaba$ may not.
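The condition on words in the nonuniform dataset can be checked with a few lines of Python. The sketch below is only an illustration; the function name `is_allowed` is hypothetical. It uses the observation that the condition holds exactly when each character occurs in a single contiguous block.

```python
from itertools import groupby


def is_allowed(word):
    """Return True if every character of `word` occurs in one contiguous block.

    This is equivalent to the condition: whenever word[i] == word[j] with
    i < j, all characters between positions i and j equal word[i] as well.
    """
    # Collapse each run of equal characters into a single representative.
    blocks = [letter for letter, _run in groupby(word)]
    # A character that starts two different blocks violates the condition.
    return len(blocks) == len(set(blocks))
```

With this check, `is_allowed("aabccc")` and `is_allowed("baaccc")` hold, while `is_allowed("aaba")` does not, matching the examples in the text.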

Evaluation. In order to evaluate accuracy, we calculated the mean squared error (MSE) of the extracted WFA $A$ against the RNN $R$, using a dataset $V$ of words sampled from an appropriate distribution, namely the one used in training the RNN $R$ from $A^{\bullet}$. The dataset $V$ is sampled so that it does not overlap with the training dataset $T$ for $R$.
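As a minimal sketch of this accuracy measure, the MSE over a validation set $V$ averages the squared difference between $f_{A}(w)$ and $f_{R}(w)$. The function name `mean_squared_error` and the representation of $f_{A}$ and $f_{R}$ as Python callables are assumptions made for illustration.

```python
def mean_squared_error(f_A, f_R, V):
    """Average of (f_A(w) - f_R(w))^2 over the words w in V."""
    if not V:
        raise ValueError("the validation set V must not be empty")
    return sum((f_A(w) - f_R(w)) ** 2 for w in V) / len(V)
```

Since Table 1 reports MSEs in units of $10^{-4}$, a value returned by this function would be multiplied by $10^{4}$ before being compared with the table entries.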

Results and Discussions. In the experiments in Table 1, we considered 8 configurations for generating the original WFA $A^{\bullet}$ (the leftmost column). The unit of the MSEs is $10^{-4}$; given also that the outputs of the original WFAs $A^{\bullet}$ are normalized to $[0,1]$, we can say that the MSEs are small enough.

In the top table in Table 1 (the “more WFA-like” setting), $\mathbf{BFS}(5000)$ and $\mathbf{RGR}(5)$ achieved the first- and second-best performance in terms of accuracy, respectively (see the “Total” row). More generally, we can find the trend that the longer an extraction runs, the better it performs. We conjecture the reason to be as follows. Recall that all the RNNs are trained on words sampled from the uniform distribution. This means that all words would be somewhat informative for approximating the RNNs. As a result, the performance is influenced more by the number of counterexamples (i.e., by how long the extraction takes) than by their “qualities.”

The exception to this trend is $\mathbf{RGR}(2)$, which took a longer time but performed worse than $\mathbf{BFS}(3000)$, $\mathbf{BFS}(5000)$, and $\mathbf{RGR}(5)$. In particular, $\mathbf{RGR}(2)$ performed well for smaller alphabets ($|\Sigma|\in\{4,6,10\}$) but not so well when $|\Sigma|=15$. The parameter $M$ in $\mathbf{RGR}(M)$ (i.e., in Algorithm 1) is a threshold that controls how many words are used to investigate each configuration region of a WFA. Thus, we conjecture that too small an $M$ excessively limits the input space to be investigated, which is more critical as the input space gets larger. This eventually biases the counterexamples $h$ (in Algorithm 1), even though the RNNs are trained on the uniform distribution, and makes the refinement of WFAs less effective.

In the bottom table in Table 1 (the “realistic” setting), $\mathbf{RGR}(2)$ performs significantly better than the others (it is the best among all the procedures) in terms of accuracy. This is the case even for a large alphabet ($|\Sigma|=15$). This indicates that, when an RNN is trained with a nonuniform dataset, enlarging the investigated input space with a large $M$ can even degrade accuracy. A possible reason for this degradation is as follows. Some words (such as $aba$) are prohibited in the sample set $T$, and the behaviors of the RNN $R$ for those prohibited words are unexpected. Therefore, those prohibited words should not be useful for refining a WFA. The use of a small $M$ could prevent such meaningless (or even harmful) counterexamples $h$ from being investigated. This discussion raises another question: how can we find an optimal $M$? We leave it as future work.

Let us briefly discuss the sizes of the extracted WFAs. The general trend is that the extracted WFAs $A$ have a few times as many states as the original WFAs $A^{\bullet}$ used in training $R$. For example, in the setting of the top table in Table 1, for $|\Sigma|=15$ and $|Q_{A^{\bullet}}|=20$, the average number of states of the extracted $A$ was 38.2.

4.2 RQ2: Expressivity beyond WFAs

We conducted experiments to examine how well our method works for RNNs modeling languages that cannot be expressed by any WFA. Specifically, we used an RNN that models the following function $\mathtt{wparen}\colon\Sigma^{*}\to[0,1]$ : $\Sigma=\{(,),0,1,...,9\}$ , $\mathtt{wparen}(w)=1-(1/2)^{N}$ if all the parentheses in $w$ are balanced (here $N$ is the depth of the deepest balanced parentheses in $w$ ); and $\mathtt{wparen}(w)=0$ otherwise. This $\mathtt{wparen}$ is a weighted variant of a (non-regular) language of balanced parentheses. For instance, $\mathtt{wparen}(\text{``((3)(7))))''})=0$ , $\mathtt{wparen}(\text{``((3)(7))''})=1-(1/2)^{2}=3/4$ , and $\mathtt{wparen}(\text{``(a)(b)(c)''})=1/2$ .
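For concreteness, the function $\mathtt{wparen}$ can be written directly from the definition above. The sketch below is an illustration rather than the code used in the experiments; it treats every character other than the two parentheses as neutral, which is an assumption consistent with the definition but not stated as an implementation detail in the text.

```python
def wparen(word):
    """Return 1 - (1/2)^N if the parentheses in `word` are balanced, else 0.

    N is the maximum nesting depth of the parentheses in `word`.
    """
    depth = 0
    max_depth = 0
    for ch in word:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis without a matching opening one.
                return 0.0
    if depth != 0:
        # Some opening parenthesis was never closed.
        return 0.0
    return 1.0 - 0.5 ** max_depth
```

Here `wparen("((3)(7))))")` evaluates to `0.0` and `wparen("((3)(7))")` to `0.75`, in agreement with the examples given in the text.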

We trained an RNN $R$ as follows. We generated datasets $T_{\mathrm{good}}$ and $T_{\mathrm{bad}}$ , and trained an RNN $R$ on the set of input-output pairs of $w\in T_{\mathrm{good}}\cup T_{\mathrm{bad}}$ and $\mathtt{wparen}(w)$ . The dataset $T_{\mathrm{good}}$ consists of randomly generated words where all the parentheses are balanced; $T_{\mathrm{bad}}$ is constructed similarly, except that we apply suitable mutation to each word, which most likely makes the parentheses unbalanced. See Appendix B.1 for details.

Fig. 5 shows the WFAs extracted from $R$ . Remarkable observations here are as follows.

• The shapes of the WFAs, obtained by ignoring clearly negligible weights, give rise to NFAs that recognize balanced parentheses up to a certain depth.

• As the parameter $M$ in $\mathbf{RGR}(M)$ grows, the recognizable depth bound grows: depth one with $\mathbf{RGR}(5)$, and depth two with $\mathbf{RGR}(15)$.

We believe these observations demonstrate important features, as well as limitations, of our method. Overall, the extracted WFAs expose interpretable structures hidden in an RNN: the NFA structures in Fig. 5 are human-interpretable (they are easily seen to encode bounded balancedness) and machine-processable (such as determinization and minimization). It is also suggested that the parameter $M$ gives us flexibility in the trade-off of extraction cost and accuracy. At the same time, we can only obtain a truncated and regularized version of the RNN structure—this is an inevitable limitation as long as we use the formalism of WFAs.

We also note that, in each of the two extracted WFAs, the transition matrices $A_{\sigma}$ are similar for all $\sigma\in\{0,1,\dotsc,9\}$ (the entries at the same position have the same order). This is as expected, too, since the function $\mathtt{wparen}$ does not distinguish the characters $0,1,\dotsc,9$ .

4.3 RQ3: Accelerating Inference Time

We conducted experiments about inference time, comparing the original RNNs $R$ and the WFAs $A$ that we extracted from $R$ . We used the same RNNs $R$ and WFAs $A$ as in §4.1, where the latter are extracted using $\mathbf{RGR}(2\text{--}5)$ and $\mathbf{BFS}(500\text{--}5000)$ . We note that the inference of RNNs utilizes GPUs while that of WFAs is solely done by CPUs.

We observed that inference with the extracted WFAs $A$ was about 1,300 times faster than with the target RNNs $R$, taking the average over different settings (Appendix B.3). This demonstrates the potential use of the extracted WFAs as a computationally cheaper surrogate for RNNs. We attribute the acceleration to the following: 1) WFAs use only linear computation while RNNs involve nonlinear ones; and 2) overall, the extracted WFAs are smaller in size. Provided that the accuracy of the extracted WFAs can be high (as we observed in §4.1), we believe the replacement of RNNs by WFAs is a viable option in some application scenarios.
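To illustrate why WFA inference involves only linear computation, the following sketch evaluates a WFA on a word by repeated vector-matrix products. It assumes the common convention $f_{A}(w)=\alpha^{\top}A_{\sigma_{1}}\dotsm A_{\sigma_{n}}\beta$ for $w=\sigma_{1}\dotsc\sigma_{n}$; the function name `wfa_weight` and the plain-list representation of vectors and matrices are hypothetical choices made to keep the example dependency-free.

```python
def wfa_weight(alpha, transitions, beta, word):
    """Compute alpha^T A_{s1} ... A_{sn} beta for word = s1 ... sn.

    alpha, beta: lists of floats of length |Q|.
    transitions: dict mapping each character to a |Q| x |Q| list of lists.
    """
    config = list(alpha)  # current configuration, a row vector
    for sigma in word:
        matrix = transitions[sigma]
        # Row vector times matrix: one linear step per input character.
        config = [
            sum(config[i] * matrix[i][j] for i in range(len(config)))
            for j in range(len(matrix[0]))
        ]
    # The final vector turns the configuration into a real-valued output.
    return sum(c * b for c, b in zip(config, beta))
```

As a small usage example, the two-state WFA with `alpha = [1, 0]`, `beta = [0, 1]`, `A_a = [[1, 1], [0, 1]]` and `A_b` the identity matrix assigns to each word the number of occurrences of `a` in it. Each input character costs one vector-matrix product, with no nonlinear activation involved.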

5 Conclusions and Future Work

We proposed a method that extracts a WFA from an RNN, focusing on RNNs that take a word $w\in\Sigma^{*}$ and return a real value. We used regression to investigate and abstract the internal states of RNNs. We experimentally evaluated our method, comparing its performance with a baseline whose equivalence queries are based on simple breadth-first search.

One direction for future work is a detailed comparison with other methods for model compression. Another is to use machine learning methods to find a counterexample in the equivalence query, such as reinforcement learning (?), adversarial attacks (?), and acquisition functions of GPR. Finally, we need a means to optimize the parameter $M$ of our method for a specific problem. It may also be helpful to extend our method so that the investigated words can be restricted to a fixed language $L\subset\Sigma^{*}$; if $L$ identifies the input space of the training dataset for RNNs, we could avoid investigating the input space on which the RNNs are not trained, and therefore we could seek only “meaningful” counterexamples even when using a large $M$.

6 Acknowledgments

Thanks are due to Mahito Sugiyama and the anonymous reviewers of AAAI for a lot of useful comments. This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JSPS KAKENHI Grant Numbers JP15KT0012, JP18J22498, JP19K20247, JP19K22842, and JST-Mirai Program Grant Number JPMJMI18BA, Japan.

Appendix A Detail of Our WFA Extraction

A.1 On Rank Tolerance

The construction step calculates a minimal WFA that is compatible with the observed data $(w_{1},f_{R}(w_{1})),\dots,(w_{n},f_{R}(w_{n}))\in\Sigma^{*}\times\mathbb{R}$. It relies on rank calculation, which is done by computing the SVD of the matrix and counting the number of nonzero singular values. The threshold used to decide whether a singular value is zero is called the rank tolerance. A small rank tolerance generally yields accurate learning but can cause overfitting on short words and huge errors on long words. A large rank tolerance yields coarse learning but prevents such overfitting. To balance the two, we start from a large initial rank tolerance $\tau$, and if it is too large, we decay it by multiplying it by $r$ ($0<r<1$). We know that the rank tolerance is too large if the equivalence query returns the same counterexample twice, because this means that the counterexample was ignored. Overall, we obtain the WFA Extraction procedure (Algorithm 2).
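The role of the rank tolerance and its decay can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the function names are hypothetical, and the singular values are assumed to have been computed beforehand (for example, by an SVD routine of a numerical library).

```python
def numerical_rank(singular_values, tolerance):
    """Count the singular values treated as nonzero under `tolerance`."""
    return sum(1 for s in singular_values if s > tolerance)


def next_tolerance(tolerance, decay, counterexample, previous_counterexample):
    """Decay the rank tolerance when the same counterexample is returned twice.

    A repeated counterexample indicates that the previous construction step
    ignored it, i.e., that the tolerance was too large.
    """
    if not 0 < decay < 1:
        raise ValueError("decay must satisfy 0 < decay < 1")
    if counterexample is not None and counterexample == previous_counterexample:
        return tolerance * decay
    return tolerance
```

For example, with singular values `[3.0, 0.5, 0.001]`, a tolerance of `0.1` gives rank 2, whereas a tolerance of `0.0005` gives rank 3, so a smaller tolerance leads to a WFA with more states.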

Appendix B Detail of the Experiments

B.1 On Training Data Generation for $\mathtt{wparen}$

We made the training data for $\mathtt{wparen}$ as follows.

1. We make 5000 words of random balanced parentheses made only of $\{(,)\}$. There is a one-to-one correspondence between words of balanced parentheses of length $2n$ and paths from the bottom-left to the top-right in the grid of size $n\times n$ whose bottom-right half is removed, so we can obtain such random words by generating the paths randomly and converting them into words. For example, “(())” or “(()())” can be made.

2. We insert random characters in $\{0,1,\dots,9\}$ into the words generated in Step 1. This generates 5000 random balanced words made of $\{(,),0,1,\dots,9\}$. For example, “(0(1))” or “((12340)())” can be made.

3. We run the same procedure as Step 1 and obtain 5000 words of random balanced parentheses.

4. We mutate the words from Step 3 and make them into 5000 random unbalanced words made only of $\{(,)\}$. The mutation rules are as follows: 1) duplicate a random character; 2) delete a random character; and 3) exchange a random pair of adjacent characters. These rules are applied repeatedly, throwing a fair coin each time, until the coin comes up heads. Note that the mutation can turn a balanced word into another balanced word. For example, “(()”, “((((”, or “()” can be made (only the last one is balanced).

5. We insert random characters in $\{0,1,\dots,9\}$ into the words generated in Step 4. This generates 5000 random unbalanced words made of $\{(,),0,1,\dots,9\}$.

6. We combine the results of Steps 2 and 5 and get 10000 words. About half of the words are balanced and the other half are unbalanced. We pick 9000 random words from them and use them as the training data; the remaining 1000 are used as the test data.
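The mutation and digit-insertion steps (Steps 2, 4, and 5) can be sketched as below. Several details are assumptions made for illustration and are not specified in the text: the names `apply_rule`, `mutate`, and `insert_digits` are hypothetical, the three rules are chosen with equal probability, at least one mutation is attempted before the first coin toss, and the number of inserted digits is a free parameter `count`.

```python
import random

RULES = ("duplicate", "delete", "swap")


def apply_rule(word, rule, i):
    """Apply one mutation rule to `word` at position i."""
    if rule == "duplicate":
        return word[:i] + word[i] + word[i:]
    if rule == "delete":
        return word[:i] + word[i + 1:]
    if rule == "swap":
        # Exchange the adjacent characters at positions i and i + 1.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown rule: {rule}")


def mutate(word, rng):
    """Mutate repeatedly, tossing a fair coin after each attempt."""
    while True:
        if word:
            rule = rng.choice(RULES)
            if rule == "swap":
                if len(word) >= 2:
                    word = apply_rule(word, rule, rng.randrange(len(word) - 1))
            else:
                word = apply_rule(word, rule, rng.randrange(len(word)))
        if rng.random() < 0.5:  # heads: stop mutating
            return word


def insert_digits(word, rng, count):
    """Insert `count` random digits at random positions of `word`."""
    for _ in range(count):
        i = rng.randrange(len(word) + 1)
        word = word[:i] + rng.choice("0123456789") + word[i:]
    return word
```

A call such as `mutate("(())", random.Random(0))` returns a word over the two parentheses only; as noted in Step 4, the result is not guaranteed to be unbalanced.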

B.2 Detailed WFAs Extracted from $\mathtt{wparen}$

The WFA Extracted by $\mathbf{RGR}(5)$

Fig. 6 illustrates the WFA extracted by $\mathbf{RGR}(5)$ from the RNN trained on $\mathtt{wparen}$. The initial and final vectors and the transition matrices are in Figs. 7 and 8.

The WFA Extracted by $\mathbf{RGR}(15)$

Fig. 9 illustrates the WFA extracted by $\mathbf{RGR}(15)$ from the RNN trained on $\mathtt{wparen}$. The initial and final vectors and the transition matrices are in Figs. 10, 11, and 12.

B.3 Inference Time of the Target RNNs and the Extracted WFAs

On average, the inference time of the target RNNs was 29.97519233 milliseconds, while that of the extracted WFAs was 0.023052549 milliseconds. Therefore, on average, the inference of the extracted WFAs was about 1300.298397 times faster than that of the target RNNs.

Bibliography

[Angluin 1987] Angluin, D. 1987. Learning regular sets from queries and counterexamples. Inf. Comput. 75(2):87–106.
[Ayache, Eyraud, and Goudian 2018] Ayache, S.; Eyraud, R.; and Goudian, N. 2018. Explaining black boxes on sequential data using weighted automata. In Unold, O.; Dyrka, W.; and Wieczorek, W., eds., Proc. ICGI 2018, volume 93 of Proceedings of Machine Learning Research, 81–103. PMLR.
[Baier and Katoen 2008] Baier, C., and Katoen, J.-P. 2008. Principles of Model Checking. The MIT Press.
[Balle and Mohri 2015] Balle, B., and Mohri, M. 2015. Learning weighted automata. In Maletti, A., ed., Proc. CAI 2015, volume 9270 of Lecture Notes in Computer Science, 1–21. Springer.
[Balle, Gourdeau, and Panangaden 2017] Balle, B.; Gourdeau, P.; and Panangaden, P. 2017. Bisimulation metrics for weighted automata. In Chatzigiannakis, I.; Indyk, P.; Kuhn, F.; and Muscholl, A., eds., Proc. ICALP 2017, volume 80 of LIPIcs, 103:1–103:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
[Bramer 2013] Bramer, M. 2013. Ensemble Classification. London: Springer London. 209–220.
[Bucila, Caruana, and Niculescu-Mizil 2006] Bucila, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proc. KDD 2006, 535–541.
[Chaudhuri 2019] Chaudhuri, A. 2019. Visual and Text Sentiment Analysis through Hierarchical Deep Learning Networks. Springer Briefs in Computer Science. Springer.
