Computations in Stochastic Acceptors

Karl-Heinz Zimmermann

arXiv:1812.09687·cs.LG·December 27, 2018

TL;DR

This paper introduces dynamic programming algorithms for stochastic acceptors, enabling computation of input marginals, acceptance probabilities, and parameter estimation using EM and Baum-Welch algorithms.

Contribution

It provides novel algorithms for probabilistic automata, including efficient parameter estimation methods, advancing their application in machine learning contexts.

Findings

01 Algorithms for input marginal computation

02 Acceptance probability calculation methods

03 Efficient EM-based parameter estimation

Abstract

Machine learning provides algorithms that can learn from data and make inferences or predictions on data. Stochastic acceptors or probabilistic automata are stochastic automata without output that can model components in machine learning scenarios. In this paper, we provide dynamic programming algorithms for the computation of input marginals and the acceptance probabilities in stochastic acceptors. Furthermore, we specify an algorithm for the parameter estimation of the conditional probabilities using the expectation-maximization technique and a more efficient implementation related to the Baum-Welch algorithm.

Equations

\sum_{s^{\prime}\in S} p(s^{\prime}\mid a,s) = 1.

\hat{p}(s^{\prime}\mid\epsilon,s)=\begin{cases}1&\text{if }s=s^{\prime},\\ 0&\text{if }s\neq s^{\prime},\end{cases}

\hat{p}(s^{\prime}\mid xa,s) = \sum_{t\in S}\hat{p}(t\mid x,s)\cdot p(s^{\prime}\mid a,t).

\sum_{s^{\prime}\in S} p(s^{\prime}\mid x,s) = 1,\quad x\in\Sigma^{*},\ s\in S.

p(s^{\prime}\mid xx^{\prime},s) = \sum_{t\in S} p(t\mid x,s)\cdot p(s^{\prime}\mid x^{\prime},t).

P(\epsilon) = I_{n},

P(xa) = P(x)\cdot P(a).

P(xx^{\prime}) = P(x)\cdot P(x^{\prime}).

P(x) = P(x_{1})\cdots P(x_{k}).

L_{A,\lambda} = \{x\in\Sigma^{*}\mid \pi P(x)f>\lambda\}

P(a)=\begin{pmatrix}1-\frac{a}{p}&\frac{a}{p}\\ 1-\frac{a+1}{p}&\frac{a+1}{p}\end{pmatrix},\quad 0\leq a\leq p-1,\quad \pi=(1,0),\quad\text{and}\quad f=\begin{pmatrix}0\\ 1\end{pmatrix}.

L_{A,\lambda} = \{x_{1}\ldots x_{k}\in\{0,\ldots,p-1\}^{*}\mid 0.x_{k}\ldots x_{1}>\lambda\}.

p_{X,S} = p_{S_{1}}\,p_{X_{1}}\,p_{S_{2}\mid X_{1},S_{1}}\,p_{X_{2}}\,p_{S_{3}\mid X_{2},S_{2}}\cdots p_{X_{n}}\,p_{S_{n+1}\mid X_{n},S_{n}}.

\theta_{s^{\prime};a,s} = p_{S_{i+1}\mid X_{i},S_{i}}(s^{\prime}\mid a,s),\quad s,s^{\prime}\in S,\ a\in\Sigma,\ 1\leq i\leq n.

p_{X,S}(x_{1},\ldots,x_{n},s_{1},\ldots,s_{n+1}) = \frac{1}{l^{\prime n}}\,\pi_{s_{1}}\,\theta_{s_{2};x_{1},s_{1}}\cdots\theta_{s_{n+1};x_{n},s_{n}}.

p_{X}(x) = \sum_{s\in S^{n+1}} p_{X,S}(x,s).

p_{X}(x) =

M[0,s]

M[k,s]

p_{X}(x)

\pi P(x)f =

M[0,s]

M[k,s]

\pi P(x)f

\Theta = \{\theta=(\theta_{s^{\prime};a,s})\mid \theta_{s^{\prime};a,s}\geq 0,\ \sum_{s^{\prime}}\theta_{s^{\prime};a,s}=1\}.

\theta_{s^{\prime};a,s} = p(s^{\prime}\mid a,s),\quad a\in\Sigma,\ s,s^{\prime}\in S.

p_{X,S\mid\Theta}(d_{r}\mid\theta) = \frac{1}{l^{\prime n}}\,\pi_{s_{1}}\prod_{i=1}^{n}\theta_{s_{r,i+1};x_{r,i},s_{r,i}}.

L(\theta)

\sum_{(x,s)} u_{x,s} = N.

L(\theta) = \prod_{a\in\Sigma}\prod_{s,s^{\prime}\in S}\theta_{s^{\prime};a,s}^{v_{s^{\prime};a,s}}.

\ell(\theta) = \log L(\theta) = \sum_{a\in\Sigma}\sum_{s,s^{\prime}\in S} v_{s^{\prime};a,s}\log\theta_{s^{\prime};a,s}.

v = B_{l,l^{\prime}}\cdot u,

Taxonomy

Topics: Machine Learning and Algorithms · semigroups and automata theory · Algorithms and Data Compression

Full text

Computations in Stochastic Acceptors

Karl-Heinz Zimmermann

Department of Electrical Engineering, Computer Science, Mathematics

Hamburg University of Technology

21071 Hamburg, Germany

AMS Subject Classification: 68Q70, 68T05

Keywords: Probabilistic automaton, dynamic programming, parameter estimation, EM algorithm, Baum-Welch algorithm

1 Introduction

The theory of discrete stochastic systems was first studied by Shannon [14] and von Neumann [5]. Shannon considered memoryless communication channels and their generalization by introducing states, while von Neumann investigated the synthesis of reliable systems from unreliable components. The seminal work of Rabin and Scott [9] on deterministic finite-state automata led to two generalizations: first, the generalization of transition functions to conditional distributions, studied by Carlyle [6] and Starke [15]; second, the generalization of regular sets by introducing stochastic acceptors, as described by Rabin [8].

A stochastic acceptor or probabilistic automaton is a stochastic automaton without output [3, 13, 18]. It generalizes the nondeterministic finite automaton by attaching a probability to each transition from one state to another, and in this way also generalizes the concept of a Markov chain. The languages accepted by stochastic acceptors are called stochastic languages. The class of stochastic languages is uncountable and includes the regular languages as a proper subclass.

Stochastic automata have widespread use in the modeling of stochastic systems such as in traffic theory and in spoken language understanding for the recognition and interpretation of speech signals [3, 12, 10]. They can be used as building blocks in situations of machine learning where detailed mathematical description is missing and feature management is noisy. The arrangement of stochastic automata in the form of teams or hierarchies could lead to solutions of complex inference problems [16].

Stochastic acceptors have been generalized to a quantum analog, the quantum finite automaton [4]. The latter are linked to quantum computers as stochastic acceptors are connected to conventional computers.

In this paper, we provide dynamic programming algorithms for the computation of input marginals and the acceptance probabilities in a stochastic acceptor. Moreover, we specify an algorithm for the parameter estimation of the conditional probabilities using the expectation-maximization technique and a variant of the Baum-Welch algorithm. The text is to a large extent self-contained and also suitable for non-experts in this field.

2 Mathematical Preliminaries

A stochastic acceptor (SA) [3, 8, 13] is a quintuple $A=(S,\Sigma,P,\pi,f)$ , where $S$ is a nonempty finite set of states, $\Sigma$ is an alphabet of input symbols, $P$ is a collection $\{P(a)\mid a\in\Sigma\}$ of stochastic $n\times n$ matrices, where $n$ is the number of states, $\pi$ is the initial distribution of the states written as row vector, and $f$ is a binary column vector of length $n$ called final state vector.

Let $S=\{s_{1},\ldots,s_{n}\}$ be the state set. Then the final state vector is $f=(f_{1},\ldots,f_{n})^{t}$ and $F=\{s_{i}\mid f_{i}=1\}$ is the final state set. Moreover, the matrices $P(a)=(p_{ij}(a))$ with $a\in\Sigma$ are transition probability matrices, where the $(i,j)$ th entry $p_{ij}(a)=p(s_{j}\mid a,s_{i})$ is the conditional probability of transition from state $s_{i}$ to state $s_{j}$ when the symbol $a$ is read, $1\leq i,j\leq n$ . Thus for each symbol $a\in\Sigma$ and each state $s\in S$ ,

$$\sum_{s^{\prime}\in S}p(s^{\prime}\mid a,s)=1.$$

Given a conditional probability distribution $p(\cdot\mid a,s)$ on $\Sigma\times S$ , a probability distribution $\hat{p}$ on $\Sigma^{*}\times S$ can be defined recursively as follows.

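• For each $s,s^{\prime}\in S$,

$$\hat{p}(s^{\prime}\mid\epsilon,s)=\begin{cases}1&\text{if }s=s^{\prime},\\ 0&\text{if }s\neq s^{\prime},\end{cases}$$

where $\epsilon$ denotes the empty word in $\Sigma^{*}$.

• For all $s,s^{\prime}\in S$, $a\in\Sigma$, and $x\in\Sigma^{*}$,

$$\hat{p}(s^{\prime}\mid xa,s)=\sum_{t\in S}\hat{p}(t\mid x,s)\cdot p(s^{\prime}\mid a,t).$$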

Then $\hat{p}(\cdot\mid x,s)$ is a conditional probability distribution on $\Sigma^{*}\times S$ and so we have

$$\sum_{s^{\prime}\in S}p(s^{\prime}\mid x,s)=1,\quad x\in\Sigma^{*},\ s\in S.$$

Note that the measures $p$ and $\hat{p}$ coincide on the set $S\times\Sigma\times S$ if we put $x=\epsilon$ in (5). Therefore, we write $p$ instead of $\hat{p}$ .

A stochastic acceptor works serially and synchronously. It reads an input word symbol by symbol and after reading an input symbol it transits into another state. In particular, if the automaton starts in state $s$ and reads the word $x$ , then with probability $p(s^{\prime}\mid x,s)$ it will end in state $s^{\prime}$ taking all intermediate states into account.

Proposition 2.1.

For all $x,x^{\prime}\in\Sigma^{*}$ and $s,s^{\prime}\in S$ ,

$$p(s^{\prime}\mid xx^{\prime},s)=\sum_{t\in S}p(t\mid x,s)\cdot p(s^{\prime}\mid x^{\prime},t).$$

This result can be described by probability matrices. To this end, for the empty word $\epsilon\in\Sigma^{*}$ put

$$P(\epsilon)=I_{n},$$

where $I_{n}$ is the $n\times n$ unit matrix. Furthermore, if $a\in\Sigma$ and $x\in\Sigma^{*}$ , then by (5)

$$P(xa)=P(x)\cdot P(a).$$

By Prop. 2.1 and the associativity of matrix multiplication, we obtain the following

Proposition 2.2.

For all $x,x^{\prime}\in\Sigma^{*}$ ,

$$P(xx^{\prime})=P(x)\cdot P(x^{\prime}).$$

It follows by induction that if $x=x_{1}\ldots x_{k}\in\Sigma^{*}$ , then

$$P(x)=P(x_{1})\cdots P(x_{k}).$$
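The product formula for $P(x)$ can be sketched in plain Python. This is a minimal illustration, not code from the paper: the helper names `mat_mul` and `word_matrix` and the two-state toy acceptor over the alphabet `{"a", "b"}` are assumptions made for the example.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def word_matrix(P, word, n):
    """Return P(x) = P(x_1) ... P(x_k); the empty word yields the unit matrix."""
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for a in word:
        result = mat_mul(result, P[a])
    return result


# Hypothetical two-state acceptor; every row of each matrix sums to one.
half = Fraction(1, 2)
P = {
    "a": [[half, half], [Fraction(0), Fraction(1)]],
    "b": [[Fraction(1), Fraction(0)], [half, half]],
}
P_ab = word_matrix(P, "ab", 2)
```

Exact rational arithmetic is used so that identities such as $P(xx^{\prime})=P(x)\cdot P(x^{\prime})$ can be checked without rounding error.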

Let $A=(S,\Sigma,P,\pi,f)$ be a stochastic acceptor and let $\lambda$ be a real number with $0\leq\lambda\leq 1$ . The set

$$L_{A,\lambda}=\{x\in\Sigma^{*}\mid\pi P(x)f>\lambda\}$$

is the language of $A$ w.r.t. $\lambda$ , and $\lambda$ is the cut point of $L_{A,\lambda}$ .

Example 1.

Let $p\geq 2$ be an integer. Consider the $p$ -adic stochastic acceptor $A=(\{s_{1},s_{2}\},\{0,\ldots,p-1\},P,\pi,f)$ with

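$$P(a)=\begin{pmatrix}1-\frac{a}{p}&\frac{a}{p}\\ 1-\frac{a+1}{p}&\frac{a+1}{p}\end{pmatrix},\quad 0\leq a\leq p-1,\quad\pi=(1,0),\quad\text{and}\quad f=\begin{pmatrix}0\\ 1\end{pmatrix}.$$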

See Fig. 1. Each word $x=x_{1}\ldots x_{k}\in\{0,\ldots,p-1\}^{*}$ can be assigned the real number whose $p$ -adic representation is $0.x_{k}\ldots x_{1}$ . For each cut point $\lambda$ , the accepted language is

$$L_{A,\lambda}=\{x_{1}\ldots x_{k}\in\{0,\ldots,p-1\}^{*}\mid 0.x_{k}\ldots x_{1}>\lambda\}.$$

Note that the language $L_{A,\lambda}$ is regular if and only if the cut point $\lambda$ is rational [3, 8, 9]. $\diamondsuit$
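The claim of Example 1 can be checked numerically. The following sketch builds the matrices $P(a)$ of the $p$-adic acceptor and compares $\pi P(x)f$ with the number $0.x_{k}\ldots x_{1}$ in base $p$. The function names `padic_matrices`, `padic_acceptance`, and `padic_value` are illustrative and do not appear in the paper.

```python
from fractions import Fraction


def padic_matrices(p):
    """Transition matrices P(a), 0 <= a <= p-1, of the p-adic acceptor."""
    return {
        a: [[1 - Fraction(a, p), Fraction(a, p)],
            [1 - Fraction(a + 1, p), Fraction(a + 1, p)]]
        for a in range(p)
    }


def padic_acceptance(p, word):
    """Compute pi P(x) f with pi = (1, 0) and f = (0, 1)^t."""
    P = padic_matrices(p)
    row = [Fraction(1), Fraction(0)]  # holds pi P(x_1) ... P(x_i)
    for a in word:
        row = [sum(row[i] * P[a][i][j] for i in range(2)) for j in range(2)]
    return row[1]  # inner product with f = (0, 1)^t


def padic_value(p, word):
    """The number 0.x_k ... x_1 in base p (the last symbol is most significant)."""
    return sum(Fraction(a, p ** (i + 1)) for i, a in enumerate(reversed(word)))
```

A word $x$ then belongs to $L_{A,\lambda}$ exactly when `padic_acceptance(p, x) > lam` for the chosen cut point `lam`.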

For each input word $x\in\Sigma^{*}$, the stochastic matrix $P(x)$ can be viewed as generating a discrete-time Markov chain. Thus the behavior of a stochastic automaton is an interleaving of Markov chains, each of which corresponds to a single input symbol.

3 Input Marginals and Acceptance Probabilities

The input marginals and the acceptance probabilities can be computed by the technique of dynamic programming [2] using sum-product decomposition.

To see this, let $A=(S,\Sigma,P,\pi,f)$ be a stochastic acceptor with $l$ -element state set $S$ and $l^{\prime}$ -element input set $\Sigma$ . A stochastic acceptor can be viewed as a belief network. To this end, let $n\geq 1$ be an integer. Let $X_{1},\ldots,X_{n}$ be random variables with common state set $\Sigma$ and let $S_{1},\ldots,S_{n+1}$ be random variables with common state set $S$ . The stochastic acceptor can be described for inputs of length $n$ by the belief network [1, 11, 18] as shown in Fig. 2. Then the corresponding joint probability distribution factoring according to the network is given by

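$$p_{X,S}=p_{S_{1}}\,p_{X_{1}}\,p_{S_{2}\mid X_{1},S_{1}}\,p_{X_{2}}\,p_{S_{3}\mid X_{2},S_{2}}\cdots p_{X_{n}}\,p_{S_{n+1}\mid X_{n},S_{n}}.$$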

We assume for simplicity that the initial distributions $p_{X_{i}}$ are uniform; i.e., $p_{X_{i}}(x)=\frac{1}{l^{\prime}}$ for all $x\in\Sigma$ and $1\leq i\leq n$ . Moreover, the network is assumed to be homogeneous in the sense that the conditional distributions $p_{S_{i+1}|X_{i},S_{i}}$ are independent of the index $i$ , $1\leq i\leq n$ . Therefore, we put

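$$\theta_{s^{\prime};a,s}=p_{S_{i+1}\mid X_{i},S_{i}}(s^{\prime}\mid a,s),\quad s,s^{\prime}\in S,\ a\in\Sigma,\ 1\leq i\leq n.$$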

It follows that the joint probability distribution has the form

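$$p_{X,S}(x_{1},\ldots,x_{n},s_{1},\ldots,s_{n+1})=\frac{1}{l^{\prime n}}\,\pi_{s_{1}}\,\theta_{s_{2};x_{1},s_{1}}\cdots\theta_{s_{n+1};x_{n},s_{n}}.$$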

The probability of an input sequence $x=x_{1}\ldots x_{n}\in\Sigma^{n}$ is given by the marginal distribution

$$p_{X}(x)=\sum_{s\in S^{n+1}}p_{X,S}(x,s).$$

The corresponding sum-product decomposition yields

[TABLE]

According to this decomposition, the marginal probability $p_{X}(x)$ can be calculated by using an $n\times l$ table $M$ :

[TABLE]

The time complexity of this algorithm is $O(l^{2}n)$, since the table $M$ has size $O(ln)$ and each table entry is computed in $O(l)$ steps. The marginal probabilities $p_{X}(x)$ will be used in the EM and Baum-Welch algorithms later on.
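A table-based computation of the input marginal might look as follows. This is a sketch under stated assumptions: the table is filled forward from the initial distribution, with row $0$ holding $\pi$ and one further row per input symbol, and the orientation and indexing of the paper's table $M$ may differ. The name `input_marginal`, the layout `theta[a][s][t]` for $p(t\mid a,s)$, and the example data are illustrative.

```python
from fractions import Fraction


def input_marginal(theta, pi, word, num_symbols):
    """Compute p_X(x) for uniform inputs by filling a table M row by row.

    theta[a][s][t] is p(t | a, s) and pi[s] is the initial probability of s.
    """
    l = len(pi)
    M = [list(pi)]  # M[0][s] = pi_s
    for a in word:
        prev = M[-1]
        # M[k][t] = sum_s M[k-1][s] * theta[a][s][t], i.e. O(l) work per entry
        M.append([sum(prev[s] * theta[a][s][t] for s in range(l))
                  for t in range(l)])
    # sum out the last state and apply the uniform input factor 1 / l'^n
    return sum(M[-1]) / num_symbols ** len(word)


# Hypothetical acceptor with two states and the input symbols 0 and 1.
theta = {
    0: [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    1: [[Fraction(1), Fraction(0)], [Fraction(1, 3), Fraction(2, 3)]],
}
pi = [Fraction(1, 3), Fraction(2, 3)]
marginal = input_marginal(theta, pi, [0, 1, 1], num_symbols=2)
```

Because every row of each transition matrix sums to one, the result under the uniform-input assumption should equal $1/l^{\prime n}$, which serves as a quick sanity check of the table.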

On the other hand, the acceptance probability of an input sequence $x=x_{1}\ldots x_{n}\in\Sigma^{n}$ is given by the sum-product decomposition

[TABLE]

This decomposition can be used to compute the acceptance probability by using an $n\times l$ table $M$ :

[TABLE]

Similarly, the time complexity of this algorithm is $O(l^{2}n)$ , since the table $M$ has size $O(ln)$ and each table entry is computed in $O(l)$ steps.
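The acceptance probability $\pi P(x)f$ can be obtained with the same kind of table. The sketch below assumes a backward sweep that starts from the final state vector $f$ and ends with an inner product against $\pi$; the paper's table may be oriented the other way. The names `acceptance_probability` and `accepts` are illustrative, and the example data are the matrices of the $2$-adic acceptor from Example 1.

```python
from fractions import Fraction


def acceptance_probability(P, pi, f, word):
    """Compute pi P(x) f by a backward sweep over the input word."""
    l = len(pi)
    M = [list(f)]  # last row of the table: the final state vector
    for a in reversed(word):
        nxt = M[0]
        # entry for state s: sum_t P(a)[s][t] * (value of t in the next row)
        M.insert(0, [sum(P[a][s][t] * nxt[t] for t in range(l))
                     for s in range(l)])
    return sum(pi[s] * M[0][s] for s in range(l))


def accepts(P, pi, f, word, cut_point):
    """Membership test for L_{A,lambda}: strict comparison with the cut point."""
    return acceptance_probability(P, pi, f, word) > cut_point


# Matrices of the 2-adic acceptor (p = 2 in Example 1).
half = Fraction(1, 2)
P = {
    0: [[Fraction(1), Fraction(0)], [half, half]],
    1: [[half, half], [Fraction(0), Fraction(1)]],
}
pi = [Fraction(1), Fraction(0)]
f = [0, 1]
prob = acceptance_probability(P, pi, f, [1, 0, 1])
```

Each of the $n+1$ rows has $l$ entries and every entry costs $O(l)$ operations, in line with the $O(l^{2}n)$ bound stated above.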

4 Parameter Estimation

The objective is to estimate the conditional probabilities of a stochastic acceptor by using sample data. To this end, the stochastic acceptor is viewed as a belief network as described in the previous section. Let $A=(S,\Sigma,P,\pi,f)$ be a stochastic acceptor with $l=|S|$ and $l^{\prime}=|\Sigma|$, and let $n\geq 1$. Take the parameter set

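$$\Theta=\{\theta=(\theta_{s^{\prime};a,s})\mid\theta_{s^{\prime};a,s}\geq 0,\ \sum_{s^{\prime}}\theta_{s^{\prime};a,s}=1\},$$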

where

$$\theta_{s^{\prime};a,s}=p(s^{\prime}\mid a,s),\quad a\in\Sigma,\ s,s^{\prime}\in S.$$

The aim is to estimate these probabilities by making use of a sample set. For this, assume that there is a collection $D=(d_{1},\ldots,d_{N})$ of $N$ independent samples called the database, where $d_{r}=(x_{r},s_{r})\in\Sigma^{n}\times S^{n+1}$ denotes the $r$-th sample, $1\leq r\leq N$. For simplicity, suppose the initial distributions $p_{X_{i}}$ are uniform as before, $1\leq i\leq n$. Then the joint probability of the sample $d_{r}=(x_{r},s_{r})$ depending on the parameters is given by

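$$p_{X,S\mid\Theta}(d_{r}\mid\theta)=\frac{1}{l^{\prime n}}\,\pi_{s_{1}}\prod_{i=1}^{n}\theta_{s_{r,i+1};x_{r,i},s_{r,i}}.$$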

Thus the likelihood function $L=L_{X,S}$ is given by

[TABLE]

where $u_{x,s}$ is the number of times the input-state pair $(x,s)$ is observed in the sample set. Therefore, we have

$$\sum_{(x,s)}u_{x,s}=N.$$

Let $v_{s^{\prime};a,s}$ be the number of times the parameter $\theta_{s^{\prime};a,s}$ occurs in the likelihood function $L(\theta)$ . Then the likelihood function can be written (up to a constant) as

$$L(\theta)=\prod_{a\in\Sigma}\prod_{s,s^{\prime}\in S}\theta_{s^{\prime};a,s}^{v_{s^{\prime};a,s}}.$$

The corresponding log-likelihood function $\ell=\ell_{X,S}$ is

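$$\ell(\theta)=\log L(\theta)=\sum_{a\in\Sigma}\sum_{s,s^{\prime}\in S}v_{s^{\prime};a,s}\log\theta_{s^{\prime};a,s}.$$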

The data $v=(v_{s^{\prime};a,s})$ form the sufficient statistic of the model. These data can be obtained from the given data $u=(u_{x,s})$ by the linear transformation

$$v=B_{l,l^{\prime}}\cdot u,$$

where $B=B_{l,l^{\prime}}$ is an integral matrix with $d=l^{2}l^{\prime}$ rows labeled by the triples $(s^{\prime};a,s)$ with $a\in\Sigma$ and $s,s^{\prime}\in S$ . Moreover, the matrix has $m=l^{\prime n}l^{n+1}$ columns labeled by the pairs $(x,s)\in\Sigma^{n}\times S^{n+1}$ . The matrix has entry $k$ in row $(s^{\prime};a,s)$ and column $(x,s)$ if the parameter $\theta_{s^{\prime};a,s}$ occurs $k$ times in $p_{X,S|\Theta}(x,s)$ . Note that the matrix has column sum $n$ , since the quantity $p_{X,S|\Theta}(x,s)$ has $n$ factors.
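In practice the product $v=B\cdot u$ need not be formed explicitly, since $v_{s^{\prime};a,s}$ simply counts how often the transition $s\to s^{\prime}$ under input $a$ occurs across all samples. A minimal sketch of this counting, with the illustrative name `sufficient_statistic` and made-up sample data, is:

```python
from collections import Counter


def sufficient_statistic(samples):
    """Count transitions (s', a, s) over fully observed samples (x, s).

    Each sample pairs an input sequence x of length n with a state
    sequence s of length n + 1; the result plays the role of v = B u.
    """
    v = Counter()
    for x, states in samples:
        if len(states) != len(x) + 1:
            raise ValueError("state sequence must be one longer than the input")
        for i, a in enumerate(x):
            v[(states[i + 1], a, states[i])] += 1
    return v


# Hypothetical database with N = 3 samples of length n = 2 over states a, b.
samples = [("01", "aba"), ("01", "aba"), ("11", "abb")]
v = sufficient_statistic(samples)
```

Every sample contributes exactly $n$ transitions, which mirrors the remark that each column of $B$ sums to $n$; here the counts add up to $n\cdot N=6$.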

Example 2.

Consider the 2-adic stochastic acceptor $A$ with state set $S=\{a,b\}$ and input set $\Sigma=\{0,1\}$ , and let $n=2$ . The associated $8\times 32$ matrix $B=B_{2,2}$ is as follows,

[TABLE]

$\diamondsuit$

Proposition 4.1.

The maximum likelihood estimate of the likelihood function $L(\theta)$ is given by

[TABLE]

Proof.

Let $S=\{s_{1},\ldots,s_{l}\}$ and $\Sigma=\{a_{1},\ldots,a_{l^{\prime}}\}$ . For each input-state pair $(a_{i},s_{j})$ , $1\leq i\leq l^{\prime}$ , $1\leq j\leq l$ , we have

[TABLE]

The parameters $\theta_{s_{m};a_{i}s_{j}}$ with $1\leq m\leq l$ appear in the log-likelihood function $\ell(\theta)$ as the partial sum

[TABLE]

Using $\theta_{s_{l};a_{i},s_{j}}=1-\sum_{s_{m}\neq s_{l}}\theta_{s_{m};a_{i},s_{j}}$ , the partial derivative of $\ell_{i,j}$ with respect to $\theta_{s_{m};a_{i},s_{j}}$ becomes

[TABLE]

Equating this expression to 0 gives $\hat{\theta}_{s_{m};a_{i},s_{j}}$ as claimed. Thus the vector $\hat{\theta}=(\hat{\theta}_{s_{m};a_{i},s_{j}})$ is a critical point of the likelihood function.

We claim that this point maximizes the likelihood function; the proof idea goes back to Koski et al. [11]. Indeed, let $H(\theta)=-\sum_{i=1}^{n}\theta_{i}\log\theta_{i}$ denote the entropy of a probability distribution $\theta=(\theta_{1},\ldots,\theta_{n})$ and let $D(\theta\|\theta^{\prime})=\sum_{i=1}^{n}\theta_{i}\log\left(\frac{\theta_{i}}{\theta^{\prime}_{i}}\right)$ denote the Kullback-Leibler measure between two probability distributions $\theta=(\theta_{1},\ldots,\theta_{n})$ and $\theta^{\prime}=(\theta^{\prime}_{1},\ldots,\theta^{\prime}_{n})$. Then we have

[TABLE]

where $v_{a,s}=\sum_{s^{\prime\prime}\in S}v_{s^{\prime\prime};a,s}$ , $\theta_{a,s}=(\theta_{s^{\prime};a,s})$ and $\hat{\theta}_{a,s}=(\hat{\theta}_{s^{\prime};a,s})$ for each input-state pair $(a,s)$ . Since the Kullback-Leibler measure is always non-negative [11], we obtain

[TABLE]

This proves the claim and the result follows. ∎
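Assuming the estimate of Prop. 4.1 takes the usual relative-frequency form $\hat{\theta}_{s^{\prime};a,s}=v_{s^{\prime};a,s}/v_{a,s}$, with $v_{a,s}$ as defined in the proof, the M-step reduces to a normalization of the transition counts. The sketch below uses the illustrative name `maximum_likelihood_estimate` and made-up counts; input-state pairs $(a,s)$ that never occur are simply left out, since the estimate is undefined for them.

```python
from collections import defaultdict
from fractions import Fraction


def maximum_likelihood_estimate(v):
    """Normalize counts v[(s', a, s)] over s' for each input-state pair (a, s)."""
    totals = defaultdict(int)
    for (s_next, a, s), count in v.items():
        totals[(a, s)] += count  # v_{a,s} = sum over s'' of v_{s'';a,s}
    return {
        (s_next, a, s): Fraction(count, totals[(a, s)])
        for (s_next, a, s), count in v.items()
        if totals[(a, s)] > 0
    }


# Hypothetical counts for a two-state acceptor with states "a" and "b".
v = {("b", "0", "a"): 2, ("a", "0", "a"): 6, ("b", "1", "b"): 1}
theta_hat = maximum_likelihood_estimate(v)
```

By construction the estimates for a fixed pair $(a,s)$ sum to one, so $\hat{\theta}$ lies in the parameter set $\Theta$.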

A stochastic acceptor is an abstract machine with an input interface. Therefore, suppose the sample data consist only of the input sequences, while the observer has no access to the state sequences. This problem can be tackled by the expectation-maximization (EM) algorithm, an iterative method for finding maximum likelihood or maximum a posteriori estimates of parameters in a statistical model with unobserved latent variables.

The aim is to estimate these probabilities by making use of a sample set. For this, let $A=(S,\Sigma,P,\pi,f)$ be a stochastic acceptor in the above setting and let $n\geq 1$. We assume that there is a collection $D=(d_{1},\ldots,d_{N})$ of $N$ independent samples called the database, where $d_{r}=x_{r}\in\Sigma^{n}$ denotes the $r$-th input sample, $1\leq r\leq N$. Then the probability of the sample $d_{r}$ depending on the parameters is given by the marginal distribution

[TABLE]

The likelihood function $L=L_{X}$ is given by

[TABLE]

and the log-likelihood function $\ell=\ell_{X}$ is

[TABLE]

where $u_{x}$ is the number of times the input sequence $x$ is observed in the sample set. Therefore, we have

[TABLE]

A version of the EM algorithm for stochastic acceptors is given by Alg. 1. Note that in the E-step, the marginal probabilities $p_{X}(x|\theta)$ can be efficiently computed by the sum-product decomposition (3). In the M-step, the maximum likelihood estimate $\hat{\theta}$ can be calculated directly by using Prop. 4.1. In the compare step, it can be shown that the inequality $\ell_{X}(\hat{\theta})\geq\ell_{X}(\theta)$ always holds [7, 17].

The structure of stochastic acceptors allows a more efficient implementation of the EM algorithm which amounts to a variant of the Baum-Welch algorithm [7, 18]. To see this, let $n\geq 1$ be an integer. Let $u=(u_{x})\in{\mathbb{N}}^{l^{\prime n}}$ be a data vector, where $u_{x}$ is the number of times the input sequence $x\in\Sigma^{n}$ is observed in the sample set. The full data vector $U=(u_{x,s})\in{\mathbb{N}}^{l^{\prime n}\times l^{n+1}}$ is not available, where $u_{x,s}$ denotes the number of times the pair $(x,s)\in\Sigma^{n}\times S^{n+1}$ is observed. The EM algorithm estimates in the E-step the counts of the full data vector by the quantity

[TABLE]

These counts provide the sufficient statistic $v$ of the model and are used in the M-step to obtain updated parameter values based on the solution of the maximum likelihood problem in Prop. 4.1. The expected values of the sufficient statistic $v$ can be written in a way that leads to a more efficient implementation of the EM algorithm using dynamic programming.
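To make the cost of the plain E-step concrete, the sketch below spreads the observed count $u_{x}$ over all $l^{n+1}$ state sequences in proportion to their joint probability. It assumes the standard form $u_{x,s}=u_{x}\cdot p(x,s\mid\theta)/p(x\mid\theta)$ for the expected counts, which may differ in presentation from the paper's formula. The name `naive_e_step`, the layout `theta[a][s][t]` for $p(t\mid a,s)$, and the example parameters are illustrative.

```python
from fractions import Fraction
from itertools import product


def naive_e_step(theta, pi, word, u_x):
    """Expected counts u_{x,s} for every state sequence s in S^{n+1}."""
    l = len(pi)
    joint = {}
    for states in product(range(l), repeat=len(word) + 1):
        prob = pi[states[0]]
        for i, a in enumerate(word):
            prob = prob * theta[a][states[i]][states[i + 1]]
        joint[states] = prob  # the factor 1 / l'^n cancels in the ratio below
    total = sum(joint.values())
    return {states: u_x * prob / total for states, prob in joint.items()}


# Hypothetical parameters for two states and the input symbols 0 and 1.
theta = {
    0: [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    1: [[Fraction(1), Fraction(0)], [Fraction(1, 3), Fraction(2, 3)]],
}
pi = [Fraction(1, 2), Fraction(1, 2)]
expected = naive_e_step(theta, pi, (0, 1), u_x=5)
```

The dictionary has $l^{n+1}$ entries per input sequence, which is the exponential bookkeeping that the forward and backward probabilities are meant to avoid.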

For this, we introduce so-called forward and backward probabilities. The forward probability

[TABLE]

where $s\in S$ and $1\leq i\leq n$ , is the joint probability that the prefix $x_{1}\ldots x_{i}$ of the observed input sequence $x\in\Sigma^{n}$ having length $i$ ends in state $s$ . For simplicity, assume that the initial distribution of $S_{1}$ is uniform; i.e., $p_{S_{1}}(s)=\frac{1}{l}$ for all $s\in S$ . Then we put $f_{x,s}(0)=\frac{1}{l\cdot l^{\prime n}}$ .

The backward probability

[TABLE]

where $s\in S$ and $0\leq i\leq n-1$ , is the conditional probability that the suffix $x_{i+1}\ldots x_{n}$ of the observed input sequence $x\in\Sigma^{n}$ having length $n-i$ starts in state $s$ .

The marginal probability $p_{X|\Theta}(x|\theta)$ of the observed input sequence $x\in\Sigma^{n}$ can be calculated based on the forward probabilities,

[TABLE]

Note that the forward and backward probabilities can be recursively computed. To see this, consider for the input sequence $x\in\Sigma^{n}$ the $l\times n$ matrices $F_{x}=(f_{x,s}(i))_{s,i}$ and $B_{x}=(b_{x,s}(i))_{s,i}$ corresponding to the forward and backward probabilities, respectively. The entries of the matrices $F_{x}$ and $B_{x}$ can be efficiently calculated in an iterative manner,

[TABLE]

and

[TABLE]
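The two recursions can be sketched as follows. The index conventions here are assumptions and may differ from the paper's: `F[i][s]` is the probability of the input sequence's uniform factor together with reaching state `s` after `i` symbols, so that `F[0][s]` equals $\pi_{s}/l^{\prime n}$, and `B[i][s]` is the total probability of all continuations from state `s` after `i` symbols. The function names and example data are illustrative.

```python
from fractions import Fraction


def forward_table(theta, pi, word, num_symbols):
    """F[i][s] for i = 0..n, starting from F[0][s] = pi_s / l'^n."""
    l = len(pi)
    scale = Fraction(1, num_symbols ** len(word))
    F = [[scale * pi[s] for s in range(l)]]
    for a in word:
        prev = F[-1]
        F.append([sum(prev[s] * theta[a][s][t] for s in range(l))
                  for t in range(l)])
    return F


def backward_table(theta, word, l):
    """B[i][s] for i = 0..n, starting from B[n][s] = 1."""
    B = [[Fraction(1)] * l]
    for a in reversed(word):
        nxt = B[0]
        B.insert(0, [sum(theta[a][s][t] * nxt[t] for t in range(l))
                     for s in range(l)])
    return B


# Hypothetical parameters for two states and the input symbols 0 and 1.
theta = {
    0: [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    1: [[Fraction(1), Fraction(0)], [Fraction(1, 3), Fraction(2, 3)]],
}
pi = [Fraction(1, 2), Fraction(1, 2)]
word = [0, 1, 1]
F = forward_table(theta, pi, word, num_symbols=2)
B = backward_table(theta, word, l=2)
```

With these conventions, $\sum_{s}F[i][s]\,B[i][s]$ gives the marginal probability of the input sequence for every position $i$, which is a convenient consistency check on both tables.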

Proposition 4.2.

In view of the sufficient statistic $v$ , we have for all $s,s^{\prime}\in S$ and $a\in\Sigma$ ,

[TABLE]

Proof.

Let $I_{A}$ denote the indicator function of a proposition $A$ ; i.e., $I_{A}=1$ if $A$ is true and $I_{A}=0$ otherwise. For each state sequence $\sigma\in S^{n+1}$ , we have

[TABLE]

Thus in view of (34), we obtain

[TABLE]

The innermost term is the sum of all probabilities of pairs $(x,\sigma)$ for an input sequence $x$ and all state sequences $\sigma$ such that $\sigma_{i}\sigma_{i+1}=ss^{\prime}$ and $x_{i}=a$ . That is, observing the input sequence $x$ and a transition from state $s$ to state $s^{\prime}$ at position $i$ with $x_{i}=a$ . Thus we have

[TABLE]

The result follows. ∎

The proposition shows that the calculation of the forward and backward probability matrices yields directly the sufficient statistic without the need to estimate the counts $U=(u_{x,s})$. This amounts to the Baum-Welch algorithm (Alg. 2). On the other hand, the EM algorithm requires maintaining the $l^{\prime n}\times l^{n+1}$ data set $U=(u_{x,s})$ from which the sufficient statistic can be established.
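One iteration in the spirit of this procedure might be sketched as below. It is not a transcription of Alg. 2: the function name `baum_welch_step`, the dictionary layouts, and the index conventions of the forward and backward tables are assumptions made for illustration. The expected transition counts are accumulated directly from the tables and then normalized per input-state pair.

```python
from fractions import Fraction


def baum_welch_step(theta, pi, counts, num_symbols):
    """One update of theta[a][s][t] = p(t | a, s) from input counts u_x."""
    l = len(pi)
    v = {}  # expected counts, keyed by (a, s, t) for a transition s -> t under a
    for word, u_x in counts.items():
        n = len(word)
        F = [[Fraction(1, num_symbols ** n) * pi[s] for s in range(l)]]
        for a in word:  # forward sweep
            F.append([sum(F[-1][s] * theta[a][s][t] for s in range(l))
                      for t in range(l)])
        B = [[Fraction(1)] * l]
        for a in reversed(word):  # backward sweep
            B.insert(0, [sum(theta[a][s][t] * B[0][t] for t in range(l))
                         for s in range(l)])
        p_x = sum(F[n])  # marginal probability of the input sequence
        for i, a in enumerate(word):
            for s in range(l):
                for t in range(l):
                    weight = F[i][s] * theta[a][s][t] * B[i + 1][t] / p_x
                    v[(a, s, t)] = v.get((a, s, t), 0) + u_x * weight
    new_theta = {a: [row[:] for row in theta[a]] for a in theta}
    for a in theta:  # M-step: normalize per pair (a, s); keep unseen rows
        for s in range(l):
            total = sum(v.get((a, s, t), 0) for t in range(l))
            if total:
                new_theta[a][s] = [v.get((a, s, t), 0) / total for t in range(l)]
    return new_theta, v


# Hypothetical parameters and input counts u_x for sequences of length 2.
theta = {
    0: [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    1: [[Fraction(1), Fraction(0)], [Fraction(1, 3), Fraction(2, 3)]],
}
pi = [Fraction(1, 2), Fraction(1, 2)]
counts = {(0, 1): 3, (1, 1): 2}
new_theta, v = baum_welch_step(theta, pi, counts, num_symbols=2)
```

The memory footprint per input sequence is two tables of size $(n+1)\times l$, and the expected counts add up to $n$ times the number of observed sequences.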

Bibliography

[1] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge Univ. Press, Cambridge (2012).
[2] R. Bellman, Dynamic Programming, Dover Publications, Mineola, N.Y. (2003).
[3] V. Claus, Stochastische Automaten, Teubner, Stuttgart (1971).
[4] A. Kondacs, J. Watrous, On the power of quantum finite state automata, Proc. 38th Annual Symposium on Foundations of Computer Science (1997), 66-75. http://dx.doi.org/10.1109/SFCS.1997.646094
[5] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components, in: Automata Studies, C. Shannon and J. McCarthy (eds), Annals of Mathematics Studies, 34, Princeton Univ. Press, Princeton, NJ (1956). http://dx.doi.org/10.1515/9781400882618-003
[6] J. W. Carlyle, Reduced forms for stochastic sequential machines, Journal of Mathematical Analysis and Applications, 7, No. 2 (1963), 167-175. http://dx.doi.org/10.1016/0022-247X(63)90045-3
[7] L. Pachter, B. Sturmfels, Algebraic Statistics for Computational Biology, Cambridge Univ. Press, Cambridge (2005).
[8] M. O. Rabin, Probabilistic automata, Information and Control, 6, No. 3 (1963), 230-245. http://dx.doi.org/10.1016/S0019-9958(63)90290-0
