Computations in Stochastic Acceptors
Karl-Heinz Zimmermann

TL;DR
This paper introduces dynamic programming algorithms for stochastic acceptors, enabling computation of input marginals, acceptance probabilities, and parameter estimation using EM and Baum-Welch algorithms.
Contribution
It provides novel algorithms for probabilistic automata, including efficient parameter estimation methods, advancing their application in machine learning contexts.
Findings
Algorithms for input marginal computation
Acceptance probability calculation methods
Efficient EM-based parameter estimation
Abstract
Machine learning provides algorithms that can learn from data and make inferences or predictions on data. Stochastic acceptors or probabilistic automata are stochastic automata without output that can model components in machine learning scenarios. In this paper, we provide dynamic programming algorithms for the computation of input marginals and the acceptance probabilities in stochastic acceptors. Furthermore, we specify an algorithm for the parameter estimation of the conditional probabilities using the expectation-maximization technique and a more efficient implementation related to the Baum-Welch algorithm.
Taxonomy
Topics: Machine Learning and Algorithms · semigroups and automata theory · Algorithms and Data Compression
Computations in Stochastic Acceptors
Karl-Heinz Zimmermann
Department of Electrical Engineering, Computer Science, Mathematics
Hamburg University of Technology
21071 Hamburg, Germany
AMS Subject Classification: 68Q70, 68T05
Keywords: Probabilistic automaton, dynamic programming, parameter estimation, EM algorithm, Baum-Welch algorithm
1 Introduction
The theory of discrete stochastic systems was first studied by Shannon [14] and von Neumann [5]. Shannon considered memoryless communication channels and their generalization by introducing states, while von Neumann investigated the synthesis of reliable systems from unreliable components. The seminal work of Rabin and Scott [9] on deterministic finite-state automata led to two generalizations: first, the generalization of transition functions to conditional distributions, studied by Carlyle [6] and Starke [15]; second, the generalization of regular sets by introducing stochastic acceptors, as described by Rabin [8].
A stochastic acceptor or probabilistic automaton is a stochastic automaton without output [3, 13, 18]. It generalizes the nondeterministic finite automaton by attaching probabilities to the transitions from one state to another, and in this way it also generalizes the concept of a Markov chain. The languages accepted by stochastic acceptors are called stochastic languages. The class of stochastic languages is uncountable and includes the regular languages as a proper subclass.
Stochastic automata have widespread use in the modeling of stochastic systems such as in traffic theory and in spoken language understanding for the recognition and interpretation of speech signals [3, 12, 10]. They can be used as building blocks in situations of machine learning where detailed mathematical description is missing and feature management is noisy. The arrangement of stochastic automata in the form of teams or hierarchies could lead to solutions of complex inference problems [16].
Stochastic acceptors have been generalized to a quantum analog, the quantum finite automaton [4]. Quantum finite automata are related to quantum computers in the same way that stochastic acceptors are related to conventional computers.
In this paper, we provide dynamic programming algorithms for the computation of input marginals and the acceptance probabilities in a stochastic acceptor. Moreover, we specify an algorithm for the parameter estimation of the conditional probabilities using the expectation-maximization technique and a variant of the Baum-Welch algorithm. The text is largely self-contained and is also accessible to non-experts in this field.
2 Mathematical Preliminaries
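A stochastic acceptor (SA) [3, 8, 13] is a quintuple $A = (S, \Sigma, P, \pi, f)$, where $S$ is a nonempty finite set of states, $\Sigma$ is an alphabet of input symbols, $P = \{P(a) \mid a \in \Sigma\}$ is a collection of $n \times n$ stochastic matrices, where $n$ is the number of states, $\pi$ is the initial distribution of the states written as a row vector, and $f$ is a binary column vector of length $n$ called the final state vector.
Let $S = \{s_1, \ldots, s_n\}$ be the state set. Then the final state vector is $f = (f_1, \ldots, f_n)^T$ and $F = \{s_i \mid f_i = 1\}$ is the final state set. Moreover, the matrices $P(a)$ with $a \in \Sigma$ are transition probability matrices, where the $(i,j)$th entry $p_{ij}(a) = p(s_j \mid a, s_i)$ is the conditional probability of transition from state $s_i$ to state $s_j$ when the symbol $a$ is read, $1 \le i, j \le n$. Thus for each symbol $a \in \Sigma$ and each state $s \in S$,
$$\sum_{s' \in S} p(s' \mid a, s) = 1.$$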
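Given a conditional probability distribution $p(\cdot \mid a, s)$ on $S$, a probability distribution $\hat{p}(\cdot \mid x, s)$ on $S$ can be defined recursively as follows.
- For each $s, s' \in S$,
$$\hat{p}(s' \mid \epsilon, s) = \begin{cases} 1 & \text{if } s = s', \\ 0 & \text{otherwise,} \end{cases}$$
where $\epsilon$ denotes the empty word in $\Sigma^*$.
- For all $s, s' \in S$, $a \in \Sigma$, and $x \in \Sigma^*$,
$$\hat{p}(s' \mid xa, s) = \sum_{t \in S} \hat{p}(t \mid x, s)\, p(s' \mid a, t).$$
Then $\hat{p}(\cdot \mid x, s)$ is a conditional probability distribution on $S$ and so we have
$$\sum_{s' \in S} \hat{p}(s' \mid x, s) = 1.$$
Note that the measures $p$ and $\hat{p}$ coincide on the set $S \times \Sigma \times S$ if we put $x = \epsilon$ in (5). Therefore, we write $p$ instead of $\hat{p}$.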
A stochastic acceptor works serially and synchronously. It reads an input word symbol by symbol, and after reading an input symbol it transits into another state. In particular, if the automaton starts in state $s$ and reads the word $x$, then with probability $p(s' \mid x, s)$ it will end in state $s'$, taking all intermediate states into account.
Proposition 2.1.
For all $x, y \in \Sigma^*$ and $s, s' \in S$,
$$p(s' \mid xy, s) = \sum_{t \in S} p(t \mid x, s)\, p(s' \mid y, t).$$
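This result can be described by probability matrices. To this end, for the empty word $\epsilon$ put
$$P(\epsilon) = I_n,$$
where $I_n$ is the $n \times n$ unit matrix. Furthermore, if $a \in \Sigma$ and $x \in \Sigma^*$, then by (5)
$$P(xa) = P(x)\, P(a).$$
By Prop. 2.1 and the associativity of matrix multiplication, we obtain the following result.
Proposition 2.2.
For all $x, y \in \Sigma^*$,
$$P(xy) = P(x)\, P(y).$$
It follows by induction that if $x = x_1 \cdots x_k \in \Sigma^*$, then
$$P(x) = P(x_1) \cdots P(x_k).$$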
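The multiplicative behavior of the word matrices stated in Prop. 2.2 can be checked numerically. The following is a minimal sketch in plain Python with exact fractions; the two-state acceptor and the helper names `mat_mul`, `identity`, and `word_matrix` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Unit matrix, used as the matrix of the empty word."""
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def word_matrix(P, word):
    """Return the product of the symbol matrices along the word."""
    n = len(next(iter(P.values())))
    result = identity(n)
    for symbol in word:
        result = mat_mul(result, P[symbol])
    return result


# Hypothetical two-state acceptor over the alphabet {"a", "b"}.
half, third = Fraction(1, 2), Fraction(1, 3)
P = {
    "a": [[half, half], [third, 1 - third]],
    "b": [[Fraction(1), Fraction(0)], [half, half]],
}

x, y = "ab", "ba"
lhs = word_matrix(P, x + y)                            # matrix of the word xy
rhs = mat_mul(word_matrix(P, x), word_matrix(P, y))    # product of the two matrices
```

Both sides agree entry by entry, and every row of a word matrix still sums to 1, as expected for a product of stochastic matrices.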
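Let $A = (S, \Sigma, P, \pi, f)$ be a stochastic acceptor and let $\lambda$ be a real number with $0 \le \lambda \le 1$. The set
$$L_{A,\lambda} = \{ x \in \Sigma^* \mid \pi P(x) f > \lambda \}$$
is the language of $A$ w.r.t. $\lambda$, and $\lambda$ is the cut point of $L_{A,\lambda}$.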
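- Example 1.
Let $m \ge 2$ be an integer. Consider the $m$-adic stochastic acceptor $A = (\{s_1, s_2\}, \{0, \ldots, m-1\}, P, \pi, f)$ with
$$P(a) = \begin{pmatrix} 1 - \frac{a}{m} & \frac{a}{m} \\ 1 - \frac{a+1}{m} & \frac{a+1}{m} \end{pmatrix}, \quad 0 \le a \le m-1, \qquad \pi = (1, 0), \qquad f = (0, 1)^T.$$
See Fig. 1. Each word $x = x_1 \cdots x_k \in \Sigma^*$ can be assigned the real number whose $m$-adic representation is $0.x_k \ldots x_1$. For each cut point $\lambda$, the accepted language is
$$L_{A,\lambda} = \{ x_1 \cdots x_k \in \Sigma^* \mid 0.x_k \ldots x_1 > \lambda \}.$$
Note that the language $L_{A,\lambda}$ is regular if and only if the cut point $\lambda$ is rational [3, 8, 9].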
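The example can be explored numerically. The sketch below assumes the standard two-state construction of the $m$-adic acceptor going back to Rabin [8]: the symbol $a$ has the matrix with rows $(1 - a/m,\ a/m)$ and $(1 - (a+1)/m,\ (a+1)/m)$, the automaton starts in the first state, and the second state is final. The function names are hypothetical.

```python
from fractions import Fraction


def m_adic_matrices(m):
    """Transition matrices for the symbols 0..m-1 (assumed construction)."""
    return {
        a: [[1 - Fraction(a, m), Fraction(a, m)],
            [1 - Fraction(a + 1, m), Fraction(a + 1, m)]]
        for a in range(m)
    }


def acceptance_probability(m, word):
    """Propagate the initial distribution (1, 0) through the word."""
    P = m_adic_matrices(m)
    row = [Fraction(1), Fraction(0)]
    for a in word:
        row = [row[0] * P[a][0][j] + row[1] * P[a][1][j] for j in range(2)]
    return row[1]  # probability mass on the final (second) state


def m_adic_value(m, word):
    """Value of the base-m fraction whose digits are the word reversed."""
    return sum(Fraction(a, m ** (i + 1)) for i, a in enumerate(reversed(word)))


def accepts(m, word, cut_point):
    """A word is accepted if its acceptance probability exceeds the cut point."""
    return acceptance_probability(m, word) > cut_point


word = [1, 0, 1]
prob = acceptance_probability(2, word)
value = m_adic_value(2, word)
```

For the binary word 1, 0, 1 both quantities equal 5/8, so the word is accepted for the cut point 1/2 and rejected for the cut point 3/4.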
For each input word $x \in \Sigma^*$, the stochastic matrix $P(x)$ can be viewed as generating a discrete-time Markov chain. Thus the behavior of a stochastic automaton is an interleaving of Markov chains, each of which corresponds to a single input symbol.
3 Input Marginals and Acceptance Probabilities
The input marginals and the acceptance probabilities can be computed by the technique of dynamic programming [2] using sum-product decomposition.
To see this, let be a stochastic acceptor with -element state set and -element input set . A stochastic acceptor can be viewed as a belief network. To this end, let be an integer. Let be random variables with common state set and let be random variables with common state set . The stochastic acceptor can be described for inputs of length by the belief network [1, 11, 18] as shown in Fig. 2. Then the corresponding joint probability distribution factoring according to the network is given by
[TABLE]
We assume for simplicity that the initial distributions are uniform; i.e., for all and . Moreover, the network is assumed to be homogeneous in the sense that the conditional distributions are independent of the index , . Therefore, we put
[TABLE]
It follows that the joint probability distribution has the form
[TABLE]
The probability of an input sequence is given by the marginal distribution
[TABLE]
The corresponding sum-product decomposition yields
[TABLE]
According to this decomposition, the marginal probability can be calculated by using an table :
[TABLE]
The time complexity of this algorithm is $O(kn^2)$, since the table has size $O(kn)$ and each table entry is computed in $O(n)$ steps. The marginal probabilities will be used in the EM and Baum-Welch algorithms later on.
On the other hand, the acceptance probability of an input sequence is given by the sum-product decomposition
[TABLE]
This decomposition can be used to compute the acceptance probability by using an table :
[TABLE]
Similarly, the time complexity of this algorithm is $O(kn^2)$, since the table has size $O(kn)$ and each table entry is computed in $O(n)$ steps.
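The table-based computation of the acceptance probability can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact table layout: the table has one row per prefix of the input word and one column per state, and the result is the product of the initial distribution, the symbol matrices, and the final state vector. A brute-force sum over all state sequences is included only as a cross-check; the function names and the example acceptor are hypothetical.

```python
from fractions import Fraction
from itertools import product


def acceptance_probability(pi, P, f, word):
    """Dynamic programming: one table row per prefix, O(n) work per entry."""
    n = len(pi)
    table = [list(pi)]  # row 0: distribution before any symbol is read
    for symbol in word:
        prev, matrix = table[-1], P[symbol]
        table.append([sum(prev[t] * matrix[t][s] for t in range(n))
                      for s in range(n)])
    return sum(table[-1][s] * f[s] for s in range(n))


def acceptance_brute_force(pi, P, f, word):
    """Sum over all state sequences; exponential in the word length."""
    n, k = len(pi), len(word)
    total = 0
    for states in product(range(n), repeat=k + 1):
        weight = pi[states[0]] * f[states[-1]]
        for i, symbol in enumerate(word):
            weight *= P[symbol][states[i]][states[i + 1]]
        total += weight
    return total


# Hypothetical two-state acceptor over {"a", "b"} with final state 1.
half, third = Fraction(1, 2), Fraction(1, 3)
P = {
    "a": [[half, half], [third, 1 - third]],
    "b": [[Fraction(1), Fraction(0)], [half, half]],
}
pi = [Fraction(1), Fraction(0)]
f = [0, 1]

fast = acceptance_probability(pi, P, f, "abba")
slow = acceptance_brute_force(pi, P, f, "abba")
```

The two results coincide, while the dynamic programming version touches only $(k+1) \cdot n$ table entries instead of $n^{k+1}$ state sequences.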
4 Parameter Estimation
The objective is to estimate the conditional probabilities of a stochastic acceptor from sample data. To this end, the stochastic acceptor is viewed as a belief network as described in the previous section. Let $A$ be a stochastic acceptor with $|S| = n$ and $|\Sigma| = m$, and let $k \ge 1$ be an integer. Take the parameter set
[TABLE]
where
[TABLE]
The aim is to estimate these probabilities by making use of a sample set. For this, assume that there is a collection of independent samples called database, where denotes the -th sample, . For simplicity, suppose the initial distributions are uniform as before, . Then the joint probability of the sample depending on the parameters is given by
[TABLE]
Thus the likelihood function is given by
[TABLE]
where is the number of times the input-state pair is observed in the sample set. Therefore, we have
[TABLE]
Let be the number of times the parameter occurs in the likelihood function . Then the likelihood function can be written (up to a constant) as
[TABLE]
The corresponding log-likelihood function is
[TABLE]
The data form the sufficient statistic of the model. These data can be obtained from the given data by the linear transformation
[TABLE]
where is an integral matrix with rows labeled by the triples with and . Moreover, the matrix has columns labeled by the pairs . The matrix has entry in row and column if the parameter occurs times in . Note that the matrix has column sum , since the quantity has factors.
- Example 2.
Consider the 2-adic stochastic acceptor with state set and input set , and let . The associated matrix is as follows,
[TABLE]
Proposition 4.1.
The maximum likelihood estimate of the likelihood function is given by
[TABLE]
Proof.
Let and . For each input-state pair , , , we have
[TABLE]
The parameters with appear in the log-likelihood function as the partial sum
[TABLE]
Using , the partial derivative of with respect to becomes
[TABLE]
Equating this expression to 0 gives as claimed. Thus the vector is a critical point of the likelihood function.
We claim that this point maximizes the likelihood function; the proof idea goes back to Koski et al. [11]. Indeed, let denote the entropy of a probability distribution and let denote the Kullback-Leibler measure between two probability distributions and . Then we have
[TABLE]
where , and for each input-state pair . Since the Kullback-Leibler measure is always non-negative [11], we obtain
[TABLE]
This proves the claim and the result follows. ∎
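For fully observed samples, the estimate can be computed by counting and normalizing. The sketch below assumes that the estimate in Prop. 4.1 is the relative frequency of each transition among all transitions with the same input-state pair, which is the usual closed form for likelihoods of this product type. The sample data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def transition_counts(samples):
    """Count how often each triple (symbol, state, next state) occurs.

    Each sample is a pair (inputs, states) with len(states) == len(inputs) + 1.
    """
    counts = Counter()
    for inputs, states in samples:
        for i, a in enumerate(inputs):
            counts[(a, states[i], states[i + 1])] += 1
    return counts


def maximum_likelihood_estimate(counts):
    """Normalize the counts over the next state for each (symbol, state) pair."""
    totals = defaultdict(int)
    for (a, s, _), c in counts.items():
        totals[(a, s)] += c
    return {(a, s, t): Fraction(c, totals[(a, s)])
            for (a, s, t), c in counts.items()}


# Hypothetical database of three samples with observed state sequences.
samples = [
    ("ab", (0, 1, 0)),
    ("aa", (0, 1, 1)),
    ("ab", (0, 0, 0)),
]
counts = transition_counts(samples)
theta = maximum_likelihood_estimate(counts)
```

In this toy database the pair (a, 0) is followed by state 1 twice and by state 0 once, so the estimated probabilities are 2/3 and 1/3. Transitions that never occur in the data do not appear in the result.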
A stochastic acceptor is an abstract machine with an input interface. Therefore, suppose the sample data consist only of the input sequences, while the observer has no access to the state sequences. This problem can be tackled by the expectation-maximization (EM) algorithm. This is an iterative method for finding maximum likelihood (or maximum a posteriori) estimates of parameters in a statistical model with unobserved latent variables.
The aim is to estimate these probabilities by making use of a sample set. For this, let be a stochastic acceptor in the above setting and let . We assume that there is a collection of independent samples called database, where denotes the -th input sample, . Then the probability of the sample depending on the parameters is given by the marginal distribution
[TABLE]
The likelihood function is given by
[TABLE]
and the log-likelihood function is
[TABLE]
where is the number of times the input sequence is observed in the sample set. Therefore, we have
[TABLE]
A version of the EM algorithm for stochastic acceptors is given by Alg. 1. Note that in the E-step, the marginal probabilities can be efficiently computed by the sum-product decomposition (3). In the M-step, the maximum likelihood estimate can be calculated directly by using Prop. 4.1. In the compare step, it can be shown that the inequality always holds [7, 17].
The structure of stochastic acceptors allows a more efficient implementation of the EM algorithm which amounts to a variant of the Baum-Welch algorithm [7, 18]. To see this, let be an integer. Let be a data vector, where is the number of times the input sequence is observed in the sample set. The full data vector is not available, where denotes the number of times the pair is observed. The EM algorithm estimates in the E-step the counts of the full data vector by the quantity
[TABLE]
These counts provide the sufficient statistic of the model and are used in the M-step to obtain updated parameter values based on the solution of the maximum likelihood problem in Prop. 4.1. The expected values of the sufficient statistic can be written in a way that leads to a more efficient implementation of the EM algorithm using dynamic programming.
For this, we introduce so-called forward and backward probabilities. The forward probability
[TABLE]
where and , is the joint probability that the prefix of the observed input sequence having length ends in state . For simplicity, assume that the initial distribution of is uniform; i.e., for all . Then we put .
The backward probability
[TABLE]
where and , is the conditional probability that the suffix of the observed input sequence having length starts in state .
The marginal probability of the observed input sequence can be calculated based on the forward probabilities,
[TABLE]
Note that the forward and backward probabilities can be recursively computed. To see this, consider for the input sequence the matrices and corresponding to the forward and backward probabilities, respectively. The entries of the matrices and can be efficiently calculated in an iterative manner,
[TABLE]
and
[TABLE]
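The iterative computation of forward and backward tables can be illustrated as follows. This sketch is a variant adapted to the acceptor itself and is not the paper's exact definition: the forward rows start from a given initial distribution, and the backward rows are weighted by the final state vector, so that the inner product of the two tables at any position recovers the acceptance probability. The function names and the example acceptor are hypothetical.

```python
from fractions import Fraction


def forward_table(pi, P, word):
    """Row i holds the state distribution after reading the first i symbols."""
    n = len(pi)
    table = [list(pi)]
    for a in word:
        prev = table[-1]
        table.append([sum(prev[t] * P[a][t][s] for t in range(n))
                      for s in range(n)])
    return table


def backward_table(f, P, word):
    """Row i holds, per state, the probability that the remaining suffix
    leads to a final state (assumption of this sketch)."""
    n = len(f)
    table = [list(f)]
    for a in reversed(word):
        nxt = table[0]
        table.insert(0, [sum(P[a][s][t] * nxt[t] for t in range(n))
                         for s in range(n)])
    return table


# Hypothetical two-state acceptor over {"a", "b"} with final state 1.
half, third = Fraction(1, 2), Fraction(1, 3)
P = {
    "a": [[half, half], [third, 1 - third]],
    "b": [[Fraction(1), Fraction(0)], [half, half]],
}
pi = [half, half]
f = [0, 1]
word = "abab"

F = forward_table(pi, P, word)
B = backward_table(f, P, word)
# Combining row i of both tables gives the same value at every position i.
combined = [sum(F[i][s] * B[i][s] for s in range(2))
            for i in range(len(word) + 1)]
```

Each table has $k + 1$ rows of length $n$ and every entry costs $O(n)$ operations, which matches the $O(kn^2)$ cost of the dynamic programming schemes in Section 3.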
Proposition 4.2.
In view of the sufficient statistic , we have for all and ,
[TABLE]
Proof.
Let denote the indicator function of a proposition ; i.e., if is true and otherwise. For each state sequence , we have
[TABLE]
Thus in view of (34), we obtain
[TABLE]
The innermost term is the sum of all probabilities of pairs for an input sequence and all state sequences such that and . That is, observing the input sequence and a transition from state to state at position with . Thus we have
[TABLE]
The result follows. ∎
The proposition shows that the calculation of the forward and backward probability matrices directly yields the sufficient statistic without the need to estimate the counts. This amounts to the Baum-Welch algorithm (Alg. 2). On the other hand, the EM algorithm requires maintaining the data set from which the sufficient statistic can be established.
References
- [1] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge Univ. Press, Cambridge (2012).
- [2] R. Bellman, Dynamic Programming, Dover Publications, Mineola N.Y. (2003).
- [3] V. Claus, Stochastische Automaten, Teubner, Stuttgart (1971).
- [4] A. Kondacs, J. Watrous, On the power of quantum finite state automata, Proc. 38th Annual Symposium on the Foundations of Computer Science, (1997), 66-75. http://dx.doi.org/10.1109/SFCS.1997.646094
- [5] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components, in: Automata Studies, C. Shannon and J. McCarthy (eds), Annals of Mathematics Studies, 34, Princeton Univ. Press, Princeton, NJ (1956). http://dx.doi.org/10.1515/9781400882618-003
- [6] J. W. Carlyle, Reduced forms for stochastic sequential machines, Journal of Mathematical Analysis and Applications, 7, No. 2 (1963), 167-175. http://dx.doi.org/10.1016/0022-247X(63)90045-3
- [7] L. Pachter, B. Sturmfels, Algebraic Statistics for Computational Biology, Cambridge Univ. Press, Cambridge (2005).
- [8] M. O. Rabin, Probabilistic automata, Information and Control, 6, No. 3 (1963), 230-245. http://dx.doi.org/10.1016/S0019-9958(63)90290-0
