
TL;DR
This paper explores the geometric structure of the Arimoto algorithm in information theory and introduces a new algorithm, the Backward em-algorithm, that monotonically increases Kullback-Leibler divergence, with broad potential applications.
Contribution
It reveals the information geometric structure of the Arimoto algorithm and proposes the Backward em-algorithm for increasing Kullback-Leibler divergence.
Findings
Revealed geometric structure of Arimoto algorithm
Proposed the Backward em-algorithm for divergence increase
Potential applications in statistics and information theory
Geometry of Arimoto algorithm
Shoji Toyota
Abstract
In information theory, the channel capacity, which indicates how efficient a given channel is, plays an important role. The most widely used algorithm for evaluating the channel capacity is Arimoto algorithm [4]. This paper aims to reveal the information geometric structure of Arimoto algorithm. In the course of this analysis, a new algorithm that monotonically increases the Kullback-Leibler divergence is proposed, which is named “the Backward em-algorithm.” Since the Backward em-algorithm is applicable whenever the Kullback-Leibler divergence needs to be increased, it has rich potential for application to many problems in statistics and information theory.
1 Introduction
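Since C. E. Shannon proposed the notion of channel capacity [1], it has played an important role in information theory. Given a channel ($\Omega_1, \Omega_2, r$), the channel capacity $C$ is defined as follows:
$$C := \sup_{q\in\mathcal{P}(\Omega_1)} I(r\cdot q).$$
Here, $\Omega_1$ and $\Omega_2$ denote finite sets and $r(\cdot\mid x)$ denotes a conditional probability on $\Omega_2$ for $x\in\Omega_1$. The symbol $I(r\cdot q)$ denotes the mutual information of the joint distribution $(r\cdot q)(x,y) := r(y\mid x)\,q(x)$, and $\mathcal{P}(\Omega_1)$ denotes the set of all probability distributions on $\Omega_1$.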
Arimoto algorithm [4] is known as the most widely used algorithm for evaluating the channel capacity of a memoryless channel; it updates the input distribution so that the mutual information increases. Although other algorithms have been proposed (e.g., [11]), they are essentially the same as Arimoto algorithm. This suggests that Arimoto algorithm is not just one algorithm among many but has some generic structure. The purpose of this paper is to give a theoretical justification of Arimoto algorithm from the information geometric point of view.
There exist papers whose purpose is similar to that of the present paper, for example [6], [7], [8] and [9]. However, [6] and [7] discuss only the channel capacity and not Arimoto algorithm. Although [8] refers to Arimoto algorithm, we think it does not sufficiently explain a theoretical justification of Arimoto algorithm from the information geometric point of view (see Section 4 for more information). The paper [9] tries to interpret Arimoto algorithm by using the Kullback-Leibler divergence, but to do so it extends the domain outside of the probability simplex. Since Information Geometry conventionally studies geometric structures on the probability simplex (i.e., “inside” the probability simplex), further study is needed to reveal the information geometric view of Arimoto algorithm. Since our analysis stays inside the probability simplex, it can be said that we deal with a more generic information geometric view than previous studies.
This paper is organized as follows. In Section 2, we summarize some terminology and results of Information Geometry. In Section 3, we explain the channel capacity and Arimoto algorithm. The information geometric view of the channel capacity is investigated in Section 4. In Section 5, we propose an algorithm naturally induced from the information geometric view of the channel capacity addressed in Section 4, and prove that this algorithm corresponds to Arimoto algorithm. We conclude the paper with brief remarks in Section 6.
2 Information Geometry
In a narrow sense, Information Geometry on a finite model is a dually flat structure on the probability simplex. In this section, we summarize the dually flat structure used in the present paper.
Definition 2.1.
Let $N$ be a manifold, $g$ be a Riemannian metric on $N$ and $\nabla$, $\nabla^{*}$ be affine connections on $N$. We call the triple $(g, \nabla, \nabla^{*})$ a dualistic structure on $N$ if
$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^{*}_X Z) \qquad (X, Y, Z \in \mathfrak{X}(N))$$
holds. Here, $\mathfrak{X}(N)$ denotes the set of all vector fields on $N$. Especially, if $\nabla$ and $\nabla^{*}$ are flat, we call the triple $(g, \nabla, \nabla^{*})$ a dually flat structure on $N$.
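Let $\Omega$ be a finite set. We can regard the probability simplex
$$\mathcal{P}(\Omega) := \Big\{\, p:\Omega\to\mathbb{R} \;\Big|\; p(x) > 0 \ (x\in\Omega),\ \sum_{x\in\Omega} p(x) = 1 \,\Big\}$$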
as an -dimensional submanifold of . The Fisher metric and the m-connection and the e-connection on are defined as follows:
[TABLE]
Here, denotes the Levi-Civita connection of and denotes the -tensor on defined by
[TABLE]
Note that the triple is a dually flat structure on [2, p.35, p36, Theorem 3.1]. and have the global affine coordinate systems and defined as follows:
[TABLE]
[TABLE]
-geodesics and -geodesics have the following interesting property.
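Theorem 2.2.
[2, Theorem 3.8] Let $p$, $q$ and $r$ be elements of the probability simplex $\mathcal{P}(\Omega)$. Assume that the $\nabla^{(m)}$-geodesic connecting $p$ and $q$ and the $\nabla^{(e)}$-geodesic connecting $q$ and $r$ are orthogonal at $q$. Then,
$$D(p\,\|\,r) = D(p\,\|\,q) + D(q\,\|\,r)$$
holds. Here, $D(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence defined as follows:
$$D(p\,\|\,q) := \sum_{x\in\Omega} p(x)\log\frac{p(x)}{q(x)}.$$
The above theorem is called the generalized Pythagorean theorem.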
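As a small numerical illustration of the divergence used throughout, the following sketch computes the Kullback-Leibler divergence (in nats) between probability vectors. The function name `kl_divergence` and the example vectors are hypothetical choices made here, not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in nats for two probability vectors."""
    # Terms with p(x) = 0 contribute 0 by the convention 0 * log 0 = 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]          # hypothetical distribution on three symbols
u = [1 / 3, 1 / 3, 1 / 3]    # uniform distribution on the same symbols

d_pu = kl_divergence(p, u)   # divergence from p to the uniform distribution
d_up = kl_divergence(u, p)   # arguments swapped
d_pp = kl_divergence(p, p)   # divergence of a distribution from itself
```

The divergence vanishes when both arguments coincide and is not symmetric in its arguments, which is why it is only a “distance” in quotation marks.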
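Next, we define $\nabla^{(m)}$-projections and $\nabla^{(e)}$-projections.
Definition 2.3.
Let $M$ be a submanifold of $\mathcal{P}(\Omega)$ and $p\in\mathcal{P}(\Omega)$. We call $q\in M$ a $\nabla^{(m)}$- resp. $\nabla^{(e)}$-projection of $p$ onto $M$ if the $\nabla^{(m)}$- resp. $\nabla^{(e)}$-geodesic connecting $p$ and $q$ is “orthogonal” to $M$ (with respect to the Fisher metric $g$) at $q$.
In general, neither a $\nabla^{(m)}$-projection nor a $\nabla^{(e)}$-projection is unique. However, if $M$ has the following property, the projection becomes unique.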
Definition 2.4.
Let be a submanifold of . We say that is -autoparallel if, for any , . Similarly, is said to be -autoparallel if, for any , . Here, denotes the tangent space of at embedded into the tangent space .
Theorem 2.5.
[2, Theorem 3.9] Let and be -autoparallel and -autoparallel submanifolds in respectively. Let . Then, a necessary and sufficient condition for to be a -projection of onto is that satisfies
[TABLE]
And the -projection onto is unique if it exists.
Similarly, a necessary and sufficient condition for to be a -projection of onto is that satisfies
[TABLE]
And the -projection onto is unique if it exists.
We often need to investigate whether or not a submanifold of is - or -autoparallel. The following theorem gives a sufficient condition for to be - and -autoparallel.
Theorem 2.6.
Assume that, for any and , the element
[TABLE]
belongs to .
Then, is -autoparallel.
Let be a submanifold of . Assume that, for any and , the element for which
[TABLE]
belongs to . Then is -autoparallel. Here, the constant , which is independent of , is defined by
[TABLE]
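The two closure conditions can be tried out numerically. The sketch below assumes that the conditions of Theorem 2.6 are the usual ones, namely closure under mixture interpolation and under normalized exponential interpolation; the helper names and the test distributions are hypothetical. It shows that the family of product distributions on a 2 x 2 alphabet is closed under exponential interpolation but not under mixture interpolation.

```python
import math

def m_mixture(p, q, t):
    """Mixture interpolation: (1 - t) * p + t * q."""
    return [(1 - t) * a + t * b for a, b in zip(p, q)]

def e_mixture(p, q, t):
    """Exponential interpolation: p^(1-t) * q^t, normalized to sum to one."""
    w = [a ** (1 - t) * b ** t for a, b in zip(p, q)]
    z = sum(w)  # normalization constant, independent of the outcome
    return [v / z for v in w]

def product(p1, p2):
    """Joint distribution p1(x) * p2(y), flattened row by row."""
    return [a * b for a in p1 for b in p2]

def independence_gap(joint, n1, n2):
    """Largest deviation of a joint distribution from the product of its marginals."""
    rows = [joint[i * n2:(i + 1) * n2] for i in range(n1)]
    m1 = [sum(r) for r in rows]
    m2 = [sum(r[j] for r in rows) for j in range(n2)]
    return max(abs(rows[i][j] - m1[i] * m2[j])
               for i in range(n1) for j in range(n2))

a = product([0.2, 0.8], [0.7, 0.3])
b = product([0.6, 0.4], [0.1, 0.9])
gap_e = independence_gap(e_mixture(a, b, 0.5), 2, 2)  # stays a product distribution
gap_m = independence_gap(m_mixture(a, b, 0.5), 2, 2)  # leaves the product family
```

Here `gap_e` is zero up to rounding, while `gap_m` is clearly positive, so the product family behaves like an exponential family rather than a mixture family.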
To prove Theorem 2.6, we need the following lemma.
Lemma 2.7.
[3, Theorem 3.7.3] Let with . Let be an n-dimensional flat manifold with respect to the affine connection , and assume that there exists a global affine coordinate system of . A necessary and sufficient condition for an m-dimensional submanifold of to be autoparallel is that there exists a local coordinate system , an ()-matrix such that and which satisfy
[TABLE]
Proof of Theorem 2.6.
Assume that satisfies the equation (2). Then is convex with respect to the -affine coordinate system . Fix . Take . If , we can take such that and are linearly independent. Repeating this, we can take such that are linearly independent. Define the “hyperplane” (with respect to ) by
[TABLE]
Since are linearly independent, we can see that . Noting that is a submanifold of and , we can see that there exists a local coordinate system of such that
[TABLE]
holds. From Lemma 2.7, we can see that is -autoparallel. The proof of the latter half is the same as the above proof. ∎
3 Channel capacity and Arimoto algorithm
In this paper, let () be finite sets, be the sets of all probability distributions on . Namely,
[TABLE]
where . Similarly, let be the set consisting of all probability distributions on .
A memoryless channel is expressed by a system where, for an input symbol , an output symbol is determined at random.
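Definition 3.1.
A channel is defined by a triple $(\Omega_1, \Omega_2, r)$ of finite sets $\Omega_1$, $\Omega_2$ and a map $r:\Omega_1\to\mathcal{P}(\Omega_2)$, $x\mapsto r(\cdot\mid x)$.
Definition 3.2.
We call the map $I:\mathcal{P}(\Omega_1\times\Omega_2)\to\mathbb{R}$ defined by
$$I(p) := \sum_{x\in\Omega_1}\sum_{y\in\Omega_2} p(x,y)\log\frac{p(x,y)}{p_1(x)\,p_2(y)}$$
the mutual information. In this equation, $p_1$ and $p_2$ mean the marginal distributions of $p$ on $\Omega_1$ and $\Omega_2$ respectively.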
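A minimal sketch of this definition, assuming the joint distribution is stored as a list of rows indexed by the input symbol; the function name and the two example distributions are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information in nats of a joint distribution given as rows joint[x][y]."""
    p1 = [sum(row) for row in joint]          # marginal on the first alphabet
    p2 = [sum(col) for col in zip(*joint)]    # marginal on the second alphabet
    return sum(
        pxy * math.log(pxy / (p1[x] * p2[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0                            # convention 0 * log 0 = 0
    )

independent = [[0.06, 0.14], [0.24, 0.56]]    # product of (0.2, 0.8) and (0.3, 0.7)
noiseless = [[0.5, 0.0], [0.0, 0.5]]          # second symbol copies a uniform bit

i_independent = mutual_information(independent)
i_noiseless = mutual_information(noiseless)
```

The first value is zero up to rounding because the two components are independent; the second equals log 2 nats, i.e. one bit.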
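Definition 3.3.
Given a channel $(\Omega_1, \Omega_2, r)$, the channel capacity $C$ is defined by
$$C := \sup_{q\in\mathcal{P}(\Omega_1)} I(r\cdot q), \qquad (r\cdot q)(x,y) := r(y\mid x)\,q(x).$$
Arimoto algorithm is to update $q$ from $q^{(t)}$ to
$$q^{(t+1)}(x) := \frac{q^{(t)}(x)\exp\!\big(D(r(\cdot\mid x)\,\|\,p^{(t)})\big)}{\sum_{x'\in\Omega_1} q^{(t)}(x')\exp\!\big(D(r(\cdot\mid x')\,\|\,p^{(t)})\big)},$$
where $p^{(t)}$ means the marginal distribution of $r\cdot q^{(t)}$ on $\Omega_2$. It is known that, by using this algorithm, $I(r\cdot q^{(t)})$ monotonically increases and converges to the channel capacity $C$ [4, Theorem 2].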
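The update can be sketched in a few lines. The code below assumes the standard form of the Arimoto update, in which each input probability is reweighted by the exponential of the divergence between its channel row and the current output distribution; the binary symmetric channel, the starting point and all function names are hypothetical choices for illustration.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def output_marginal(q, channel):
    """Output distribution sum_x q(x) r(y|x), with channel[x][y] = r(y|x)."""
    return [sum(q[x] * channel[x][y] for x in range(len(q)))
            for y in range(len(channel[0]))]

def mutual_information(q, channel):
    """Mutual information written as sum_x q(x) D(r(.|x) || output marginal)."""
    out = output_marginal(q, channel)
    return sum(q[x] * kl(channel[x], out) for x in range(len(q)))

def arimoto_step(q, channel):
    """One update: q(x) <- q(x) exp(D(r(.|x) || output marginal)) / normalizer."""
    out = output_marginal(q, channel)
    w = [q[x] * math.exp(kl(channel[x], out)) for x in range(len(q))]
    z = sum(w)
    return [v / z for v in w]

eps = 0.1                                   # crossover probability (assumed value)
bsc = [[1 - eps, eps], [eps, 1 - eps]]      # binary symmetric channel
q = [0.3, 0.7]                              # deliberately non-optimal start
history = [mutual_information(q, bsc)]
for _ in range(100):
    q = arimoto_step(q, bsc)
    history.append(mutual_information(q, bsc))

# Known capacity of the binary symmetric channel in nats: log 2 - h(eps).
closed_form = math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)
```

In this example the recorded mutual information never decreases, the input distribution approaches the uniform one, and the final value agrees with the closed-form capacity of the binary symmetric channel.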
4 Information geometric view of channel capacity in
Let us try to characterize the channel capacity from the information geometric point of view. The channel capacity in is discussed in [4] and [6]; let us review the outline. A probability distribution that attains the channel capacity satisfies the following interesting condition:
Theorem 4.1.
[4, Lemma 1], [6, pp. 554–555] Assume that a probability distribution attains the channel capacity . Then satisfies the following condition:
[TABLE]
where denotes the marginal distribution of on . Conversely, if there exist and satisfying
[TABLE]
then and are the channel capacity and a probability distribution that attains the channel capacity, respectively.
The proof is given in Section 7.1 for convenience’ sake. Theorem 4.1 tells us that, from information geometric view of , the channel capacity is a “circumcenter” of the polyhedron spanned by .
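The “circumcenter” picture can be checked numerically. The sketch below takes a hypothetical asymmetric binary channel, finds a capacity-achieving input with the standard Arimoto iteration (assumed here in its usual exponential-reweighting form), and then compares the divergences from each channel row to the resulting output distribution.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

channel = [[0.9, 0.1], [0.3, 0.7]]   # hypothetical channel, channel[x][y] = r(y|x)

def output(q):
    """Output distribution induced by the input distribution q."""
    return [sum(q[x] * channel[x][y] for x in range(2)) for y in range(2)]

# Approximate the capacity-achieving input by iterating the Arimoto update.
q = [0.5, 0.5]
for _ in range(2000):
    out = output(q)
    w = [q[x] * math.exp(kl(channel[x], out)) for x in range(2)]
    q = [v / sum(w) for v in w]

out = output(q)
distances = [kl(row, out) for row in channel]   # divergence of each row from the output
capacity = sum(q[x] * distances[x] for x in range(2))
```

After convergence both entries of `distances` coincide with `capacity`, so the optimal output distribution is equidistant, in the divergence sense, from the rows of the channel.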
[8] refers to an information geometric interpretation of Arimoto algorithm in , using the result of Theorem 4.1:
Given a current guess , we should check the Kullback-Leibler divergences and move the output distribution closer to those for which is large. This can be achieved by increasing the respective weights , consistent with the recursion (5) that increases (decreases) those input probabilities for which is above (below) the average .
Although the explanation seems intuitively valid, it does not seem to succeed in accurately revealing the behavior of in as is updated. Therefore, in our opinion, further research is needed to reveal the information geometric view of Arimoto algorithm.
In this section, we reconsider the information geometric view of the channel capacity in . We may be able to see some interesting structure in which is hidden in .
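Define subsets $M$ and $E$ of $\mathcal{P}(\Omega_1\times\Omega_2)$ by
$$M := \{\, r\cdot q \mid q\in\mathcal{P}(\Omega_1) \,\}, \qquad E := \{\, q_1\otimes q_2 \mid q_1\in\mathcal{P}(\Omega_1),\ q_2\in\mathcal{P}(\Omega_2) \,\},$$
where $(r\cdot q)(x,y) := r(y\mid x)\,q(x)$ and $(q_1\otimes q_2)(x,y) := q_1(x)\,q_2(y)$. From Theorem 2.6, we can see that $M$ is $\nabla^{(m)}$-autoparallel and $E$ is $\nabla^{(e)}$-autoparallel.
Lemma 4.2.
For $p\in\mathcal{P}(\Omega_1\times\Omega_2)$, the $\nabla^{(m)}$-projection of $p$ onto $E$ is $p_1\otimes p_2$, where $p_1$ and $p_2$ are defined by
$$p_1(x) := \sum_{y\in\Omega_2} p(x,y), \qquad p_2(y) := \sum_{x\in\Omega_1} p(x,y),$$
that is, $p_1$ and $p_2$ are the marginal distributions of $p$.
The proof is given in Section 7.2. By utilizing Lemma 4.2, the channel capacity is expressed as follows:
$$C = \sup_{p\in M} D\big(p\,\big\|\,\Pi_E(p)\big), \tag{8}$$
where $\Pi_E(p)$ means the $\nabla^{(m)}$-projection of $p$ onto $E$. The formula (8) says that, from the viewpoint of geometry in $\mathcal{P}(\Omega_1\times\Omega_2)$, the channel capacity $C$ is the longest “distance” (between $p$ and $\Pi_E(p)$) from $M$ to $E$ (Fig. 1).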
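The projection statement can be probed with a brute-force search. The sketch below fixes a hypothetical joint distribution on a 2 x 2 alphabet and scans a grid of product distributions, comparing the smallest divergence found with the divergence to the product of the marginals, which is the mutual information.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def product(a, b):
    """Product distribution a(x) * b(y), flattened row by row."""
    return [x * y for x in a for y in b]

# Hypothetical joint distribution on a 2 x 2 alphabet, flattened row by row.
joint = [0.40, 0.10, 0.15, 0.35]
p1 = [joint[0] + joint[1], joint[2] + joint[3]]   # marginal of the first component
p2 = [joint[0] + joint[2], joint[1] + joint[3]]   # marginal of the second component

at_marginals = kl(joint, product(p1, p2))          # equals the mutual information

# Grid search over product distributions (s, 1 - s) x (t, 1 - t).
grid = [k / 100 for k in range(1, 100)]
best = min(
    (kl(joint, product([s, 1 - s], [t, 1 - t])), s, t)
    for s in grid for t in grid
)
```

The minimizer found on the grid is the pair of marginals (0.5, 0.5) and (0.55, 0.45), and no grid point achieves a smaller divergence than the product of the marginals.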
5 Backward em-algorithm
In Section 4, we revealed an information geometric structure of the channel capacity in . Therefore, if we can construct an algorithm that monotonically increases the Kullback-Leibler divergence, we can expect it to be useful for evaluating the channel capacity.
An algorithm that monotonically decreases the Kullback-Leibler divergence is well known as “the em-algorithm” [10]. Then how can we increase the Kullback-Leibler divergence? A natural candidate is to project onto a ()-autoparallel submanifold by a ()-geodesic. However, since such a projection is only a critical point of the Kullback-Leibler divergence, it may sometimes decrease the divergence. Hence, an algorithm based on this idea does not necessarily increase the Kullback-Leibler divergence at every step, nor is it guaranteed to converge to the channel capacity .
To overcome this difficulty, let us rewind the em-algorithm, just as one rewinds a film.
Definition 5.1.
Define , and in the same way as Section 3. For , update as follows:
- Backward e-step.
Search such that the unique -projection from onto M is .
- Backward m-step.
Search such that the unique -projection from onto E is .
We call this algorithm “the Backward em-algorithm” (see Fig. 1).
Theorem 5.2.
By using the Backward em-algorithm, increases as is updated. Namely, the following equality
[TABLE]
holds.
Proof.
[TABLE]
Note that the second and third equalities follow from the generalized Pythagorean theorem. ∎
Although we have defined the Backward em-algorithm, it is not yet clear whether, for a given probability distribution , there exists any probability distribution that satisfies . Therefore, it is not trivial that the Backward e-step can be carried out. For , does there exist a probability distribution that satisfies ? And if so, can we write it explicitly? The following theorem answers both questions affirmatively.
Theorem 5.3.
Let . Then the following two statements for and are equivalent:
1. satisfies
[TABLE]
where denotes the -projection from onto .
2. satisfies
[TABLE]
Proof.
Fix contained in . Define by
[TABLE]
Noting that
[TABLE]
we see that (9) is equivalent to the following:
[TABLE]
Observing that
[TABLE]
we can see that
[TABLE]
which concludes the proof. ∎
From Theorem 5.3, we can deduce the following interesting theorem.
Theorem 5.4.
The subset of defined by
[TABLE]
is -autoparallel.
Proof.
It suffices to prove that, for any and contained in and any with , there exists contained in satisfying
[TABLE]
where the normalization term is defined by
[TABLE]
Calculating the left-hand side (LHS) of (11), we obtain
(LHS) .
Let us calculate . Noting that the pairs () and () satisfy (10),
[TABLE]
where the normalization factors and are defined by
[TABLE]
Define by
[TABLE]
where
[TABLE]
Then we can see that can be rewritten as follows by using defined by (12):
[TABLE]
Set by
[TABLE]
where . Then, we obtain
[TABLE]
Hence
[TABLE]
holds, and therefore it concludes the proof. ∎
Theorem 5.3 and Theorem 5.4 tell us that, for any probability distribution , we can carry out the Backward e-step and the set of candidates for the Backward e-step is an exponential family.
Next, let us consider whether or not we can carry out the Backward m-step. Which element should we choose in to carry out the Backward m-step? That is, under what condition on does there exist such that holds?
To investigate this question, let be the embedding of into by -projection. Assume that there exists an intersection of with (its existence and uniqueness are discussed in Section 6). Let . Then, for , we can carry out the Backward m-step. Conversely, assume that, for , we can carry out the Backward m-step. Then, . Hence, the problem of finding where the Backward m-step can be carried out is equivalent to that of finding an intersection of with .
The element depends only on because, for a given , the requirement that determines by the equation (10). Therefore, from now on, we regard as a function of determined by the requirement that (i.e., the equation (10)). Taking this into consideration, we consider the condition on such that
[TABLE]
holds. Noting that (see Lemma 4.2), where denotes the marginal distribution of , the condition (14) of is equivalent to the following condition:
[TABLE]
The above condition comes down to solving the following nonlinear equation with respect to :
[TABLE]
Rewriting this as
[TABLE]
[TABLE]
we see that it is difficult to solve the nonlinear equation (16) with respect to . If we can solve the equation (16), we can prove that converges to the channel capacity . The proof is given in Section 7.4.
As it is difficult to solve the equation (16) with respect to , we approximate (16) by an equation that can be solved. A natural choice is to replace by a value that is independent of , so that it becomes a constant. It seems reasonable to approximate of by the “circumcenter” of the figure induced from in , that is, the probability distribution contained in that attains the channel capacity (see Theorem 4.1). Then, observing that becomes independent of , (14) is rewritten as
[TABLE]
that can be solved. The merit of this approximation is that has disappeared from the equation (17). Namely, even if we do not know the value of , we can solve (17). Since the solution of (17) is , the approximation designates the element of by where is defined by
[TABLE]
In the present paper, we call the approximation the approximate Backward e-step.
By the above approximation, we can solve the equation (16). But, in return for the approximation, is not necessarily an intersection of with , and therefore we need to approximate the Backward m-step too. In the present paper, we approximate the Backward m-step by the -projection of onto . A short computation shows that
[TABLE]
The proof is given in Section 7.3. We call this approximation the approximate Backward m-step.
Combining the approximate Backward m- and e-steps, is updated by , and therefore, is updated by , which is nothing but Arimoto algorithm (Fig. 2).
6 Concluding Remarks
In the present paper, we investigated the channel capacity from the information geometric point of view in . Then, we introduced a new algorithm that monotonically increases the Kullback-Leibler divergence, “the Backward em-algorithm.” The Backward e-step can be determined explicitly, but the Backward m-step cannot. Hence, we tried to approximate the Backward m-step, which corresponds to Arimoto algorithm.
There are many open problems left. First, the existence and uniqueness of an intersection of with should be studied. To study this problem, we may consider the existence and uniqueness of a solution of the equation (16). If we can prove that a solution of the equation (16) exists, then even if we cannot solve it explicitly, we may be able to introduce other approximations of the Backward e- and m-steps and accelerate Arimoto algorithm.
It seems interesting to apply the Backward em-algorithm to other subjects. To our knowledge, there has been no algorithm that monotonically increases the Kullback-Leibler divergence. The Backward em-algorithm can be used whenever we want to increase the Kullback-Leibler divergence between two manifolds. For example, in independent component analysis and machine learning, we often need to increase the mutual information (e.g., [12, 13]). In these situations, the Backward em-algorithm may work well because the information geometric view of the mutual information is the Kullback-Leibler divergence between two manifolds.
Acknowledgments
The author gratefully acknowledges the continuous encouragement from Toru Ohira and Hideyuki Ishi. The author also thanks Professors Amor Keziou, Hiroshi Matsuzoe, Masahito Hayashi, Shiro Ikeda and Phillippe Regnault for their helpful discussions and comments.
7 Appendices
7.1 Proof of Theorem 4.1
(First half): Define a function by
[TABLE]
where means a Lagrange multiplier. For the mutual information to take a maximum point at , it is necessary that
[TABLE]
Since , (19) is rewritten as
[TABLE]
and it follows immediately that corresponds to the channel capacity and the relation (6) holds.
(Second half): The minimax redundancy, defined by
[TABLE]
coincides with the channel capacity [5, Theorem 13.1.1], where is the marginal distribution of . Since, for satisfying (7), the equality holds, it follows that . Noting that satisfies and taking the definition of the channel capacity into consideration, it also follows that , and therefore, .
7.2 Proof of Lemma 4.2
Take any contained in . Then
[TABLE]
holds and the lower bound [math] is attained if and only if (since holds [5, p.31]). Observing that the -projection onto the -autoparallel submanifold is characterized by
[TABLE]
it concludes the proof.
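The decomposition behind this argument can be written out explicitly. The following is a sketch in our own notation, assuming the elided display is the standard one: $q_1\otimes q_2$ is an arbitrary product distribution and $p_1$, $p_2$ are the marginal distributions of $p$.

```latex
\begin{align*}
D(p \,\|\, q_1 \otimes q_2)
  &= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p_1(x)\,p_2(y)}
   + \sum_{x,y} p(x,y)\log\frac{p_1(x)}{q_1(x)}
   + \sum_{x,y} p(x,y)\log\frac{p_2(y)}{q_2(y)} \\
  &= D(p \,\|\, p_1 \otimes p_2) + D(p_1 \,\|\, q_1) + D(p_2 \,\|\, q_2) \\
  &\ge D(p \,\|\, p_1 \otimes p_2).
\end{align*}
```

Equality holds exactly when $D(p_1\,\|\,q_1) = D(p_2\,\|\,q_2) = 0$, that is, when $q_1 = p_1$ and $q_2 = p_2$, which is the non-negativity property of the divergence cited from [5, p.31].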
7.3 Proof of the equation (18)
It suffices to prove the following lemma.
Lemma 7.1.
Let . Then is one of the candidates for - projections of onto .
Note that since is not -autoparallel, a - projection onto is not necessarily unique.
Proof.
Take any contained in . Then
[TABLE]
holds and the lower bound [math] is attained if and only if . Observing that, if
[TABLE]
holds, is one of the candidates for - projections of onto [2, Theorem 3.10], it concludes the proof. ∎
7.4 Convergence of the Backward em-algorithm
In this section, we assume that the equation (16) can be solved and that can be updated any number of times by the Backward em-algorithm.
Theorem 7.2.
converges to the channel capacity as is updated.
Lemma 7.3.
Let . Then,
[TABLE]
where .
Proof.
[TABLE]
∎
Proof of Theorem 7.2. It suffices to prove that converges to the channel capacity . Let be a probability distribution that attains the channel capacity . First, let us prove that
[TABLE]
Calculating , we obtain
[TABLE]
and therefore, we obtain the inequality (20), where denotes the marginal distribution of on . Summing both sides of the inequality (20), we have
[TABLE]
Noting that and is independent of , we can see that the sequence converges [math]. ∎
References
- [1] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, 379–423 and 623–656 (1948).
- [2] S. Amari and H. Nagaoka, Methods of Information Geometry (AMS and Oxford, 2000).
- [3] A. Fujiwara, Foundations of Information Geometry (Makino Shoten, Tokyo, 2015), in Japanese.
- [4] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless channels,” IEEE Trans. Inf. Theory, vol. 18, 14–20 (1972).
- [5] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley, 2006).
- [6] J. Takeuchi and S. Ikeda, “An Information Geometrical Study on Communication Channel Capacity,” Symposium on Information Theory and its Applications, vol. 33, 553–558 (2010).
- [7] K. Nakagawa, K. Watanabe and T. Sabu, “On the Search Algorithm for the Output Distribution That Achieves the Channel Capacity,” IEEE Trans. Inf. Theory, vol. 63, 1043–1062 (2017).
- [8] G. Matz and P. Duhamel, “Information geometric formulation and interpretation of accelerated Blahut-Arimoto-Type algorithms,” in Proc. Information Theory Workshop, 24–29, San Antonio, Texas, October (2004).
