Geometry of Arimoto Algorithm

Shoji Toyota

arXiv:1902.05228·cs.IT·April 4, 2022

TL;DR

This paper explores the geometric structure of the Arimoto algorithm in information theory and introduces a new algorithm, the Backward em-algorithm, that monotonically increases Kullback-Leibler divergence, with broad potential applications.

Contribution

It reveals the information geometric structure of the Arimoto algorithm and proposes the Backward em-algorithm for increasing Kullback-Leibler divergence.

Findings

01. Revealed the geometric structure of the Arimoto algorithm

02. Proposed the Backward em-algorithm for increasing the divergence

03. Potential applications in statistics and information theory

Abstract

In information theory, the channel capacity, which indicates how efficient a given channel is, plays an important role. The most widely used algorithm for evaluating the channel capacity is the Arimoto algorithm. This paper aims to reveal the information geometric structure of the Arimoto algorithm. In the course of this analysis, a new algorithm that monotonically increases the Kullback-Leibler divergence is proposed, which is called "the Backward em-algorithm." Since the Backward em-algorithm is available in many cases where the Kullback-Leibler divergence needs to be increased, it has considerable potential for application to problems in statistics and information theory.

Equations

$$C:=\sup_{q(x)\in\mathcal{S}_{1}}I(q(x)\cdot r(y|x))$$

$$Xg(Y,Z)=g(\nabla_{X}Y,Z)+g(Y,\nabla^{*}_{X}Z)\quad(\forall X,Y,Z\in\chi(N))$$

$$\mathcal{S}:=\Big\{p:\Omega\to\mathbb{R}_{++};\ \sum_{x\in\Omega}p(x)=1\Big\}$$

$$g_{p}(X,Y):=\sum_{\omega\in\Omega}p(\omega)(X\log p(\omega))(Y\log p(\omega))$$

$$g_{p}(\nabla^{(m)}_{X}Y,Z):=g_{p}(\overline{\nabla}_{X}Y,Z)+\frac{1}{2}S_{p}(X,Y,Z)$$

$$g_{p}(\nabla^{(e)}_{X}Y,Z):=g_{p}(\overline{\nabla}_{X}Y,Z)-\frac{1}{2}S_{p}(X,Y,Z)$$

$$S_{p}(X,Y,Z):=\sum_{\omega\in\Omega}p(\omega)(X\log p(\omega))(Y\log p(\omega))(Z\log p(\omega))$$

$$\eta_{i}:=p(i)$$

$$\theta_{j}:=\log\frac{p(j)}{p(|\Omega|)}$$

$$D(p||q)+D(q||r)=D(p||r)$$

$$D(p_{1}||p_{2}):=\sum_{x\in\Omega}p_{1}(x)\log\frac{p_{1}(x)}{p_{2}(x)}\quad(p_{1},p_{2}\in\mathcal{S})$$

$$\hat{p}=\underset{p'\in E}{\mathrm{argmin}}\,D(p||p')$$

$$\hat{p}=\underset{p'\in M}{\mathrm{argmin}}\,D(p'||p)$$

$$p_{3}:=tp_{1}+(1-t)p_{2}\in\mathcal{S}$$

$$\log p_{3}=t\log p_{1}+(1-t)\log p_{2}+A$$

$$A:=\log\Big\{\sum_{\omega\in\Omega}\exp\{t\log p_{1}(\omega)+(1-t)\log p_{2}(\omega)\}\Big\}^{-1}$$

$$\begin{pmatrix}x^{1}\\ \vdots\\ x^{m}\\ x^{m+1}\\ \vdots\\ x^{n}\end{pmatrix}=A\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix}+b$$

$$T(p,p_{1},\ldots,p_{m}):=\left\{\hat{p}\in\mathcal{S}\ \middle|\ \hat{p}=p+\left[p_{1}-p,\ldots,p_{m}-p\right]\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix},\ (\xi^{a})_{a=1}^{m}\in\mathbb{R}^{m}\right\}$$

$$\begin{pmatrix}\eta^{1}\\ \vdots\\ \eta^{m}\\ \eta^{m+1}\\ \vdots\\ \eta^{n}\end{pmatrix}=\left[p_{1}-p,\ldots,p_{m}-p\right]\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix}+p$$

$$\mathcal{S}_{i}:=\Big\{p:\Omega_{i}\to\mathbb{R}_{++};\ \sum_{x\in\Omega_{i}}p(x)=1\Big\}\quad(i=1,2)$$

$$I(p(x,y)):=D(p(x,y)||q(x)\cdot r(y))$$

$$q^{(t+1)}(x):=\frac{q^{(t)}(x)\exp\{D(r(y|x)||r^{(t)}(y))\}}{\sum_{x'}q^{(t)}(x')\exp\{D(r(y|x')||r^{(t)}(y))\}}$$

$$D(r(y|x)||r_{\hat{q}}(y))=C\quad(\forall x\in\Omega_{1})$$

$$D(r(y|x)||r_{\hat{q}}(y))=\hat{C}\quad(\forall x\in\Omega_{1})$$

$$q(x):=\sum_{y\in\Omega_{2}}p(x,y),\qquad r(y):=\sum_{x\in\Omega_{1}}p(x,y)$$

$$C=\sup_{p(x,y)\in M}D(p(x,y)||\Pi^{(m)}(p(x,y)))$$

$$I(p^{(t)}(x,y))\leq I(p^{(t+1)}(x,y))$$
Full text

Abstract

In information theory, the channel capacity, which indicates how efficient a given channel is, plays an important role. The most widely used algorithm for evaluating the channel capacity is the Arimoto algorithm [4]. This paper aims to reveal the information geometric structure of the Arimoto algorithm. In the course of this analysis, a new algorithm that monotonically increases the Kullback-Leibler divergence is proposed, which is named “the Backward em-algorithm.” Since the Backward em-algorithm is available in many cases where the Kullback-Leibler divergence needs to be increased, it has rich potential for application to many problems in statistics and information theory.

1 Introduction
2 Information Geometry
3 Channel capacity and Arimoto algorithm
4 Information geometric view of channel capacity in $\mathcal{S}_{3}$
5 Backward em-algorithm
6 Concluding Remarks
7 Appendices
7.1 Proof of Theorem 4.1
7.2 Proof of Lemma 4.2
7.3 Proof of the equation (18)
7.4 Convergence of the Backward em-algorithm

1 Introduction

Since C. E. Shannon proposed the notion of channel capacity [1], it has played an important role in information theory. Given a channel ( $\Omega_{1},r(y|x),\Omega_{2}$ ), the channel capacity $C$ is defined as follows:

$$C:=\sup_{q(x)\in\mathcal{S}_{1}}I(q(x)\cdot r(y|x)).$$

Here, $\Omega_{1}$ and $\Omega_{2}$ denote finite sets and $r(y|x)$ denotes a conditional probability on $\Omega_{2}$ for $x\in\Omega_{1}$ . The symbol $I$ denotes the mutual information of $q(x)\cdot r(y|x)$ and $\mathcal{S}_{1}$ denotes the set of all probability distributions on $\Omega_{1}$ .

The Arimoto algorithm [4] is the most widely used algorithm for evaluating the channel capacity of a memoryless channel; it updates $q^{(t)}(x)\in\mathcal{S}_{1}$ so that $I(q^{(t)}(x)\cdot r(y|x))$ increases. Although other algorithms have been proposed (e.g., [11]), they are essentially the same as the Arimoto algorithm. This suggests that the Arimoto algorithm is not just one algorithm among many but reflects some generic structure. The purpose of this paper is to give a theoretical justification of the Arimoto algorithm from the information geometric point of view.

Several papers have a purpose similar to that of the present paper, for example [6], [7], [8] and [9]. However, [6] and [7] discuss only the channel capacity and not the Arimoto algorithm. Although [8] refers to the Arimoto algorithm, we think it does not sufficiently explain a theoretical justification of the algorithm from the information geometric point of view (see Section 4 for more information). The paper [9] interprets the Arimoto algorithm by using the Kullback-Leibler divergence, but to do so it extends the domain outside of the probability simplex. Since Information Geometry conventionally studies geometric structures on the probability simplex ($i.e.$, “inside” the probability simplex), further studies are needed to reveal the information geometric view of the Arimoto algorithm. Since our analysis stays inside the probability simplex, it can be said that we deal with a more generic information geometric view than previous studies.

This paper is organized as follows. In Section 2, we summarize some terminology and results of Information Geometry. In Section 3, we explain the channel capacity and the Arimoto algorithm. The information geometric view of the channel capacity is investigated in Section 4. In Section 5, we propose an algorithm naturally induced from the information geometric view of the channel capacity given in Section 4, and show that an approximation of this algorithm corresponds to the Arimoto algorithm. We conclude the paper with brief remarks in Section 6.

2 Information Geometry

In a narrow sense, Information Geometry on a finite model is a dually flat structure on the probability simplex. In this section, we summarize the dually flat structure used in the present paper.

Definition 2.1.

Let $N$ be a $C^{\infty}$ manifold, $g$ be a Riemannian metric on $N$ and $\nabla$, $\nabla^{*}$ be affine connections on $N$. We call the triple $(g,\nabla,\nabla^{*})$ a dualistic structure on $N$ if

$$Xg(Y,Z)=g(\nabla_{X}Y,Z)+g(Y,\nabla^{*}_{X}Z)\quad(\forall X,Y,Z\in\chi(N))$$

holds. Here, $\chi(N)$ denotes the set of all vector fields on $N$. In particular, if $\nabla$ and $\nabla^{*}$ are flat, we call the triple $(g,\nabla,\nabla^{*})$ a dually flat structure on $N$.

Let $\Omega$ be a finite set. We can regard the probability simplex

$$\mathcal{S}:=\Big\{p:\Omega\to\mathbb{R}_{++};\ \sum_{x\in\Omega}p(x)=1\Big\}$$

as an $(|\Omega|-1)$-dimensional submanifold of $\mathbb{R}^{|\Omega|}$. The Fisher metric $g$, the m-connection $\nabla^{(m)}$ and the e-connection $\nabla^{(e)}$ on $\mathcal{S}$ are defined as follows:

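$$g_{p}(X,Y):=\sum_{\omega\in\Omega}p(\omega)(X\log p(\omega))(Y\log p(\omega)),$$

$$g_{p}(\nabla^{(m)}_{X}Y,Z):=g_{p}(\overline{\nabla}_{X}Y,Z)+\frac{1}{2}S_{p}(X,Y,Z),$$

$$g_{p}(\nabla^{(e)}_{X}Y,Z):=g_{p}(\overline{\nabla}_{X}Y,Z)-\frac{1}{2}S_{p}(X,Y,Z).$$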

Here, $\overline{\nabla}$ denotes the Levi-Civita connection of $g$ and $S$ denotes the $(0,3)$-tensor on $\mathcal{S}$ defined by

$$S_{p}(X,Y,Z):=\sum_{\omega\in\Omega}p(\omega)(X\log p(\omega))(Y\log p(\omega))(Z\log p(\omega)).$$

Note that the triple $(g,\nabla^{(m)},\nabla^{(e)})$ is a dually flat structure on $\mathcal{S}$ [2, p.35, p.36, Theorem 3.1]. The connections $\nabla^{(m)}$ and $\nabla^{(e)}$ have the global affine coordinate systems $(\eta_{i})^{|\Omega|-1}_{i=1}$ and $(\theta_{j})^{|\Omega|-1}_{j=1}$, respectively, defined as follows (identifying $\Omega$ with $\{1,\ldots,|\Omega|\}$):

$$\eta_{i}:=p(i),\qquad\theta_{j}:=\log\frac{p(j)}{p(|\Omega|)}.$$
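As a minimal sketch of the two affine coordinate systems, the following Python functions convert a distribution on $\Omega=\{1,\ldots,n\}$ to its $\eta$-coordinates ($\eta_{i}=p(i)$) and $\theta$-coordinates ($\theta_{j}=\log(p(j)/p(n))$) and back. The function names and the list representation are illustrative assumptions and are not taken from the paper.

```python
import math


def to_eta(p):
    """Mixture (m-affine) coordinates: the first n-1 probabilities."""
    return list(p[:-1])


def from_eta(eta):
    """Recover p from eta by restoring the last probability."""
    return list(eta) + [1.0 - sum(eta)]


def to_theta(p):
    """Exponential (e-affine) coordinates: log-ratios against the last outcome."""
    return [math.log(pi / p[-1]) for pi in p[:-1]]


def from_theta(theta):
    """Recover p from theta: exponentiate, append 1 for the last outcome, normalize."""
    weights = [math.exp(t) for t in theta] + [1.0]
    total = sum(weights)
    return [w / total for w in weights]


p = [0.5, 0.3, 0.2]
eta = to_eta(p)      # two free coordinates for three outcomes
theta = to_theta(p)  # two free coordinates for three outcomes
```

Both maps are invertible on the open simplex, which is what makes each of them a global coordinate system.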

$\nabla^{(m)}$ -geodesics and $\nabla^{(e)}$ -geodesics have the following interesting property.

Theorem 2.2.

[2, Theorem 3.8] Let $p,q$ and $r$ be elements of $\mathcal{S}$. Assume that the $\nabla^{(m)}$-geodesic connecting $p$ and $q$ and the $\nabla^{(e)}$-geodesic connecting $q$ and $r$ are orthogonal at $q$. Then,

$$D(p||q)+D(q||r)=D(p||r)$$

holds. Here, $D(\cdot||\cdot)$ denotes the Kullback-Leibler divergence defined as follows:

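$$D(p_{1}||p_{2}):=\sum_{x\in\Omega}p_{1}(x)\log\frac{p_{1}(x)}{p_{2}(x)}\quad(p_{1},p_{2}\in\mathcal{S}).$$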

The above theorem is called the generalized Pythagorean theorem.

Next, we define $\nabla^{(m)}$ -projections and $\nabla^{(e)}$ -projections.

Definition 2.3.

Let $K$ be a submanifold of $\mathcal{S}$ and $p\in\mathcal{S}$. We call $\hat{p}\in K$ a $\nabla^{(m)}$- (resp. $\nabla^{(e)}$-) projection of $p$ onto $K$ if the $\nabla^{(m)}$- (resp. $\nabla^{(e)}$-) geodesic connecting $p$ and $\hat{p}$ is “orthogonal” to $K$ (with respect to the Fisher metric $g$) at $\hat{p}$.

In general, neither a $\nabla^{(m)}$-projection nor a $\nabla^{(e)}$-projection is unique. However, if $K$ has the following property, the projection becomes unique.

Definition 2.4.

Let $K$ be a submanifold of $\mathcal{S}$. We say that $K$ is $\nabla^{(m)}$-autoparallel if, for any $X,Y\in\chi(K)$ and any $p\in K$, $\nabla^{(m)}_{X}Y|_{p}\in T_{p}(K)$. Similarly, $K$ is said to be $\nabla^{(e)}$-autoparallel if, for any $X,Y\in\chi(K)$ and any $p\in K$, $\nabla^{(e)}_{X}Y|_{p}\in T_{p}(K)$. Here, $T_{p}(K)$ denotes the tangent space of $K$ at $p$ embedded into the tangent space $T_{p}(\mathcal{S})$.

Theorem 2.5.

[2, Theorem 3.9] Let $M$ and $E$ be $\nabla^{(m)}$-autoparallel and $\nabla^{(e)}$-autoparallel submanifolds of $\mathcal{S}$, respectively. Let $p\in\mathcal{S}$. Then, a necessary and sufficient condition for $\hat{p}\in E$ to be a $\nabla^{(m)}$-projection of $p$ onto $E$ is that $\hat{p}$ satisfies

$$\hat{p}=\underset{p'\in E}{\mathrm{argmin}}\,D(p||p').$$

And the $\nabla^{(m)}$ -projection onto $E$ is unique if it exists.

Similarly, a necessary and sufficient condition for $\hat{p}$ to be a $\nabla^{(e)}$ -projection of $p$ onto $M$ is that $\hat{p}$ satisfies

$$\hat{p}=\underset{p'\in M}{\mathrm{argmin}}\,D(p'||p).$$

And the $\nabla^{(e)}$ -projection onto $M$ is unique if it exists.

We often need to investigate whether or not a submanifold $M$ of $\mathcal{S}$ is $\nabla^{(m)}$ or $\nabla^{(e)}$ -autoparallel. The following theorem gives a sufficient condition for $M$ to be $\nabla^{(m)}$ - and $\nabla^{(e)}$ -autoparallel.

Theorem 2.6.

Let $M$ be a submanifold of $\mathcal{S}$. Assume that, for any $p_{1},p_{2}\in M$ and $t\in(0,1)$, the element

$$p_{3}:=tp_{1}+(1-t)p_{2}\in\mathcal{S}\tag{2}$$

belongs to $M$. Then, $M$ is $\nabla^{(m)}$-autoparallel.

Let $E$ be a submanifold of $\mathcal{S}$ . Assume that, for any $p_{1},p_{2}\in E$ and $t\in(0,1)$ , the element $p_{3}$ for which

$$\log p_{3}=t\log p_{1}+(1-t)\log p_{2}+A$$

belongs to $E$. Then $E$ is $\nabla^{(e)}$-autoparallel. Here, the constant $A$, which is independent of $\omega\in\Omega$, is defined by

$$A:=\log\Big\{\sum_{\omega\in\Omega}\exp\{t\log p_{1}(\omega)+(1-t)\log p_{2}(\omega)\}\Big\}^{-1}.$$
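The two closure conditions of Theorem 2.6 can be sketched numerically. The Python snippet below forms the mixture $tp_{1}+(1-t)p_{2}$ and the normalized log-linear combination of two distributions, and applies them to two product distributions on a $2\times 2$ alphabet (flattened row by row). The function names and the example numbers are assumptions chosen for illustration only.

```python
import math


def m_mixture(p1, p2, t):
    """Point on the m-geodesic: the ordinary convex combination."""
    return [t * a + (1.0 - t) * b for a, b in zip(p1, p2)]


def e_mixture(p1, p2, t):
    """Point on the e-geodesic: log-linear combination, then normalization.

    The normalizer z equals exp(-A), with A the constant of Theorem 2.6.
    """
    unnormalized = [math.exp(t * math.log(a) + (1.0 - t) * math.log(b))
                    for a, b in zip(p1, p2)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def product(qx, ry):
    """Flattened product distribution q(x) * r(y)."""
    return [a * b for a in qx for b in ry]


# Two product distributions on a 2 x 2 alphabet (illustrative values).
p1 = product([0.2, 0.8], [0.5, 0.5])
p2 = product([0.6, 0.4], [0.9, 0.1])

e_mid = e_mixture(p1, p2, 0.5)
m_mid = m_mixture(p1, p2, 0.5)

# A 2 x 2 distribution is a product iff p(0,0) p(1,1) = p(0,1) p(1,0).
e_gap = abs(e_mid[0] * e_mid[3] - e_mid[1] * e_mid[2])
m_gap = abs(m_mid[0] * m_mid[3] - m_mid[1] * m_mid[2])
```

In this example the log-linear combination stays a product distribution (`e_gap` is zero up to rounding) while the convex combination does not, which matches the fact that the set of product distributions is closed under the second operation but not the first.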

To prove Theorem 2.6, we need the following lemma.

Lemma 2.7.

[3, Theorem 3.7.3] Let $m,n\in\mathbb{N}$ with $0<m\leq n$. Let $N$ be an $n$-dimensional manifold that is flat with respect to an affine connection $\nabla$, and assume that there exists a global affine coordinate system $(x^{i})_{1\leq i\leq n}$ of $N$. A necessary and sufficient condition for an $m$-dimensional submanifold $M$ of $N$ to be autoparallel is that there exist a local coordinate system $(\xi^{a})_{1\leq a\leq m}$ of $M$, an $(n\times m)$-matrix $A$ with $\mathrm{rank}\,A=m$ and $b\in\mathbb{R}^{n}$ which satisfy

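$$\begin{pmatrix}x^{1}\\ \vdots\\ x^{m}\\ x^{m+1}\\ \vdots\\ x^{n}\end{pmatrix}=A\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix}+b.$$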

Proof of Theorem 2.6.

Assume that $M$ satisfies the condition on (2). Then $M$ is convex with respect to the $\nabla^{(m)}$-affine coordinate system $(\eta_{i})^{|\Omega|-1}_{i=1}$. Let $m:=\dim M$ and fix $p\in M$. Take $p_{1}(\neq p)\in M$. If $m\geq 2$, we can take $p_{2}\in M$ such that $p_{1}-p$ and $p_{2}-p$ are linearly independent. Repeating this, we can take $p_{1},\ldots,p_{m}\in M$ such that $p_{1}-p,\ldots,p_{m}-p$ are linearly independent. Define the “hyperplane” (with respect to $(\eta_{i})^{|\Omega|-1}_{i=1}$) $T(p,p_{1},\ldots,p_{m})$ by

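$$T(p,p_{1},\ldots,p_{m}):=\left\{\hat{p}\in\mathcal{S}\ \middle|\ \hat{p}=p+\left[p_{1}-p,\ldots,p_{m}-p\right]\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix},\ (\xi^{a})_{a=1}^{m}\in\mathbb{R}^{m}\right\}.$$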

Since $p_{1}-p,...,p_{m}-p$ are linearly independent, we can see that $rank\left[p_{1}-p,...,p_{m}-p\right]=m$ . Noting that $M$ is a submanifold of $T(p,p_{1},...,p_{m})$ and $dim(M)=dim(T(p,p_{1},...,p_{m}))$ , we can see that there exists a local coordinate system $(\xi^{a})_{1\leq a\leq m}$ of $M$ such that

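$$\begin{pmatrix}\eta^{1}\\ \vdots\\ \eta^{m}\\ \eta^{m+1}\\ \vdots\\ \eta^{n}\end{pmatrix}=\left[p_{1}-p,\ldots,p_{m}-p\right]\begin{pmatrix}\xi^{1}\\ \vdots\\ \xi^{m}\end{pmatrix}+p$$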

holds. From Lemma 2.7, we can see that $M$ is $\nabla^{(m)}$-autoparallel. The latter half is proved in the same way, using the $\nabla^{(e)}$-affine coordinate system $(\theta_{j})^{|\Omega|-1}_{j=1}$. ∎

3 Channel capacity and Arimoto algorithm

In this paper, let $\Omega_{i}$ ( $i=1,2$ ) be finite sets, $\mathcal{S}_{i}$ be the sets of all probability distributions on $\Omega_{i}$ . Namely,

$$\mathcal{S}_{i}:=\Big\{p:\Omega_{i}\to\mathbb{R}_{++};\ \sum_{x\in\Omega_{i}}p(x)=1\Big\}\quad(i=1,2),$$

where $\mathbb{R}_{++}:=\{x\in\mathbb{R};x>0\}$ . Similarly, let $\mathcal{S}_{3}$ be the set consisting of all probability distributions on $\Omega_{1}\times\Omega_{2}$ .

A memoryless channel is expressed by a system where, for an input symbol $x\in\Omega_{1}$ , an output symbol $y\in\Omega_{2}$ is determined at random.

Definition 3.1.

A channel is defined by a triple $(\Omega_{1},r(y|x),\Omega_{2})$ of finite sets $\Omega_{1},\Omega_{2}$ and a map $\Omega_{1}\ni x\mapsto r(\cdot|x)\in\mathcal{S}_{2}$.

Definition 3.2.

We call the map $I:\mathcal{S}_{3}\rightarrow\mathbb{R}$ defined by

$$I(p(x,y)):=D(p(x,y)||q(x)\cdot r(y))$$

the mutual information. Here, $q(x)$ and $r(y)$ denote the marginal distributions of $p(x,y)$ on $\Omega_{1}$ and $\Omega_{2}$, respectively.
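The definition of the mutual information as a Kullback-Leibler divergence from the joint distribution to the product of its marginals can be sketched directly in Python. The nested-list representation of $p(x,y)$ (rows indexed by $x$, columns by $y$) and the function names are assumptions made for this example; natural logarithms are used, so values are in nats.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for two equal-length lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def marginals(joint):
    """Marginals q(x) (row sums) and r(y) (column sums) of a joint table."""
    qx = [sum(row) for row in joint]
    ry = [sum(col) for col in zip(*joint)]
    return qx, ry


def mutual_information(joint):
    """I(p) = D(p(x, y) || q(x) * r(y)) with q and r the marginals of p."""
    qx, ry = marginals(joint)
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in qx for b in ry]
    return kl(flat_joint, flat_product)
```

For an independent joint distribution the value is zero, and it grows as the joint distribution moves away from the product of its marginals.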

Definition 3.3.

Given a channel $(\Omega_{1},r(y|x),\Omega_{2})$ , the channel capacity is defined by

$$C:=\sup_{q(x)\in\mathcal{S}_{1}}I(q(x)\cdot r(y|x)).$$

The Arimoto algorithm updates $q^{(t)}(x)\in\mathcal{S}_{1}$ to

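$$q^{(t+1)}(x):=\frac{q^{(t)}(x)\exp\{D(r(y|x)||r^{(t)}(y))\}}{\sum_{x'}q^{(t)}(x')\exp\{D(r(y|x')||r^{(t)}(y))\}},\tag{5}$$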

where $r^{(t)}(y)$ means the marginal distribution of $q^{(t)}(x)\cdot r(y|x)$ . It is known that, by using this algorithm, $I(q^{(t)}(x)\cdot r(y|x))$ monotonically increases and converges to the channel capacity [4, Theorem 2].
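The update rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the channel is given as a list of rows `channel[x][y]` $=r(y|x)$ with strictly positive entries, the iteration count is fixed rather than chosen by a stopping criterion, and all function names are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def output_marginal(q, channel):
    """r_q(y) = sum_x q(x) r(y|x)."""
    return [sum(q[x] * channel[x][y] for x in range(len(q)))
            for y in range(len(channel[0]))]


def arimoto_step(q, channel):
    """One update: reweight q(x) by exp{D(r(.|x) || r_q)} and renormalize."""
    r = output_marginal(q, channel)
    weights = [q[x] * math.exp(kl(channel[x], r)) for x in range(len(q))]
    total = sum(weights)
    return [w / total for w in weights]


def mutual_information(q, channel):
    """I(q(x) r(y|x)) = sum_x q(x) D(r(.|x) || r_q)."""
    r = output_marginal(q, channel)
    return sum(q[x] * kl(channel[x], r) for x in range(len(q)))


def arimoto(channel, q0=None, iterations=200):
    """Run a fixed number of updates; returns the input law and its mutual information."""
    n = len(channel)
    q = list(q0) if q0 is not None else [1.0 / n] * n
    for _ in range(iterations):
        q = arimoto_step(q, channel)
    return q, mutual_information(q, channel)
```

For a binary symmetric channel with crossover probability $\varepsilon$, the returned value can be compared with the known capacity $\log 2+\varepsilon\log\varepsilon+(1-\varepsilon)\log(1-\varepsilon)$ in nats.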

4 Information geometric view of channel capacity in $\mathcal{S}_{3}$

Let us try to characterize the channel capacity from the information geometric point of view. In [6] and [4], the channel capacity is discussed in $\mathcal{S}_{2}$. Let us review the outline. A probability distribution that attains the channel capacity satisfies the following interesting condition:

Theorem 4.1.

[4, Lemma 1] [6, p.554–555] Assume that a probability distribution $\hat{q}(x)\in\mathcal{S}_{1}$ attains the channel capacity $C$. Then $\hat{q}(x)$ satisfies the following condition:

$$D(r(y|x)||r_{\hat{q}}(y))=C\quad(\forall x\in\Omega_{1}),\tag{6}$$

where $r_{\hat{q}}(y)$ denotes the marginal distribution of $\hat{q}(x)\cdot r(y|x)$ on $\Omega_{2}$ . Conversely, if there exist $\hat{C}\geq 0$ and $\hat{q}\in\mathcal{S}_{1}$ satisfying

$$D(r(y|x)||r_{\hat{q}}(y))=\hat{C}\quad(\forall x\in\Omega_{1}),\tag{7}$$

then $\hat{C}$ and $\hat{q}(x)$ are the channel capacity and a probability distribution that attains the channel capacity, respectively.

The proof is given in Section 7.1 for convenience. Theorem 4.1 tells us that, from the information geometric view of $\mathcal{S}_{2}$, the output distribution $r_{\hat{q}}(y)$ that attains the channel capacity is a “circumcenter” of the polyhedron spanned by $\{r(y|x)\}_{x\in\Omega_{1}}$, with the channel capacity $C$ as the common “radius”.
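The converse direction of Theorem 4.1 can be illustrated for a binary-input channel: find the input weight at which the two divergences $D(r(y|x)||r_{q}(y))$ coincide, then compare the common value with the mutual information at other inputs. The sketch below uses bisection, which relies on the assumption (true for two inputs with distinct, strictly positive rows) that the gap between the two divergences decreases as weight moves to the first input. The function names and the example channel are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def output_marginal(a, channel):
    """Output law when input 0 has probability a and input 1 has 1 - a."""
    return [a * u + (1.0 - a) * v for u, v in zip(channel[0], channel[1])]


def divergence_gap(a, channel):
    """D(r(.|0) || r_q) - D(r(.|1) || r_q) for q = (a, 1 - a)."""
    r = output_marginal(a, channel)
    return kl(channel[0], r) - kl(channel[1], r)


def equidistant_input(channel, tol=1e-12):
    """Bisection for the weight a at which both divergences coincide."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if divergence_gap(mid, channel) > 0:
            lo = mid  # output law is still too far from row 0: add weight to input 0
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mutual_information(a, channel):
    """I(q(x) r(y|x)) for q = (a, 1 - a)."""
    r = output_marginal(a, channel)
    return a * kl(channel[0], r) + (1.0 - a) * kl(channel[1], r)


channel = [[0.8, 0.2], [0.3, 0.7]]
a_hat = equidistant_input(channel)
c_hat = kl(channel[0], output_marginal(a_hat, channel))
```

Here `c_hat` plays the role of $\hat{C}$ in (7): no other input weight gives a larger mutual information, consistent with the “circumcenter” picture.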

[8] refers to an information geometric interpretation of Arimoto algorithm in $\mathcal{S}_{2}$ , using the result of Theorem 4.1:

Given a current guess $p^{(t)}(x)$, we should check the Kullback-Leibler divergences $D(r(y|j)||r_{q^{(t)}}(y))$ and move the output distribution closer to those $r(y|x)$ for which $D(r(y|j)||r_{q^{(t)}}(y))$ is large. This can be achieved by increasing the respective weights $p^{(t)}(j)$, consistent with the recursion (5) that increases (decreases) those input probabilities for which $\exp\{D(r(y|j)||r_{q^{(t)}}(y))\}$ is above (below) the average $\sum_{x}q^{(t)}(x)\exp\{D(r(y|x)||r^{(t)}(y))\}$.

Although the explanation seems intuitively valid, it does not seem to succeed in accurately describing the behavior of $r_{q^{(t)}}(y)$ in $\mathcal{S}_{2}$ as $t$ is updated. Therefore, in our opinion, further research is needed to reveal the information geometric view of the Arimoto algorithm.

In this section, we reconsider the information geometric view of the channel capacity in $\mathcal{S}_{3}$ . We may be able to see some interesting structure in $\mathcal{S}_{3}$ which is hidden in $\mathcal{S}_{2}$ .

Define subsets $M$ and $E$ of $\mathcal{S}_{3}$ by

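$$M:=\{q(x)\cdot r(y|x);\ q(x)\in\mathcal{S}_{1}\},\qquad E:=\{q(x)\cdot r(y);\ q(x)\in\mathcal{S}_{1},\ r(y)\in\mathcal{S}_{2}\}.$$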

From Theorem 2.6, we can see that $M$ is $\nabla^{(m)}$ -autoparallel and $E$ is $\nabla^{(e)}$ -autoparallel.

Lemma 4.2.

For $p(x,y)\in\mathcal{S}_{3}$ , the $\nabla^{(m)}$ -projection of $p$ onto $E$ is $q(x)\cdot r(y)$ , where $q(x)$ and $r(y)$ are defined by

$$q(x):=\sum_{y\in\Omega_{2}}p(x,y),\qquad r(y):=\sum_{x\in\Omega_{1}}p(x,y),$$

that is, $q(x)$ and $r(y)$ are the marginal distributions of $p(x,y)$ .

The proof is given in Section 7.2. By utilizing Lemma 4.2, the channel capacity $C$ is expressed as follows:

$$C=\sup_{p(x,y)\in M}D(p(x,y)||\Pi^{(m)}(p(x,y))),\tag{8}$$

where $\Pi^{(m)}(p(x,y))$ means the $\nabla^{(m)}$ -projection of $p(x,y)$ onto $E$ . The formula (8) says that, from the viewpoint of geometry in $\mathcal{S}_{3}$ , the channel capacity C is the longest “distance” (between $p(x,y)$ and $\Pi^{(m)}(p(x,y))$ ) from $M$ to $E$ (Fig. 1).
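Lemma 4.2 can be checked numerically on a small example: for a joint distribution $p$ and the product $q\cdot r$ of its marginals, the divergence to any other product distribution splits into two non-negative parts, so $q\cdot r$ is the closest point of $E$. The Python sketch below uses a flattened $2\times 2$ table; the numbers and function names are illustrative assumptions.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def marginals(flat, nx, ny):
    """Marginals of a joint table stored row by row as a flat list."""
    qx = [sum(flat[x * ny + y] for y in range(ny)) for x in range(nx)]
    ry = [sum(flat[x * ny + y] for x in range(nx)) for y in range(ny)]
    return qx, ry


def product(qx, ry):
    """Flattened product distribution q(x) * r(y)."""
    return [a * b for a in qx for b in ry]


p = [0.30, 0.10, 0.15, 0.45]              # a joint distribution on 2 x 2
qx, ry = marginals(p, 2, 2)
projection = product(qx, ry)              # product of the marginals of p
other = product([0.7, 0.3], [0.2, 0.8])   # an arbitrary other element of E

lhs = kl(p, other)
rhs = kl(p, projection) + kl(projection, other)
```

The equality of `lhs` and `rhs` is the generalized Pythagorean relation at the projection, and `kl(p, projection)` is the mutual information of `p`.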

5 Backward em-algorithm

In Section 4, we revealed an information geometric structure of the channel capacity in $\mathcal{S}_{3}$: the channel capacity is the supremum of the Kullback-Leibler divergence from $M$ to $E$. Therefore, if we can construct an algorithm that monotonically increases the Kullback-Leibler divergence, we can expect it to be useful for evaluating the channel capacity.

An algorithm that monotonically decreases the Kullback-Leibler divergence is well known as “the em-algorithm” [10]. Then how can we increase the Kullback-Leibler divergence? A natural candidate is to project onto a $\nabla^{(m)}$- ($\nabla^{(e)}$-) autoparallel submanifold along a $\nabla^{(m)}$- ($\nabla^{(e)}$-) geodesic. But since such a projection is only a critical point of the Kullback-Leibler divergence, it may sometimes decrease the Kullback-Leibler divergence. Hence, an algorithm based on this idea does not necessarily increase the Kullback-Leibler divergence steadily or converge to the channel capacity $C$.

To overcome this difficulty, let us try rewinding the em-algorithm, just like rewinding a film!

Definition 5.1.

Define $\mathcal{S}_{3}$, $M$ and $E$ in the same way as in Sections 3 and 4. For $q^{(t)}(x)\cdot r(y|x)=:p^{(t)}(x,y)\in M$, update to $q^{(t+1)}(x)\cdot r(y|x)=:p^{(t+1)}(x,y)\in M$ as follows:

Backward e-step.

Search $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E$ such that the unique $\nabla^{(e)}$ -projection from $q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ onto M is $p^{(t)}(x,y)$ .

Backward m-step.

Search $q^{(t+1)}(x)\cdot r(y|x)\in M$ such that the unique $\nabla^{(m)}$ -projection from $q^{(t+1)}(x)\cdot r(y|x)$ onto E is $q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ .

We call this algorithm “the Backward em-algorithm” (See Fig.1).

Theorem 5.2.

By using the Backward em-algorithm, $I(p^{(t)}(x,y))$ does not decrease as $p^{(t)}(x,y)$ is updated. Namely, the following inequality

$$I(p^{(t)}(x,y))\leq I(p^{(t+1)}(x,y))$$

holds.

Proof.

[TABLE]

Note that the second and third equalities follow from the generalized Pythagorean theorem. ∎
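One way to see the monotonicity, sketched here under the assumption that both backward steps can be carried out, is to write $e_{t+1}:=q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ (a shorthand introduced only for this sketch) and apply the generalized Pythagorean theorem at $p^{(t)}$. The chain below is a reconstruction consistent with Definition 5.1, Theorem 2.2 and Theorem 2.5; it is not necessarily the exact sequence of equalities used in the proof.

```latex
\begin{align*}
I(p^{(t+1)})
  &= D\bigl(p^{(t+1)}\,\|\,\Pi^{(m)}(p^{(t+1)})\bigr)
   = D\bigl(p^{(t+1)}\,\|\,e_{t+1}\bigr)
   && \text{(Backward m-step)}\\
  &= D\bigl(p^{(t+1)}\,\|\,p^{(t)}\bigr) + D\bigl(p^{(t)}\,\|\,e_{t+1}\bigr)
   && \text{(Theorem 2.2 at } p^{(t)}=\Pi^{(e)}(e_{t+1})\text{)}\\
  &\geq D\bigl(p^{(t)}\,\|\,e_{t+1}\bigr)\\
  &\geq \min_{e\in E} D\bigl(p^{(t)}\,\|\,e\bigr)
   = D\bigl(p^{(t)}\,\|\,\Pi^{(m)}(p^{(t)})\bigr)
   = I(p^{(t)}).
   && \text{(Theorem 2.5)}
\end{align*}
```

The second line uses that the $\nabla^{(m)}$-geodesic connecting $p^{(t+1)}$ and $p^{(t)}$ stays in the $\nabla^{(m)}$-autoparallel submanifold $M$, while the $\nabla^{(e)}$-geodesic connecting $p^{(t)}$ and $e_{t+1}$ is orthogonal to $M$ at $p^{(t)}$ by the Backward e-step.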

Although we have defined the Backward em-algorithm, it is not yet clear whether there exists any probability distribution $q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ which satisfies $\Pi^{(e)}(q_{(t+1)}(x)\cdot r_{(t+1)}(y))=p^{(t)}(x,y)$ for a given probability distribution $p^{(t)}(x,y)\in M$. Therefore it is not trivial that the Backward e-step can be carried out. For $p^{(t)}(x,y)\in M$, does there exist a probability distribution $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E$ which satisfies $\Pi^{(e)}(q_{(t+1)}(x)\cdot r_{(t+1)}(y))=p^{(t)}(x,y)$? And if so, can we write $q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ explicitly? The following theorem answers both questions positively.

Theorem 5.3.

Let $q^{(t)}(x)\cdot r(y|x)\in M$ . Then the following two statements for $q(x)\in\mathcal{S}_{1}$ and $r(y)\in\mathcal{S}_{2}$ are equivalent:

1. $q(x)\cdot r(y)\in E$ satisfies

$$\Pi^{(e)}(q(x)\cdot r(y))=q^{(t)}(x)\cdot r(y|x),\tag{9}$$

where $\Pi^{(e)}(q(x)\cdot r(y))$ denotes the $\nabla^{(e)}$ -projection from $q(x)\cdot r(y)$ onto $M$ .

2. $q(x)\cdot r(y)\in E$ satisfies

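$$q(x)=\frac{q^{(t)}(x)\exp\{D(r(y|x)||r(y))\}}{\sum_{x'\in\Omega_{1}}q^{(t)}(x')\exp\{D(r(y|x')||r(y))\}}.\tag{10}$$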

Proof.

Fix $q(x)\cdot r(y)$ contained in $E$ . Define $L:{\mathbb{R}^{n}_{++}}\times\mathbb{R}\rightarrow\mathbb{R}$ by

[TABLE]

Noting that

[TABLE]

we see that (9) is equivalent to the following:

[TABLE]

Observing that

[TABLE]

we can see that

[TABLE]

which concludes the proof. ∎
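The equivalence in Theorem 5.3 can be illustrated with a round trip. The sketch below assumes that condition (10) pairs a given $r(y)$ with $q(x)\propto q^{(t)}(x)\exp\{D(r(y|x)||r(y))\}$, and that the $\nabla^{(e)}$-projection of $q(x)\cdot r(y)$ onto $M$ is the minimizer of $D(q'(x)\cdot r(y|x)\,||\,q(x)\cdot r(y))$ over $q'$, which a Lagrange-multiplier computation gives as $q'(x)\propto q(x)\exp\{-D(r(y|x)||r(y))\}$. Both closed forms, the function names and the example numbers are assumptions of this sketch.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def backward_e_candidate(q_t, r, channel):
    """Input law q paired with r: q(x) proportional to q_t(x) exp{+D(r(.|x) || r)}."""
    return normalize([q_t[x] * math.exp(kl(channel[x], r)) for x in range(len(q_t))])


def e_projection_onto_M(q, r, channel):
    """Input law of the minimizer over M: q'(x) proportional to q(x) exp{-D(r(.|x) || r)}."""
    return normalize([q[x] * math.exp(-kl(channel[x], r)) for x in range(len(q))])


def divergence_to_product(q_in, q, r, channel):
    """D(q_in(x) r(y|x) || q(x) r(y)), split into an input part and a channel part."""
    return sum(q_in[x] * (math.log(q_in[x] / q[x]) + kl(channel[x], r))
               for x in range(len(q_in)))


channel = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]  # rows are r(.|x)
q_t = [0.2, 0.5, 0.3]                           # current input law q^(t)
r = [0.4, 0.6]                                  # an arbitrary output law

q = backward_e_candidate(q_t, r, channel)
recovered = e_projection_onto_M(q, r, channel)  # should reproduce q_t
```

Any choice of $r(y)$ yields a valid candidate this way, which reflects the fact that the Backward e-step has a whole family of solutions rather than a single one.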

From Theorem 5.3, we can deduce the following interesting theorem.

Theorem 5.4.

The subset $E^{(t)}$ of $E$ defined by

[TABLE]

is $\nabla^{(e)}$ -autoparallel.

Proof.

It suffices to prove that, for any $q_{1}(x)\cdot r_{1}(y)$ and $q_{2}(x)\cdot r_{2}(y)$ contained in $E^{(t)}$ and any $t$ with $0\leq t\leq 1$ , there exists $q_{3}(x)\cdot r_{3}(y)$ contained in $E^{(t)}$ satisfying

[TABLE]

where the normalization term $\Phi_{1}(t)$ is defined by

[TABLE]

Calculating the left-hand side (LHS) of (11), we obtain

(LHS) $=t\log q_{1}(x)+(1-t)\log q_{2}(x)+t\log r_{1}(y)+(1-t)\log r_{2}(y)$ .

Let us calculate $t\log q_{1}(x)+(1-t)\log q_{2}(x)$ . Noting that the pairs ( $q_{1}(x),r_{1}(y)$ ) and ( $q_{2}(x),r_{2}(y)$ ) satisfy (10),

[TABLE]

where the normalization factors $\Phi_{2}(t)$ and $\Phi_{3}(t)$ are defined by

[TABLE]

Define $r_{3}(y)\in\mathcal{S}_{2}$ by

[TABLE]

where

[TABLE]

Then we can see that $t\,D(r(y|x)||r_{1}(y))+(1-t)\,D(r(y|x)||r_{2}(y))$ can be rewritten as follows by using $r_{3}(y)$ defined by (12):

[TABLE]

Set $q_{3}(x)\in\mathcal{S}_{1}$ by

[TABLE]

where $\Phi_{5}(t):=\left\{\sum_{x\in\Omega_{1}}q^{(t)}(x)\exp\{D(r(y|x)||r_{3}(y))\}\right\}^{-1}$. Then, we obtain

[TABLE]

Hence

[TABLE]

holds, which concludes the proof. ∎

Theorem 5.3 and Theorem 5.4 tell us that, for any probability distribution $p^{(t)}(x,y)\in M$ , we can carry out the Backward e-step and the set $E^{(t)}$ of candidates $q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ for the Backward e-step is an exponential family.

Next, let us consider whether or not we can carry out the Backward m-step. Which element of $E^{(t)}$ should we choose to carry out the Backward m-step? That is, under what conditions on $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E^{(t)}$ does there exist $p^{(t+1)}(x,y)\in M$ such that $\Pi^{(m)}(p^{(t+1)}(x,y))=q_{(t+1)}(x)\cdot r_{(t+1)}(y)$ holds?

To investigate this question, let $\Pi^{(m)}(M)$ be the image of $M$ in $E$ under the $\nabla^{(m)}$-projection. Assume that $\Pi^{(m)}(M)$ and $E^{(t)}$ intersect (the existence and uniqueness of an intersection are discussed in Section 6). Let $\hat{q}_{(t+1)}(x)\cdot\hat{r}_{(t+1)}(y)\in\Pi^{(m)}(M)\bigcap E^{(t)}$. Then, for $\hat{q}_{(t+1)}(x)\cdot\hat{r}_{(t+1)}(y)$, we can carry out the Backward m-step. Conversely, assume that, for $\hat{q}_{(t+1)}(x)\cdot\hat{r}_{(t+1)}(y)\in E^{(t)}$, we can carry out the Backward m-step. Then, $\hat{q}_{(t+1)}(x)\cdot\hat{r}_{(t+1)}(y)\in\Pi^{(m)}(M)\bigcap E^{(t)}$. Hence, the problem of finding $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E^{(t)}$ for which the Backward m-step can be carried out is equivalent to that of finding an intersection of $\Pi^{(m)}(M)$ with $E^{(t)}$.

The element $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E^{(t)}$ depends only on $r_{(t+1)}(y)$ because, for a given $r_{(t+1)}(y)$, the requirement that $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E^{(t)}$ determines $q_{(t+1)}(x)$ by the equation (10). Therefore, from now on, we regard $q_{(t+1)}(x)$ as a function of $r_{(t+1)}(y)$ determined by the requirement that $q_{(t+1)}(x)\cdot r_{(t+1)}(y)\in E^{(t)}$ ($i.e.$, the equation (10)). Taking this into consideration, we look for the condition on $r_{(t+1)}(y)$ such that

[TABLE]

holds. Noting that $\Pi^{(m)}(q^{(t+1)}(x)\cdot r(y|x))=q^{(t+1)}(x)\cdot r_{q^{(t+1)}}(y)$ (see Lemma 4.2), where $r_{q^{(t+1)}}(y)$ denotes the marginal distribution of $q^{(t+1)}(x)\cdot r(y|x)$ on $\Omega_{2}$, the condition (14) on $r_{(t+1)}(y)$ is equivalent to the following condition:

[TABLE]

The above condition comes down to solving the following nonlinear equation with respect to $r_{(t+1)}(y)$ :

[TABLE]

Rewriting this as

[TABLE]

we see that it is difficult to solve the nonlinear equation (16) with respect to $r_{(t+1)}(y)$ . If we can solve the equation (16), we can prove that $I(p^{(t)}(x,y))$ converges to the channel capacity $C$ . The proof is given in Section 7.4.

As it is difficult to solve the equation (16) with respect to $r_{(t+1)}(y)$, we approximate (16) by an equation that we can solve. A good approach is to approximate $\exp(D(r(y|x)||r_{(t+1)}(y)))$ by some value that is independent of $x$, since it then becomes a constant. It seems reasonable to replace $r(y|x)$ in $\exp(D(r(y|x)||r_{(t+1)}(y)))$ by the “circumcenter” $r^{*}(y)$ of the figure spanned by $\{r(y|x)\}_{x\in\Omega_{1}}$ in $\mathcal{S}_{2}$, that is, the probability distribution in $\mathcal{S}_{2}$ that attains the channel capacity (see Theorem 4.1). Then, observing that the approximated factor $\exp(D(r^{*}(y)||r_{(t+1)}(y)))$ is independent of $x$, (14) is rewritten as

[TABLE]

which can be solved. The merit of this approximation is that $r^{*}(y)$ has disappeared from the equation (17). Namely, even if we do not know $r^{*}(y)$, we can solve (17). Since the solution of (17) is $r_{(t+1)}(y)=r_{q^{(t)}}(y)$, the approximation selects the element $q_{(t+1)}(x)\cdot r_{q^{(t)}}(y)$ of $E^{(t)}$, where $q_{(t+1)}(x)\in\mathcal{S}_{1}$ is defined by

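$$q_{(t+1)}(x):=\frac{q^{(t)}(x)\exp\{D(r(y|x)||r_{q^{(t)}}(y))\}}{\sum_{x'\in\Omega_{1}}q^{(t)}(x')\exp\{D(r(y|x')||r_{q^{(t)}}(y))\}}.$$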

In the present paper, we call the approximation the approximate Backward e-step.

By the above approximation, we can solve the equation (16). But, in return for the approximation, $q_{(t+1)}(x)\cdot r_{q^{(t)}}(y)$ is not necessarily an intersection of $\Pi^{(m)}(M)$ with $E^{(t)}$ , and therefore we need to approximate the Backward m-step too. In the present paper, we approximate the Backward m-step by the $\nabla^{(m)}$ -projection of $q_{(t+1)}(x)\cdot r_{q^{(t)}}(y)$ onto $M$ . A short computation shows that

[TABLE]

The proof is given in Section 7.3. We call this approximation the approximate Backward m-step.

Combining the approximate Backward e- and m-steps, $q^{(t)}(x)\cdot r(y|x)$ is updated to $q_{(t+1)}(x)\cdot r(y|x)$, and therefore $q^{(t)}(x)$ is updated to $q_{(t+1)}(x)$, which is nothing but the Arimoto algorithm (Fig. 2).
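The decomposition of one Arimoto update into the two approximate backward steps can be sketched as follows. The code assumes that the approximate Backward e-step picks $r_{(t+1)}=r_{q^{(t)}}$ and reweights $q^{(t)}$ accordingly, and that the approximate Backward m-step returns the joint distribution $q_{(t+1)}(x)\cdot r(y|x)$ in $M$. The function names and the example channel are hypothetical, and the loop length is arbitrary.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def output_marginal(q, channel):
    """r_q(y) = sum_x q(x) r(y|x)."""
    return [sum(q[x] * channel[x][y] for x in range(len(q)))
            for y in range(len(channel[0]))]


def approximate_backward_e_step(q_t, channel):
    """Select the element q_next(x) * r_t(y) of E, with r_t the output law of q_t."""
    r_t = output_marginal(q_t, channel)
    weights = [q_t[x] * math.exp(kl(channel[x], r_t)) for x in range(len(q_t))]
    total = sum(weights)
    return [w / total for w in weights], r_t


def approximate_backward_m_step(q_next, channel):
    """Return the joint table q_next(x) * r(y|x), an element of M."""
    return [[q_next[x] * v for v in channel[x]] for x in range(len(q_next))]


def mutual_information(q, channel):
    """I(q(x) r(y|x)) = sum_x q(x) D(r(.|x) || r_q)."""
    r = output_marginal(q, channel)
    return sum(q[x] * kl(channel[x], r) for x in range(len(q)))


channel = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
q = [0.6, 0.3, 0.1]
history = [mutual_information(q, channel)]
for _ in range(30):
    q, _r_t = approximate_backward_e_step(q, channel)
    joint = approximate_backward_m_step(q, channel)
    history.append(mutual_information(q, channel))
```

The list `history` records the mutual information after each combined step; since the combined step coincides with the Arimoto update, the recorded values should be non-decreasing, in line with [4, Theorem 2].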

6 Concluding Remarks

In the present paper, we investigated the channel capacity from the information geometric point of view in $\mathcal{S}_{3}$. Then, we introduced a new algorithm that monotonically increases the Kullback-Leibler divergence, “the Backward em-algorithm.” The Backward e-step can be determined, but the Backward m-step cannot. Hence, we approximated both steps, and the resulting approximate algorithm coincides with the Arimoto algorithm.

Many open problems remain. First, the existence and uniqueness of an intersection of $\Pi^{(m)}(M)$ with $E^{(t)}$ should be studied. To investigate this problem, we may consider the existence and uniqueness of a solution of the equation (16). If we can prove that a solution of the equation (16) exists, even if we cannot solve it explicitly, we may be able to introduce other approximations of the Backward e- and m-steps and accelerate the Arimoto algorithm.

It also seems interesting to apply the Backward em-algorithm to other subjects. To our knowledge, there has been no algorithm that monotonically increases the Kullback-Leibler divergence. The Backward em-algorithm can be used when we want to increase the Kullback-Leibler divergence between two manifolds. For example, in independent component analysis and machine learning, we often need to increase the mutual information (e.g., [12, 13]). In these situations, the Backward em-algorithm may work well because, from the information geometric point of view, the mutual information is the Kullback-Leibler divergence between two manifolds.

Acknowledgments

The author gratefully acknowledges the continuous encouragement from Toru Ohira and Hideyuki Ishi. The author also thanks Professors Amor Keziou, Hiroshi Matsuzoe, Masahito Hayashi, Shiro Ikeda and Phillippe Regnault for their helpful discussions and comments.

7 Appendices

7.1 Proof of Theorem 4.1

(First half): Define a function $L:{\mathbb{R}^{m}_{++}}\times\mathbb{R}\rightarrow\mathbb{R}$ by

[TABLE]

where $\lambda$ means a Lagrange multiplier. For the mutual information to take a maximum point at $\hat{q}\in{\mathcal{S}_{1}}$ , it is necessary that

[TABLE]

Since ${\partial L}/{\partial q_{i}}=D(r(y|i)||r_{q}(y))-1+\lambda$ , (19) is rewritten as

[TABLE]

and it follows immediately that $1-\hat{\lambda}$ corresponds to the channel capacity $C$ and the relation (6) holds.

(Second half): The minimax redundancy, defined by

[TABLE]

coincides with the channel capacity [5, Theorem 13.1.1], where $r_{q}(y)$ is the marginal distribution of $q(x)\cdot r(y|x)$. Since, for $\hat{q}\in\mathcal{S}_{1}$ satisfying (7), the equality $\max_{x\in\Omega_{1}}D(r(y|x)||r_{\hat{q}}(y))=\hat{C}$ holds, it follows that $C\leq\hat{C}$. Noting that $\hat{q}$ satisfies $I(\hat{q}(x)\cdot r(y|x))=\hat{C}$ and taking the definition of the channel capacity into consideration, it also follows that $C\geq\hat{C}$, and therefore, $C=\hat{C}$.

7.2 Proof of Lemma 4.2

Take any $\hat{q}\cdot\hat{r}$ contained in $E$ . Then

[TABLE]

holds and the lower bound [math] is attained if and only if $\hat{q}(x)\cdot\hat{r}(y)=q(x)\cdot r(y)$ (since $D(p_{1}||p_{2})=0\Leftrightarrow p_{1}=p_{2}$ holds [5, p.31]). Observing that the ${\nabla}^{(m)}$-projection $\Pi^{(m)}\left(p(x,y)\right)$ onto the ${\nabla}^{(e)}$-autoparallel submanifold $E$ is characterized by

[TABLE]

we conclude the proof.
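The decomposition behind this proof can be checked numerically. The sketch below assumes that $q(x)\cdot r(y)$ denotes the product of the marginals of $p(x,y)$ and verifies the identity $D(p||\hat{q}\cdot\hat{r})=D(p||q\cdot r)+D(q\cdot r||\hat{q}\cdot\hat{r})$ for one hypothetical joint distribution; the numbers and the helper names `kl` and `product` are illustrative and do not come from the paper.

```python
import math

def kl(p, q):
    """D(p||q) in nats for distributions given as flat lists of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product(q, r):
    """Flattened product distribution q(x) * r(y), row-major in (x, y)."""
    return [qx * ry for qx in q for ry in r]

# Hypothetical joint distribution p(x, y) on a 2 x 3 alphabet (rows index x).
p = [[0.10, 0.25, 0.05],
     [0.30, 0.10, 0.20]]
p_flat = [v for row in p for v in row]

# Marginals q(x) and r(y) of p.
q = [sum(row) for row in p]
r = [sum(p[x][y] for x in range(2)) for y in range(3)]

# An arbitrary other product distribution, i.e. another element of E.
q_hat = [0.7, 0.3]
r_hat = [0.2, 0.5, 0.3]

lhs = kl(p_flat, product(q_hat, r_hat))
rhs = kl(p_flat, product(q, r)) + kl(product(q, r), product(q_hat, r_hat))
print(lhs, rhs)
```

Because the second term on the right-hand side is nonnegative and vanishes only when the two product distributions coincide, the product of the marginals is the unique minimizer of $D(p||\cdot)$ over $E$.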

7.3 Proof of the equation (18)

It suffices to prove the following lemma.

Lemma 7.1.

Let $q(x)\cdot r(y)\in E$ . Then $q(x)\cdot r(y|x)$ is one of the candidates for $\nabla^{(m)}$-projections of $q(x)\cdot r(y)$ onto $M$ .

Note that since $M$ is not $\nabla^{(e)}$-autoparallel, a $\nabla^{(m)}$-projection onto $M$ is not necessarily unique.

Proof.

Take any $\hat{q}(x)\cdot r(y|x)$ contained in $M$ . Then

[TABLE]

holds and the lower bound [math] is attained if and only if $\hat{q}(x)=q(x)$ . Observing that, if

[TABLE]

holds, $p(x,y)$ is one of the candidates for $\nabla^{(m)}$-projections of $q(x)\cdot r(y)$ onto $M$ [2, Theorem 3.10], we conclude the proof. ∎
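The inequality used in the proof of Lemma 7.1 can be recovered by a direct computation. The following sketch relies only on the definition of the Kullback-Leibler divergence and assumes that $r(y|x)>0$ wherever $r(y)>0$; it is a reconstruction, not the author's displayed formula.

```latex
\begin{align*}
D\big(q(x)\cdot r(y)\,||\,\hat{q}(x)\cdot r(y|x)\big)
&= \sum_{x,y} q(x)\,r(y)\log\frac{q(x)\,r(y)}{\hat{q}(x)\,r(y|x)}\\
&= D\big(q(x)\,||\,\hat{q}(x)\big) + \sum_{x} q(x)\,D\big(r(y)\,||\,r(y|x)\big)\\
&\geq \sum_{x} q(x)\,D\big(r(y)\,||\,r(y|x)\big).
\end{align*}
```

Equality holds if and only if $D(q(x)||\hat{q}(x))=0$, that is, if and only if $\hat{q}(x)=q(x)$.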

7.4 Convergence of the Backward em-algorithm

In this section, we assume that equation (16) can be solved and that $p^{(t)}(x,y)$ can be updated to $p^{(t+1)}(x,y)$ any number of times by the Backward em-algorithm.

Theorem 7.2.

$I(p^{(t)})$ converges to the channel capacity $C$ as $p^{(t)}$ is updated.

Lemma 7.3.

Let $q\cdot r\in E^{(t)}$ . Then,

[TABLE]

where $\Phi(t,r):=\sum_{x\in\Omega_{1}}q^{(t)}(x)\exp D(r(y|x)||r(y))$ .

Proof.

[TABLE]

∎

Proof of Theorem 7.2. It suffices to prove that $D(p^{(t)}||p_{(t+1)})$ converges to the channel capacity $C$ . Let $q^{(0)}(x)\in\overline{\mathcal{S}_{1}}$ be a probability distribution that attains the channel capacity $C$ . First, let us prove that

[TABLE]

Calculating $\sum_{x}q^{(0)}(x)\log\{q^{(t+1)}(x)/q^{(t)}(x)\}$ , we obtain

[TABLE]

and therefore we obtain the inequality (20), where $r_{q^{(0)}}(y)$ denotes the marginal distribution of $q^{(0)}(x)\cdot r(y|x)$ on $\Omega_{2}$ . Summing both sides of the inequality (20), we have

[TABLE]

Noting that $0\leq D(q^{(0)}(x)||q^{(1)}(x))<\infty$ and that this quantity is independent of $t$ , we see that the sequence $\{C-D(p^{(t)}||p_{(t+1)})\}_{t=1}^{\infty}$ converges to $0$. ∎
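The convergence statement can be observed on a small example. The sketch below implements the classical Arimoto iteration of [4], $q^{(t+1)}(x)\propto q^{(t)}(x)\exp D(r(y|x)||r_{q^{(t)}}(y))$, whose normalizer has the same form as $\Phi(t,r)$ with $r=r_{q^{(t)}}$; it is not a transcription of the Backward em-algorithm itself. The channel (a Z-channel with flip probability $0.5$), the starting point and the function names are assumptions chosen for illustration. For this channel the equal-divergence condition of Theorem 4.1 is satisfied at $q=(0.6,0.4)$, which gives the capacity $\log 1.25$ in nats.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in nats for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def arimoto(channel, q_init, n_iter):
    """Classical Arimoto iteration; channel[x][y] = r(y|x).

    Returns the final input distribution and the list of mutual
    information values I(q^(t)) recorded before each update.
    """
    q = list(q_init)
    n_in, n_out = len(channel), len(channel[0])
    history = []
    for _ in range(n_iter):
        # Output marginal r_q(y) of the current input distribution.
        r_q = [sum(q[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
        # Divergences D(r(y|x) || r_q(y)) for every input symbol x.
        d = [kl(channel[x], r_q) for x in range(n_in)]
        # Mutual information I(q) = sum_x q(x) D(r(y|x) || r_q(y)).
        history.append(sum(qx * dx for qx, dx in zip(q, d)))
        # Multiplicative update followed by normalization.
        weights = [qx * math.exp(dx) for qx, dx in zip(q, d)]
        phi = sum(weights)
        q = [w / phi for w in weights]
    return q, history

# Assumed example: Z-channel in which input 1 is flipped with probability 0.5.
z_channel = [[1.0, 0.0], [0.5, 0.5]]
q_final, history = arimoto(z_channel, [0.5, 0.5], 200)
print(history[0], history[-1], math.log(1.25), q_final)
```

The recorded mutual information values form a nondecreasing sequence that approaches the capacity from below, in line with Theorem 7.2.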

Bibliography

[1] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, 379–423 and 623–656 (1948).
[2] S. Amari and H. Nagaoka, Methods of Information Geometry (AMS and Oxford, 2000).
[3] A. Fujiwara, Foundations of Information Geometry (Makino Shoten, Tokyo, 2015), in Japanese.
[4] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless channels,” IEEE Trans. Inf. Theory, vol. 18, 14–20 (1972).
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley, 2006).
[6] J. Takeuchi and S. Ikeda, “An Information Geometrical Study on Communication Channel Capacity,” Symposium on Information Theory and its Applications, vol. 33, 553–558 (2010).
[7] K. Nakagawa, K. Watanabe and T. Sabu, “On the Search Algorithm for the Output Distribution That Achieves the Channel Capacity,” IEEE Trans. Inf. Theory, vol. 63, 1043–1062 (2017).
[8] G. Matz and P. Duhamel, “Information geometric formulation and interpretation of accelerated Blahut-Arimoto-Type algorithms,” in Proc. Information Theory Workshop, 24–29, San Antonio, Texas, October (2004).
