Optimization on Product Submanifolds of Convolution Kernels

Mete Ozay; Takayuki Okatani

arXiv:1701.06123 · cs.CV · November 28, 2017

TL;DR

This paper develops a geometry-aware optimization method for CNNs that trains on ensembles of product submanifolds of kernels, improving convergence and classification performance across multiple datasets.

Contribution

It introduces a novel approach for training CNNs on ensembles of product submanifolds of kernels, with a geometry-aware SGD algorithm and convergence analysis.

Findings

1. G-SGD improves training loss and convergence.
2. Classification performance is boosted using ensembles of PEMs.
3. Geometry-aware step size methods enhance CNN training.

Tables

Table 1. Results for Resnet-44 on Cifar-10 with DA.

Model	Class. Error (%)
Euc. [9]	7.17
Euc. [21]	7.16
Euc. $†$	7.05
Sp/Ob/St [21]	6.99/6.89/6.81
Sp/Ob/St $†$	6.84/6.87/6.73
PEMs of Sp/Ob/St	6.81/6.85/6.70
PI for PEMs of Sp/Ob/St	6.82/6.81/6.70
PI (Euc.+Sp/Euc.+St/Euc.+Ob)	6.89/6.84/6.88
PI (Sp+Ob/Sp+St/Ob+St)	6.75/6.67/6.59
PI (Sp+Ob+St/Sp+Ob+St+Euc.)	6.31/6.34
PO for PEMs of Sp/Ob/St	6.77/6.83/6.65
PO (Euc.+Sp/Euc.+St/Euc.+Ob)	6.85/6.78/6.90
PO (Sp+Ob/Sp+St/Ob+St)	6.62/6.59/6.51
PO (Sp+Ob+St/Sp+Ob+St+Euc.)	6.35/6.22
PIO for PEMs of Sp/Ob/St	6.71/6.73/6.61
PIO (Euc.+Sp/Euc.+St/Euc.+Ob)	6.95/6.77/6.82
PIO (Sp+Ob/Sp+St/Ob+St)	6.21/6.19/6.25
PIO (Sp+Ob+St/Sp+Ob+St+Euc.)	5.95/5.92

Table 2. Single-crop validation error rate (%) for Resnet-18 trained on Imagenet.

Model	Top-1 Error (%)
Euc. [21]	30.59
Euc. $†$	30.31
Sp/Ob/St [21]	29.13/28.97/28.14
Sp/Ob/St $†$	28.71/28.83/28.02
PEMs of Sp/Ob/St	28.70/28.77/28.00
PI for PEMs of Sp/Ob/St	28.69/28.75/27.91
PI (Euc.+Sp/Euc.+St/Euc.+Ob)	30.05/29.81/29.88
PI (Sp+Ob/Sp+St/Ob+St)	28.61/28.64/28.49
PI (Sp+Ob+St/Sp+Ob+St+Euc.)	27.63/27.45
PO for PEMs of Sp/Ob/St	28.67/28.81/27.86
PO (Euc.+Sp/Euc.+St/Euc.+Ob)	29.58/29.51/29.90
PO (Sp+Ob/Sp+St/Ob+St)	28.23/28.01/28.17
PO (Sp+Ob+St/Sp+Ob+St+Euc.)	27.81/27.51
PIO for PEMs of Sp/Ob/St	28.64/28.72/27.83
PIO (Euc.+Sp/Euc.+St/Euc.+Ob)	29.19/28.25/28.53
PIO (Sp+Ob/Sp+St/Ob+St)	28.14/27.66/27.90
PIO (Sp+Ob+St/Sp+Ob+St+Euc.)	27.11/27.07

Table 3. Classification error (%) for training 110-layer Resnets with constant depth (RCD) and Resnets with stochastic depth (RSD) using the PIO scheme on Cifar-10 and Cifar-100, with and without using DA.

Model	Cifar-10 w. DA	Cifar-100 w. DA	Cifar-10 w/o DA	Cifar-100 w/o DA
RCD [11]	6.41	27.22	13.63	44.74
Euc. $†$	6.30	27.01	13.57	44.65
Sp/Ob/St [21]	6.22/6.07/5.93	26.44/25.99/25.41	13.11/12.94/12.88	42.51/42.30/40.11
Sp/Ob/St $†$	6.05/6.03/5.91	26.19/25.87/25.39	12.96/12.85/12.79	42.13/42.00/39.94
PEMs of Sp/Ob/St	6.00/6.01/5.86	25.93/25.74/25.18	12.74/12.77/12.74	42.02/42.88/39.90
PIO for PEMs of Sp/Ob/St	5.95/5.91/5.83	25.89/25.71/25.12	12.71/12.72/12.69	41.68/42.75/39.83
PIO (Euc.+Sp/Euc.+St/Euc.+Ob)	6.03/5.99/6.01	25.57/25.49/25.64	12.77/12.21/12.92	41.90/41.37/41.85
PIO (Sp+Ob/Sp+St/Ob+St)	5.97/5.86/5.46	24.71/24.96/24.76	11.47/11.65/11.51	41.49/40.53/40.34
PIO (Sp+Ob+St/Sp+Ob+St+Euc.)	5.25/5.17	23.96/23.79	11.29/11.15	39.53/39.35
RSD [11]	5.23	24.58	11.66	37.80
Euc. $†$	5.17	24.39	11.40	37.55
Sp/Ob/St [21]	5.20/5.14/4.79	23.77/23.81/23.16	10.91/10.93/10.46	36.90/36.47/35.92
Sp/Ob/St $†$	5.08/5.11/4.73	23.69/23.75/23.09	10.52/10.66/10.33	36.71/36.38/35.85
PEMs of Sp/Ob/St	5.05/5.08/4.69	23.51/23.60/23.85	10.41/10.54/10.25	36.40/36.11/35.53
PIO for PEMs of Sp/Ob/St	4.95/5.03/4.62	23.47/23.51/23.77	10.37/10.51/10.19	36.33/36.02/35.41
PIO (Euc.+Sp/Euc.+St/Euc.+Ob)	5.00/5.08/5.14	23.69/23.25/23.32	10.74/10.25/10.93	35.76/35.55/35.81
PIO (Sp+Ob/Sp+St/Ob+St)	4.70/4.58/4.90	22.84/22.91/22.80	10.13/10.24/10.06	35.66/35.01/35.35
PIO (Sp+Ob+St/Sp+Ob+St+Euc.)	4.29/4.31	22.19/22.03	9.52/9.56	34.49/34.25

Equations

$$f_{l}(\mathbf{X}_{l};\mathcal{W}_{l})=f_{l}(\cdot;\mathcal{W}_{l})\circ\cdots\circ f_{1}(\mathbf{X}_{1};\mathcal{W}_{1}),$$

$$\mathcal{L}(\mathcal{W})\triangleq E_{\mathcal{P}}\{\mathcal{L}(\mathcal{W},\mathfrak{s})\}=\int\mathcal{L}(\mathcal{W},\mathfrak{s})\,d\mathcal{P}.$$

$$\mathcal{L}(\omega)\triangleq E_{\mathcal{P}}\{\mathcal{L}(\omega,\mathfrak{s})\}=\int\mathcal{L}(\omega,\mathfrak{s})\,d\mathcal{P}.$$

$$\min_{\mathcal{W}}\mathcal{L}(\mathcal{W})$$

$$\mathbb{M}_{G^{m}_{l}}=\bigtimes_{\iota\in\mathcal{I}^{m}_{G_{l}}}\mathcal{M}_{\iota},$$

$$\mathfrak{c}_{\iota}=\frac{\left\langle\mathcal{C}_{\iota}(X_{\omega_{\iota}},Y_{\omega_{\iota}})Y_{\omega_{\iota}},X_{\omega_{\iota}}\right\rangle}{\left\langle X_{\omega_{\iota}},X_{\omega_{\iota}}\right\rangle\left\langle Y_{\omega_{\iota}},Y_{\omega_{\iota}}\right\rangle-\left\langle X_{\omega_{\iota}},Y_{\omega_{\iota}}\right\rangle^{2}}$$

$$\mathfrak{d}_{G^{m}_{l}}(u_{G^{m}_{l}},v_{G^{m}_{l}})=\sum_{\iota\in\mathcal{I}^{m}_{G_{l}}}\mathfrak{d}_{\iota}(u_{\iota},v_{\iota}).$$

$$\bar{C}_{\iota}(u_{\iota},v_{\iota},x_{\iota},y_{\iota})=\left\langle\mathcal{C}_{\iota}(U,V)X,Y\right\rangle_{\omega_{\iota}},$$

$$\bar{C}_{G^{m}_{l}}(u_{G^{m}_{l}},v_{G^{m}_{l}},x_{G^{m}_{l}},y_{G^{m}_{l}})=\sum_{\iota\in\mathcal{I}^{m}_{G_{l}}}\bar{C}_{\iota}(u_{\iota},v_{\iota},x_{\iota},y_{\iota}),$$

$$h({\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t}),g(t,\Theta))=-\frac{g(t,\Theta)}{\mathfrak{g}(\omega_{G^{m}_{l}}^{t})}{\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t}),$$

$$\sum_{t=0}^{\infty}g(t,\Theta)=+\infty\quad\text{and}\quad\sum_{t=0}^{\infty}g(t,\Theta)^{2}<\infty,$$

$$\|{\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t})\|_{2}=\Big(\sum_{\iota\in\mathcal{I}_{G^{m}_{l}}}{\rm grad}\mathcal{L}(\omega_{l,\iota}^{t})^{2}\Big)^{\frac{1}{2}}.$$

$$\mathfrak{g}(\omega_{G^{m}_{l}}^{t})=\left(\max\{1,(R_{G^{m}_{l}}^{t})^{2}(2+R_{G^{m}_{l}}^{t})^{2}\}\right)^{\frac{1}{2}},$$

$$\mathfrak{g}(\omega_{G^{m}_{l}}^{t})=\left(\max\{1,(R_{G^{m}_{l}}^{t})^{2}(2+R_{G^{m}_{l}}^{t})^{2}\}\right)^{\frac{1}{2}},\quad\forall m,l$$

Taxonomy

Topics: Human Pose and Action Recognition · Medical Imaging and Analysis · Advanced Numerical Analysis Techniques

Methods: Stochastic Gradient Descent

Full text

Mete Ozay, Takayuki Okatani

Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi, Japan. {mozay,okatani}@vision.is.tohoku.ac.jp

Abstract

Recent advances in optimization methods used for training convolutional neural networks (CNNs) with kernels, which are normalized according to particular constraints, have shown remarkable success. This work introduces an approach for training CNNs using ensembles of joint spaces of kernels constructed using different constraints. For this purpose, we address a problem of optimization on ensembles of products of submanifolds (PEMs) of convolution kernels. To this end, we first propose three strategies to construct ensembles of PEMs in CNNs. Next, we expound their geometric properties (metric and curvature properties) in CNNs. We make use of our theoretical results by developing a geometry-aware SGD algorithm (G-SGD) for optimization on ensembles of PEMs to train CNNs. Moreover, we analyze convergence properties of G-SGD considering geometric properties of PEMs. In the experimental analyses, we employ G-SGD to train CNNs on Cifar-10, Cifar-100 and Imagenet datasets. The results show that geometric adaptive step size computation methods of G-SGD can improve training loss and convergence properties of CNNs. Moreover, we observe that classification performance of baseline CNNs can be boosted using G-SGD on ensembles of PEMs identified by multiple constraints.

1 Introduction

Recent works [4, 5, 8, 10, 13, 16, 17, 19, 22, 23] have suggested several methods for training deep neural networks using kernels (weights) with various normalization constraints to boost their performance. Spaces of normalized kernels have been explored using Riemannian manifolds (e.g. the Stiefel manifold), and stochastic optimization algorithms have been employed to train CNNs using kernel manifolds in [7, 14, 15, 21].

In this work, we suggest an approach for training CNNs using multiple constraints on kernels in order to learn a richer set of features than those learned using a single constraint. We address this problem by optimization on ensembles of products of different kernel submanifolds (PEMs) that are identified by different constraints on kernels. However, if we employ the aforementioned Riemannian SGD algorithms [6, 7, 21] on PEMs to train CNNs, then we observe early divergence as well as vanishing and exploding gradient problems. Therefore, we elucidate geometric properties of PEMs to assure convergence to local minima while training CNNs using our proposed geometry-aware stochastic gradient descent (G-SGD). Our contributions are summarized as follows:

1. We explicate the geometry of the space of convolution kernels defined by multiple constraints. For this purpose, we explore the relationship between geometric properties of PEMs, such as sectional curvature, geodesic distance, and gradients computed at PEMs, and those of component submanifolds of convolution kernels in CNNs (see Lemma 3.2 in Section 3).

2. We propose an SGD algorithm (G-SGD) for optimization on different ensembles of PEMs (Section 3) by generalizing the SGD methods employed on kernel submanifolds [14, 15, 21]. Next, we explore the effect of geometric properties of the PEMs on the convergence of the G-SGD using our theoretical results. Then, we employ the results for adaptive computation of the step size of the SGD (see Theorem 3.3 and Corollary 3.4). Moreover, we provide an example for computation of a step size function for optimization on PEMs identified by the sphere (Corollary 3.4). In addition, in Section 2 we propose three strategies to construct ensembles of identical and non-identical kernel spaces according to their employment on input and output channels in CNNs. To the best of our knowledge, our proposed G-SGD is the first algorithm with convergence properties that performs optimization on different ensembles of PEMs to train CNNs.

3. We experimentally analyze convergence properties and classification performance of CNNs on benchmark image classification datasets such as Cifar-10/100 and Imagenet, using various manifold ensemble schemes (Section 4). In the results, we observe that G-SGD employed on ensembles of PEMs can boost baseline state-of-the-art performance of CNNs.

Proofs of the theorems, additional results, and implementation details of the algorithms and datasets are given in the supplemental material.

2 Construction of Ensembles of PEMs

Suppose that we are given a set of training samples ${S=\{s_{i}=(\mathbf{I}_{i},y_{i})\}_{i=1}^{N}}$ of a random variable $s$ drawn from a distribution $\mathcal{P}$ on a measurable space $\mathfrak{S}$, where $y_{i}$ is a class label of the $i^{th}$ image $\mathbf{I}_{i}$. An $L$-layer CNN consists of a set of tensors $\mathcal{W}=\{\mathcal{W}_{l}\}_{l=1}^{L}$, where ${\mathcal{W}_{l}=\{\mathbf{W}_{d,l}\in\mathbb{R}^{A_{l}\times B_{l}\times C_{l}}\}_{d=1}^{D_{l}}}$, and ${\mathbf{W}_{d,l}=[W_{c,d,l}\in\mathbb{R}^{A_{l}\times B_{l}}]_{c=1}^{C_{l}}}$ is a tensor composed of kernels (weight matrices) $W_{c,d,l}$ constructed at each layer ${l=1,2,\ldots,L}$, for each $c^{th}$ channel $c=1,2,\ldots,C_{l}$ and each $d^{th}$ kernel $d=1,2,\ldots,D_{l}$. (We use shorthand notation for matrix concatenation such that $[W_{c,d,l}]_{c=1}^{C_{l}}\triangleq[W_{1,d,l},W_{2,d,l},\cdots,W_{C_{l},d,l}]$.) At each $l^{th}$ convolution layer, we compute a feature representation $f_{l}(\mathbf{X}_{l};\mathcal{W}_{l})$ by compositionally employing non-linear functions, and convolving an image $\mathbf{I}$ with kernels by

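$$f_{l}(\mathbf{X}_{l};\mathcal{W}_{l})=f_{l}(\cdot;\mathcal{W}_{l})\circ\cdots\circ f_{1}(\mathbf{X}_{1};\mathcal{W}_{1}),$$ (1)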

where ${\mathbf{X}_{1}:=\mathbf{I}}$ is an image for ${l=1}$, and $\mathbf{X}_{l}=[X_{c,l}]_{c=1}^{C_{l}}$. The $c^{th}$ channel of the data matrix $X_{c,l}$ is convolved with the kernel ${W}_{c,d,l}$ to obtain the $d^{th}$ feature map ${X_{c,l+1}:=\hat{X}_{d,l}}$ by $\hat{X}_{d,l}={W}_{c,d,l}\ast X_{c,l},\forall c,d,l$ (we ignore the bias terms in the notation for simplicity). Given a batch of samples $\mathfrak{s}\subseteq S$, we denote a value of a classification loss function for a kernel $\omega\triangleq W_{c,d,l}$ by $\mathcal{L}(\omega,\mathfrak{s})$, and the loss function of kernels $\mathcal{W}$ utilized in the CNN by $\mathcal{L}(\mathcal{W},\mathfrak{s})$. Assuming that $\mathfrak{s}$ contains a single sample, an expected loss or cost function of the CNN is computed by

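$$\mathcal{L}(\mathcal{W})\triangleq E_{\mathcal{P}}\{\mathcal{L}(\mathcal{W},\mathfrak{s})\}=\int\mathcal{L}(\mathcal{W},\mathfrak{s})\,d\mathcal{P}.$$ (2)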

The expected loss $\mathcal{L}(\omega)$ for $\omega$ is computed by

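$$\mathcal{L}(\omega)\triangleq E_{\mathcal{P}}\{\mathcal{L}(\omega,\mathfrak{s})\}=\int\mathcal{L}(\omega,\mathfrak{s})\,d\mathcal{P}.$$ (3)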

For a finite set of samples $S$ , $\mathcal{L}(\mathcal{W})$ is approximated by an empirical loss $\frac{1}{|S|}\sum_{i=1}^{|S|}\mathcal{L}(\mathcal{W},s_{i})$ , where $|S|$ is the size of $S$ (similarly, $\mathcal{L}(\omega)$ is approximated by the empirical loss for $\omega$ ). Then, feature representations are learned by solving

$$\min_{\mathcal{W}}\mathcal{L}(\mathcal{W})$$ (4)

using an SGD algorithm. In the SGD algorithms employed on kernel submanifolds [14, 15, 21], each kernel is assumed to reside on an embedded kernel submanifold $\mathcal{M}_{c,d,l}$ at the $l^{th}$ layer of a CNN, such that ${\omega\in\mathcal{M}_{c,d,l}},\forall c,d$ . In this work, we propose a geometry-aware SGD algorithm (G-SGD), by generalizing the SGD algorithms [14, 15, 21] for optimization on ensembles of different products of the kernel submanifolds, which are defined next.

Definition 2.1 (Products of embedded kernel submanifolds of convolution kernels (PEMs) and their ensemble).

Suppose that ${\mathcal{G}_{l}=\{\mathcal{M}_{\iota}:\iota\in\mathcal{I}_{\mathcal{G}_{l}}\}}$ is an ensemble of Riemannian kernel submanifolds $\mathcal{M}_{\iota}$ of dimension $n_{\iota}$ , which is identified by a set of indices $\mathcal{I}_{\mathcal{G}_{l}},\forall{l=1,2,\ldots,L}$ . More concretely, $\mathcal{I}_{\mathcal{G}_{l}}$ contains indices each of which represents an identity number ( $\iota$ ) of a kernel that resides on a manifold $\mathcal{M}_{\iota}$ at the $l^{th}$ layer. In addition, a subset ${\mathcal{I}_{{G}_{l}}^{m}\subseteq\mathcal{I}_{\mathcal{G}_{l}}},{m=1,2,\ldots,M}$ , is used to determine a subset ${G}^{m}_{l}\subseteq\mathcal{G}_{l}$ of kernel submanifolds which will be aggregated to construct a PEM, and satisfies the following properties:

• Each subset of indices contains at least one kernel such that ${\mathcal{I}_{{G}_{l}}^{m}}\neq\emptyset$, for each $m=1,2,\ldots,M$.

• The set of indices $\mathcal{I}_{\mathcal{G}_{l}}$ is covered by the subsets ${\mathcal{I}_{{G}_{l}}^{m}}$ such that $\mathcal{I}_{\mathcal{G}_{l}}=\bigcup\limits_{m=1}^{M}{\mathcal{I}_{{G}_{l}}^{m}}$.

• If kernels are not shared among PEMs such that ensembles are constructed using non-overlapping sets, then $\mathcal{I}_{G_{l}}^{m}\cap\mathcal{I}_{{G}_{l}}^{\bar{m}}=\emptyset$ for $m\neq\bar{m}$.

• If kernels are shared among PEMs such that ensembles are constructed using overlapping sets, then ${\mathcal{I}_{G_{l}}^{m}\cap\mathcal{I}_{{G}_{l}}^{\bar{m}}\neq\emptyset}$ for $m\neq\bar{m}$.

A $G^{m}_{l}$ product manifold of convolution kernels ( $G^{m}_{l}$ -PEM) constructed at the $l^{th}$ layer of an $L$ -layer CNN, denoted by $\mathbb{M}_{G^{m}_{l}}$ , is a product of embedded kernel submanifolds belonging to ${G}^{m}_{l}$ which is computed by

$$\mathbb{M}_{G^{m}_{l}}=\bigtimes_{\iota\in\mathcal{I}^{m}_{G_{l}}}\mathcal{M}_{\iota},$$ (5)

where $\bigtimes$ is the topological Cartesian product, and therefore $\mathbb{M}_{G^{m}_{l}}$ is endowed with the product topology. Each ${\mathcal{M}_{\iota}\in{G}^{m}_{l}}$ is called a component submanifold of $\mathbb{M}_{G^{m}_{l}}$. A kernel $\omega_{G^{m}_{l}}\in\mathbb{M}_{G^{m}_{l}}$ is then obtained by concatenating kernels belonging to $\mathcal{M}_{\iota}$, $\forall\iota\in\mathcal{I}^{m}_{{G}_{l}}$, using ${\omega_{G^{m}_{l}}=(\omega_{1},\omega_{2},\cdots,\omega_{|\mathcal{I}^{m}_{{G}_{l}}|})}$, where $|\mathcal{I}^{m}_{{G}_{l}}|$ is the cardinality of $\mathcal{I}^{m}_{{G}_{l}}$. The ensemble of PEMs constructed using (5) for $m=1,2,\ldots,M$ is called a $\mathcal{G}_{l}$-PEM. $\blacksquare$

We compute a PEM $\mathbb{M}_{G^{m}_{l}}$ using component submanifolds $\mathcal{M}_{\iota}$ in (5) utilizing ${\mathcal{I}_{{G}_{l}}^{m}\subseteq\mathcal{I}_{\mathcal{G}_{l}}},m=1,2,\ldots,M$, and construct ensembles of PEMs $\mathcal{G}_{l}$ using $\mathcal{I}_{\mathcal{G}_{l}}$. Recall that, at each $l^{th}$ layer of an $L$-layer CNN, we compute a convolution kernel ${\omega_{\iota}\triangleq W_{c,d,l}}$, ${c\in\Lambda^{l}},\Lambda^{l}=\{1,2,\ldots,C_{l}\}$, ${d\in O^{l}}$, $O^{l}=\{1,2,\ldots,D_{l}\}$. We first choose $\mathfrak{A}$ subsets of indices of input channels ${\Lambda_{a}\subseteq\Lambda^{l}},a=1,2,\ldots,\mathfrak{A}$ and $\mathfrak{B}$ subsets of indices of output channels $O_{b}\subseteq O^{l},b=1,2,\ldots,\mathfrak{B}$, such that $\Lambda^{l}=\bigcup\limits_{a=1}^{\mathfrak{A}}\Lambda_{a}$ and $O^{l}=\bigcup\limits_{b=1}^{\mathfrak{B}}O_{b}$. Then, we propose three strategies for determination of index sets (see Figure 1):

1. PEMs for input channels (PI): For each $c^{th}$ input channel, we construct $\mathcal{I}_{\mathcal{G}_{l}}=\bigcup\limits_{c=1}^{C_{l}}\mathcal{I}_{{G}_{l}}^{c}$, where ${\mathcal{I}_{{G}_{l}}^{c}=O_{b}\times\{c\}}$ and the Cartesian product ${O_{b}\times\{c\}}$ preserves the input channel index, $\forall b,c$.

2. PEMs for output channels (PO): For each $d^{th}$ output channel, we construct $\mathcal{I}_{\mathcal{G}_{l}}=\bigcup\limits_{d=1}^{D_{l}}\mathcal{I}_{{G}_{l}}^{d}$, where ${\mathcal{I}_{{G}_{l}}^{d}=\Lambda_{a}\times\{d\}}$ and the Cartesian product $\Lambda_{a}\times\{d\}$ preserves the output channel index, $\forall a,d$.

3. PEMs for input and output channels (PIO): We construct $\mathcal{I}_{{G}_{l}}^{a,b}=\mathcal{I}_{{G}_{l}}^{a}\cup\mathcal{I}_{{G}_{l}}^{b}$, where $\mathcal{I}_{{G}_{l}}^{a}=\{\Lambda_{a}\times a\}$ and ${\mathcal{I}_{{G}_{l}}^{b}=\{O_{b}\times b\}}$ such that $\mathcal{I}_{\mathcal{G}_{l}}=\bigcup\limits_{a=1,b=1}^{\mathfrak{A},\mathfrak{B}}\mathcal{I}_{{G}_{l}}^{a,b}$.

Example 2.2.

An illustration of the use of PI, PO and PIO at the $l^{th}$ layer of a CNN is given in Figure 1. Suppose that we have a kernel tensor of size $3\times 3\times 4\times 6$, where the numbers of input and output channels are $4$ and $6$, respectively. In total, we have ${4\times 6=24}$ kernel matrices of size $3\times 3$. An example of construction of an ensemble of PEMs is as follows.

1. PI: For each of 4 input channels, we split a set of 6 kernels associated with 6 output channels into two subsets of 3 kernels. Choosing the sphere (Sp) for the first subset, we construct a PEM as a product of 3 Sp using (5). That is, each of 3 component manifolds ${\mathcal{M}_{\iota}},{\iota=1,2,3}$, of the PEM is a sphere. Similarly, choosing the Stiefel (St) for the second subset, we construct another PEM as a product of 3 St (each of 3 component manifolds ${\mathcal{M}_{\iota}},\iota=1,2,3$, of the second PEM is a Stiefel manifold). Thus, at this layer, we construct an ensemble of 4 PEMs of 3 St and 4 PEMs of 3 Sp.

2. PO: For each of 6 output channels, we split a set of 4 kernels corresponding to the input channels into two subsets of 2 kernels. We choose the Sp for the first subset, and we construct a PEM as a product of 2 Sp using (5). We choose the St for the second subset, and we construct a PEM as a product of 2 St. Thereby, we have an ensemble consisting of 6 PEMs of St and 6 PEMs of Sp.

3. PIO: We split the set of 24 kernels into 10 subsets. For each of 6 output channels, we split the set of kernels corresponding to the input channels into 3 subsets. We choose the Sp for 2 subsets each containing 3 kernels, and 3 subsets each containing 2 kernels. We choose the St similarly for the remaining subsets. Then, our ensemble contains 5 PEMs of St and 5 PEMs of Sp.
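The PI and PO constructions of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`split_evenly`, `pi_index_sets`, `po_index_sets`), the 0-based `(c, d)` index pairs, and the contiguous, equally sized split of channel indices are hypothetical choices, not taken from the paper. PIO is omitted because the example does not fully specify how its 10 subsets are formed.

```python
from itertools import product


def split_evenly(indices, num_subsets):
    """Split a list of indices into num_subsets contiguous, equally sized parts."""
    if len(indices) % num_subsets != 0:
        raise ValueError("indices must divide evenly into num_subsets")
    size = len(indices) // num_subsets
    return [indices[i * size:(i + 1) * size] for i in range(num_subsets)]


def pi_index_sets(num_in, num_out, num_subsets):
    """PI: for each input channel c, pair c with each subset O_b of output indices."""
    out_subsets = split_evenly(list(range(num_out)), num_subsets)
    return [[(c, d) for d in o_b] for c, o_b in product(range(num_in), out_subsets)]


def po_index_sets(num_in, num_out, num_subsets):
    """PO: for each output channel d, pair d with each subset Lambda_a of input indices."""
    in_subsets = split_evenly(list(range(num_in)), num_subsets)
    return [[(c, d) for c in lam_a] for d, lam_a in product(range(num_out), in_subsets)]


# Example 2.2: 4 input channels, 6 output channels, two subsets per channel.
pi_sets = pi_index_sets(num_in=4, num_out=6, num_subsets=2)
po_sets = po_index_sets(num_in=4, num_out=6, num_subsets=2)

# First subset of every channel goes to the sphere (Sp), second to the Stiefel manifold (St).
pi_manifolds = ["Sp" if i % 2 == 0 else "St" for i in range(len(pi_sets))]
po_manifolds = ["Sp" if i % 2 == 0 else "St" for i in range(len(po_sets))]

print(len(pi_sets), len(pi_sets[0]))  # number of PI PEMs and kernels per PEM
print(len(po_sets), len(po_sets[0]))  # number of PO PEMs and kernels per PEM
```

Each inner list is one index set $\mathcal{I}^{m}_{G_{l}}$; since the subsets are disjoint and cover all 24 kernels, this corresponds to the non-overlapping case of Definition 2.1.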

Our framework can be used to model both overlapping and non-overlapping sets. If ensembles are constructed using overlapping sets, then kernels having different constraints can be applied to the same input or output channels. For example, kernels belonging to a PEM of 3 St and kernels belonging to a PEM of 3 Sp can be applied to the same output (input) channel for PI (PO) in the previous example (see Figure 1). More complicated configurations can be obtained using PIO. In the experiments, we selected non-overlapping sets for simplicity. We leave theoretical and experimental analyses of overlapping sets as future work.

3 Optimization on Ensembles of PEMs using Geometry-aware SGD in CNNs

If an SGD is employed on non-linear kernel submanifolds, then the gradient descent is generally performed in three steps: i) projection of gradients onto tangent spaces of the submanifolds, ii) movement of kernels on the tangent spaces in the gradient descent direction, and iii) projection of the moved kernels onto the submanifolds [21]. These steps are determined according to the geometric properties of the submanifolds, such as sectional curvature and metric properties. For example, the Euclidean space has zero sectional curvature, i.e. it is flat. Thereby, these steps can be performed in a single step if an SGD employs kernels residing on the Euclidean space. However, if kernels belong to the unit sphere, then the kernel space has constant positive curvature. Moreover, a different tangent space is computed at each kernel located on the sphere. Therefore, the nonlinearity of the operations and transformations applied to kernels, which is implied by the curvature and metric of the kernel spaces, is used for gradient descent in the aforementioned three steps. In addition, martingale properties of stochastic processes defined by kernels are determined by geodesics, metrics, gradients projected onto tangent spaces and the injectivity radius of kernel spaces (see proofs of Theorem 3.3 and Corollary 3.4 in the supp. mat. for details).
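The three steps can be illustrated for a single kernel on the unit sphere. The sketch below is a generic Riemannian SGD step and not the paper's algorithm: the function names are hypothetical, the kernel is flattened to a plain list of floats, and renormalization is assumed as the retraction in step iii).

```python
import math


def dot(a, b):
    """Euclidean inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_to_tangent(w, euclid_grad):
    """Step i): remove the gradient component that is normal to the sphere at w."""
    coeff = dot(w, euclid_grad)
    return [g - coeff * x for g, x in zip(euclid_grad, w)]


def retract(v):
    """Step iii): map a point of the ambient space back onto the unit sphere."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sphere_sgd_step(w, euclid_grad, step_size):
    """One descent step for a unit-norm kernel w (flattened to a vector)."""
    riem_grad = project_to_tangent(w, euclid_grad)              # step i)
    moved = [x - step_size * g for x, g in zip(w, riem_grad)]   # step ii)
    return retract(moved)                                       # step iii)


w = [1.0, 0.0, 0.0]      # a kernel on the unit sphere
grad = [0.5, 2.0, 0.0]   # a Euclidean gradient from back-propagation
w_next = sphere_sgd_step(w, grad, step_size=0.1)
print(w_next, dot(w_next, w_next))
```

In the Euclidean case the projection and the retraction are both the identity, which is why the three steps collapse into the single update `w - step_size * grad`.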

Geometric properties of PEMs can be different from those of the component submanifolds of PEMs, even if they are constructed using identical submanifolds. For example, we observe locally varying curvatures when we construct PEMs of spheres (see Figure 2). Kernel spaces with more complicated geometric properties can be obtained using the proposed strategies (PI, PO, PIO), especially by constructing ensembles of PEMs of non-identical submanifolds (see Section 4 for details and examples). Thus, as the complexity of the geometry of kernel spaces increases, its effect on the performance and convergence of SGD gradually increases.

In order to address these problems and consider geometric properties of kernel submanifolds for training of CNNs, we propose a geometry-aware SGD (G-SGD). We employ metric properties of PEMs to perform gradient descent steps of G-SGD, and use curvature properties of PEMs to explore convergence properties of G-SGD. We explore metric and curvature properties of PEMs in Lemma 3.2.

Definition 3.1 (Sectional curvature of component submanifolds).

Let $\mathfrak{X}(\mathcal{M}_{\iota})$ denote the set of smooth vector fields on $\mathcal{M}_{\iota}$ . The sectional curvature of $\mathcal{M}_{\iota}$ associated with a two dimensional subspace $\mathfrak{T}\subset\mathcal{T}_{\omega_{\iota}}\mathcal{M}_{\iota}$ is defined by

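$$\mathfrak{c}_{\iota}=\frac{\left\langle\mathcal{C}_{\iota}(X_{\omega_{\iota}},Y_{\omega_{\iota}})Y_{\omega_{\iota}},X_{\omega_{\iota}}\right\rangle}{\left\langle X_{\omega_{\iota}},X_{\omega_{\iota}}\right\rangle\left\langle Y_{\omega_{\iota}},Y_{\omega_{\iota}}\right\rangle-\left\langle X_{\omega_{\iota}},Y_{\omega_{\iota}}\right\rangle^{2}}$$ (6)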

where $\mathcal{C}_{\iota}(X_{\omega_{\iota}},Y_{\omega_{\iota}})Y_{\omega_{\iota}}$ is the Riemannian curvature tensor (additional definitions are given in the supp. mat.), $\left\langle\cdot,\cdot\right\rangle$ is an inner product, and ${X_{\omega_{\iota}}\in\mathfrak{X}(\mathcal{M}_{\iota})}$ and ${Y_{\omega_{\iota}}\in\mathfrak{X}(\mathcal{M}_{\iota})}$ form a basis of $\mathfrak{T}$. $\blacksquare$
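As a worked instance of this definition, consider the unit sphere. Under the sign convention $\mathcal{C}_{\iota}(X,Y)Z=\langle Y,Z\rangle X-\langle X,Z\rangle Y$ (a standard fact about the round unit sphere, stated here as an assumption about the convention rather than taken from the paper), the numerator of the sectional curvature equals its denominator, so the curvature is 1 for every two-plane. Subscripts $\omega_{\iota}$ are dropped for readability.

```latex
\mathcal{C}_{\iota}(X,Y)Y = \langle Y,Y\rangle X - \langle X,Y\rangle Y
\quad\Longrightarrow\quad
\mathfrak{c}_{\iota}
= \frac{\langle X,X\rangle\langle Y,Y\rangle - \langle X,Y\rangle^{2}}
       {\langle X,X\rangle\langle Y,Y\rangle - \langle X,Y\rangle^{2}}
= 1 .
```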

Lemma 3.2 (Metric and curvature properties of PEMs).

Suppose that $u_{\iota}\in\mathcal{T}_{\omega_{\iota}}\mathcal{M}_{\iota}$ and $v_{\iota}\in\mathcal{T}_{\omega_{\iota}}\mathcal{M}_{\iota}$ are tangent vectors belonging to the tangent space $\mathcal{T}_{\omega_{\iota}}\mathcal{M}_{\iota}$ computed at ${{\omega_{\iota}}\in\mathcal{M}_{\iota}}$ , $\forall\iota\in\mathcal{I}^{m}_{{G}_{l}},m=1,2,\ldots,M$ . Then, tangent vectors ${u_{G^{m}_{l}}\in\mathcal{T}_{\omega_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}}$ and ${v_{G^{m}_{l}}\in\mathcal{T}_{\omega_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}}$ are computed at $\omega_{G^{m}_{l}}\in\mathbb{M}_{G^{m}_{l}}$ by concatenation as ${u_{G^{m}_{l}}=(u_{1},u_{2},\cdots,u_{|\mathcal{I}^{m}_{{G}_{l}}|})}$ and ${v_{G^{m}_{l}}=(v_{1},v_{2},\cdots,v_{|\mathcal{I}^{m}_{{G}_{l}}|})}$ . If each kernel submanifold $\mathcal{M}_{\iota}$ is endowed with a Riemannian metric $\mathfrak{d}_{\iota}$ , then a $G^{m}_{l}$ -PEM is endowed with the metric $\mathfrak{d}_{G^{m}_{l}}$ computed by

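$$\mathfrak{d}_{G^{m}_{l}}(u_{G^{m}_{l}},v_{G^{m}_{l}})=\sum_{\iota\in\mathcal{I}^{m}_{G_{l}}}\mathfrak{d}_{\iota}(u_{\iota},v_{\iota}).$$ (7)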

In addition, suppose that $\bar{C}_{\iota}$ is the Riemannian curvature tensor field (endomorphism) [20] of $\mathcal{M}_{\iota}$ , ${x_{\iota},y_{\iota}\in\mathcal{T}_{\omega_{\iota}}\mathcal{M}_{\iota}}$ , $\forall\iota\in\mathcal{I}^{m}_{{G}_{l}}$ defined by

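$$\bar{C}_{\iota}(u_{\iota},v_{\iota},x_{\iota},y_{\iota})=\left\langle\mathcal{C}_{\iota}(U,V)X,Y\right\rangle_{\omega_{\iota}},$$ (8)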

where $U,V,X,Y$ are vector fields such that $U_{\omega_{\iota}}=u_{\iota}$, $V_{\omega_{\iota}}=v_{\iota}$, $X_{\omega_{\iota}}=x_{\iota}$, and $Y_{\omega_{\iota}}=y_{\iota}$. Then, the Riemannian curvature tensor field $\bar{C}_{G^{m}_{l}}$ of $\mathbb{M}_{G^{m}_{l}}$ is computed by

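$$\bar{C}_{G^{m}_{l}}(u_{G^{m}_{l}},v_{G^{m}_{l}},x_{G^{m}_{l}},y_{G^{m}_{l}})=\sum_{\iota\in\mathcal{I}^{m}_{G_{l}}}\bar{C}_{\iota}(u_{\iota},v_{\iota},x_{\iota},y_{\iota}),$$ (9)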

where ${x_{G^{m}_{l}}=(x_{1},x_{2},\cdots,x_{|\mathcal{I}^{m}_{{G}_{l}}|})}$ and ${y_{G^{m}_{l}}=(y_{1},y_{2},\cdots,y_{|\mathcal{I}^{m}_{{G}_{l}}|})}$. Moreover, $\mathbb{M}_{G^{m}_{l}}$ never has strictly positive sectional curvature $\mathfrak{c}_{G^{m}_{l}}$ in the metric (7). In addition, if $\mathbb{M}_{G^{m}_{l}}$ is compact, then $\mathbb{M}_{G^{m}_{l}}$ does not admit a metric with negative sectional curvature $\mathfrak{c}_{G^{m}_{l}}$. $\blacksquare$

We compute the metric of a $G^{m}_{l}$ -PEM $\mathbb{M}_{G^{m}_{l}}$ using the metrics identified on the component manifolds $\mathcal{M}_{\iota}$ employing (7) given in Lemma 3.2. In addition, we use the Riemannian curvature and sectional curvature of the $\mathbb{M}_{G^{m}_{l}}$ given in Lemma 3.2 to analyze convergence of our proposed G-SGD, and to compute adaptive step size.

Note that some sectional curvatures vanish on $\mathbb{M}_{G^{m}_{l}}$ by the lemma. For instance, suppose that each $\mathcal{M}_{\iota}$ is a unit two-sphere $\mathbb{S}^{2}$, $\forall\iota\in\mathcal{I}_{\mathcal{G}_{l}}$ (see Figure 2.a). Then, $\mathbb{M}_{G^{m}_{l}}$ computed by (5) has unit curvature along two-dimensional subspaces of its tangent spaces, called two-planes, that are tangent to a single sphere. On the other hand, $\mathbb{M}_{G^{m}_{l}}$ has zero curvature along all two-planes spanned by a vector tangent to one sphere and a vector tangent to a distinct sphere. Therefore, learning rates need to be computed adaptively according to sectional curvatures at each layer of the CNN and at each epoch of the G-SGD for each kernel $\omega$ on each manifold $\mathbb{M}_{G^{m}_{l}}$.
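The vanishing and non-vanishing curvatures in this example can be checked numerically. The sketch below assumes the standard curvature tensor of the round unit sphere, for which $\langle\mathcal{C}(x,y)y,x\rangle=\langle x,x\rangle\langle y,y\rangle-\langle x,y\rangle^{2}$, and combines components by summing curvature terms and inner products over the factors, as Lemma 3.2 states. The function names and the choice of base point (the north pole of each sphere) are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def sphere_curvature_term(x, y):
    """<C(x, y) y, x> for tangent vectors x, y on a unit sphere."""
    return dot(x, x) * dot(y, y) - dot(x, y) ** 2


def product_sectional_curvature(u, v):
    """Sectional curvature of a product of unit spheres along span{u, v}.

    u and v are tuples holding one tangent vector per component sphere.
    """
    numerator = sum(sphere_curvature_term(u_i, v_i) for u_i, v_i in zip(u, v))
    uu = sum(dot(u_i, u_i) for u_i in u)
    vv = sum(dot(v_i, v_i) for v_i in v)
    uv = sum(dot(u_i, v_i) for u_i, v_i in zip(u, v))
    return numerator / (uu * vv - uv ** 2)


# Tangent vectors at the north pole (0, 0, 1) of each two-sphere.
e1, e2, zero = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]

same_sphere = product_sectional_curvature((e1, zero), (e2, zero))
mixed = product_sectional_curvature((e1, zero), (zero, e1))
diagonal = product_sectional_curvature((e1, e1), (e2, e2))
print(same_sphere, mixed, diagonal)
```

A two-plane tangent to one factor has curvature 1, a plane mixing one direction from each factor has curvature 0, and the diagonal plane lies in between, which is one way to see the locally varying curvature of a product of spheres.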

3.1 Optimization using G-SGD in CNNs

An algorithmic description of our proposed geometry-aware SGD (G-SGD) is given in Algorithm 12. At the initialization of the G-SGD, we identify the component embedded kernel submanifolds $\mathcal{M}_{\iota}$ according to the constraints that will be applied on the kernels $\omega_{\iota}\in\mathcal{M}_{\iota}$. For instance, we employ a normalization constraint $\|\omega_{\iota}\|_{F}=1$ for kernels $\omega_{\iota}$ residing on the $n_{\iota}$-dimensional unit sphere $\mathcal{M}_{\iota}\equiv\mathbb{S}^{n_{\iota}}$, where $\|\cdot\|_{F}$ is the Frobenius norm [2]. (In the experimental analyses, we use the oblique and the Stiefel manifolds as well as the sphere and the Euclidean space to identify component submanifolds $\mathcal{M}_{\iota}$.)

When we employ a G-SGD on a $G^{m}_{l}$ -PEM $\mathbb{M}_{G^{m}_{l}}$ , each kernel $\omega_{G^{m}_{l}}^{t}\in\mathbb{M}_{G^{m}_{l}}$ is moved on the $G^{m}_{l}$ -PEM in the descent direction of gradient of loss at each $t^{th}$ step of the G-SGD. More precisely, direction and amount of movement of a kernel $\omega^{t}_{G^{m}_{l}}$ are determined at the $t^{th}$ step and the $l^{th}$ layer by the following steps of Algorithm 12:

Line 6: Using Lemma 3.2, the gradient ${\rm grad}_{E}\;\mathcal{L}(\omega_{G^{m}_{l}}^{t})$, which is obtained using back-propagation from the upper layer, is projected onto the tangent space $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}$ to obtain ${\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t})$, where $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}=\bigtimes\limits_{\iota\in\mathcal{I}_{G^{m}_{l}}}\mathcal{T}_{\omega^{t}_{\iota,l}}\mathbb{M}_{\iota}$.

Line 7: Movement of $\omega^{t}_{G^{m}_{l}}$ on $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}$ using $h({\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t}),g(t,\Theta))$ computed by

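$$h({\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t}),g(t,\Theta))=-\frac{g(t,\Theta)}{\mathfrak{g}(\omega_{G^{m}_{l}}^{t})}{\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t}),$$ (10)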

where $g(t,\Theta)$ is the learning rate that satisfies

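$$\sum_{t=0}^{\infty}g(t,\Theta)=+\infty\quad\text{and}\quad\sum_{t=0}^{\infty}g(t,\Theta)^{2}<\infty,$$ (11)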

Here, $\mathfrak{g}(\omega_{G^{m}_{l}}^{t})=\max\{1,\Gamma_{1}^{t}\}^{\frac{1}{2}}$, $\Gamma_{1}^{t}=(R_{G^{m}_{l}}^{t})^{2}\Gamma_{2}^{t}$, and ${\Gamma_{2}^{t}=\max\{(2\rho_{G^{m}_{l}}^{t}+R_{G^{m}_{l}}^{t})^{2},(1+\mathfrak{c}_{G^{m}_{l}}(\rho_{G^{m}_{l}}^{t}+R_{G^{m}_{l}}^{t}))\}}$, where $\rho_{G^{m}_{l}}^{t}\triangleq\rho(\omega_{G^{m}_{l}}^{t},\hat{\omega}_{G^{m}_{l}})$ is the geodesic distance between $\omega_{G^{m}_{l}}^{t}$ and a local minimum $\hat{\omega}_{G^{m}_{l}}$ on $\mathbb{M}_{G^{m}_{l}}$, $\mathfrak{c}_{G^{m}_{l}}$ is the sectional curvature of $\mathbb{M}_{G^{m}_{l}}$, and $R_{G^{m}_{l}}^{t}\triangleq\|{\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t})\|_{2}$, which can be computed using Lemma 3.2 by

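$$\|{\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t})\|_{2}=\Big(\sum_{\iota\in\mathcal{I}_{G^{m}_{l}}}{\rm grad}\mathcal{L}(\omega_{l,\iota}^{t})^{2}\Big)^{\frac{1}{2}}.$$ (12)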

Line 8: Projection of the moved kernel at $v_{t}$ onto the manifold $\mathbb{M}_{G^{m}_{l}}$ using $\phi_{\omega_{G^{m}_{l}}^{t}}(v_{t})$ to compute $\omega^{t+1}_{G^{m}_{l}}$ , where $\phi_{\omega_{G^{m}_{l}}^{t}}(v_{t})$ is an exponential map, or a retraction which is an approximation of the exponential map [3].
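The three lines above can be put together for a PEM whose components are all unit spheres. The following is a minimal sketch under several assumptions and is not the authors' implementation: the function names are hypothetical; the schedule $g(t,\Theta)=\Theta/(t+1)$ is one arbitrary choice that satisfies (11); renormalization is assumed as the retraction; and, because the geodesic distance to a local minimum is unknown in practice, the normalizer uses the simplified form $\mathfrak{g}=(\max\{1,R^{2}(2+R)^{2}\})^{1/2}$ that appears in the paper's equation list, apparently for the sphere case.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def tangent_projection(w, g):
    """Project a Euclidean gradient g onto the tangent space of the sphere at w."""
    coeff = dot(w, g)
    return [gi - coeff * wi for gi, wi in zip(g, w)]


def normalize(v):
    """Assumed retraction: rescale a vector back to unit norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def learning_rate(t, theta=0.1):
    """Hypothetical schedule g(t, Theta) = Theta / (t + 1), which satisfies (11)."""
    return theta / (t + 1)


def g_sgd_step(kernels, euclid_grads, t, theta=0.1):
    """One joint step on a product of unit spheres (one list entry per component)."""
    # Line 6: the tangent space of the product is the product of tangent spaces.
    grads = [tangent_projection(w, g) for w, g in zip(kernels, euclid_grads)]
    # Joint gradient norm R over all components, as in (12).
    R = math.sqrt(sum(dot(g, g) for g in grads))
    # Simplified step-size normalizer (assumed form, see the lead-in).
    frak_g = math.sqrt(max(1.0, R ** 2 * (2.0 + R) ** 2))
    scale = learning_rate(t, theta) / frak_g
    # Line 7: move every component along its negative Riemannian gradient, as in (10).
    moved = [[wi - scale * gi for wi, gi in zip(w, g)] for w, g in zip(kernels, grads)]
    # Line 8: retract each component back onto its sphere.
    return [normalize(v) for v in moved], R, frak_g


kernels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
euclid_grads = [[0.0, 3.0, 0.0], [4.0, 0.0, 0.0]]
new_kernels, R, frak_g = g_sgd_step(kernels, euclid_grads, t=0)
print(R, frak_g, new_kernels)
```

Because `R` is computed over all components before any of them is moved, a large gradient on one component shrinks the step taken on every component of the same PEM, which is the joint behaviour that distinguishes this update from running an independent sphere SGD per kernel.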

The denominator $\mathfrak{g}(\omega_{G^{m}_{l}}^{t})$ used for computation of the step size in (10) is employed as a regularizer to control the change of the gradient ${\rm grad}\mathcal{L}(\omega_{G^{m}_{l}}^{t})$ at each step of G-SGD. This property is examined in the experimental analyses for PEMs of different manifolds. For computation of $\mathfrak{g}(\omega_{G^{m}_{l}}^{t})$, we use (12) utilizing Lemma 3.2. Unlike related works, kernels residing on each PEM are moved and projected jointly on the PEMs in G-SGD, which allows us to exploit their interaction through the corresponding gradients while considering the nonlinear geometry of the manifolds. G-SGD can perform optimization on PEMs and their ensemble according to sets $G^{m}_{l},\forall m$, recursively. Thereby, G-SGD can consider interactions between component manifolds as well as those between PEMs in an ensemble. SGD methods studied in the literature do not assure convergence when they are applied to optimization on ensembles of PEMs. Employment of (10) and (11) at line 7, and of retractions at line 8, is essential for assurance of convergence, as explained next.

3.2 Convergence Properties of G-SGD

In some machine learning tasks, such as clustering [6, 24], the geodesic distance $\rho_{G^{m}_{l}}^{t}$ can be computed in closed form. However, a closed-form solution may not be available for CNNs because the local minima are difficult to compute. Therefore, we provide an asymptotic convergence property for Algorithm 12 in the next theorem.

Theorem 3.3.

Suppose that there exists a local minimum $\hat{\omega}_{G^{m}_{l}}\in\mathbb{M}_{G^{m}_{l}},\forall G^{m}_{l}\subseteq\mathcal{G}_{l},\forall l$, and $\exists\epsilon>0$ such that $\inf\limits_{\rho_{G^{m}_{l}}^{t}>\epsilon^{\frac{1}{2}}}\left\langle\phi_{\omega_{G^{m}_{l}}^{t}}(\hat{\omega}_{G^{m}_{l}})^{-1},\nabla\mathcal{L}(\omega_{G^{m}_{l}}^{t})\right\rangle<0$, where $\phi$ is an exponential map or a twice continuously differentiable retraction, and $\langle\cdot,\cdot\rangle$ is the inner product. Then the loss function and the gradient converge almost surely (a.s.), that is, $\mathcal{L}(\omega^{t}_{G^{m}_{l}})\xrightarrow[t\to\infty]{\rm a.s.}\mathcal{L}(\hat{\omega}_{G^{m}_{l}})$ and $\nabla\mathcal{L}(\omega^{t}_{G^{m}_{l}})\xrightarrow[t\to\infty]{\rm a.s.}0$, for each $\mathbb{M}_{G^{m}_{l}},\forall l$. $\blacksquare$

Theorem 3.3 assures convergence of G-SGD (Algorithm 12) to minima. For implementation of G-SGD, we use the result given in Lemma 3.2 for PEMs to employ sectional curvatures. Although sectional curvatures of non-identical embedded kernel submanifolds can be different [21], Lemma 3.2 assures existence of zero sectional curvature in PEMs along their tangent spaces. In the next corollary, we provide an example of computing a step size function $\mathfrak{g}(\cdot)$ for component embedded kernel submanifolds determined by the sphere using the result given in Lemma 3.2, and explore its convergence property using Theorem 3.3.

Corollary 3.4.

Suppose that each $\mathbb{M}_{\iota}$ is identified by the ${n_{\iota}\geq 2}$ dimensional unit sphere $\mathbb{S}^{n_{\iota}}$, and $\rho_{G^{m}_{l}}^{t}\leq\hat{\mathfrak{c}}^{-1}$, where $\hat{\mathfrak{c}}$ is an upper bound on the sectional curvatures of $\mathbb{M}_{G^{m}_{l}},\forall l$ at $\omega_{G^{m}_{l}}^{t}\in\mathbb{M}_{G^{m}_{l}},\forall t$. If the step size is computed using (10) with

[TABLE]

then ${\mathcal{L}(\omega^{t}_{G^{m}_{l}})\xrightarrow[t\to\infty]{\rm a.s.}\mathcal{L}(\hat{\omega}_{G^{m}_{l}})}$ , and ${\nabla\mathcal{L}(\omega^{t}_{G^{m}_{l}})\xrightarrow[t\to\infty]{\rm a.s.}0}$ , for each $\mathbb{M}_{G^{m}_{l}},\forall l$ . $\blacksquare$

In the experimental analyses, we use different step size functions and analyze convergence properties and performance of CNNs trained using G-SGD by relaxing assumptions of Theorem 3.3 and Corollary 3.4 for different CNN architectures and benchmark image classification datasets.

4 Experimental Analyses

We examine the proposed G-SGD method for training of state-of-the-art CNNs, called Residual Networks (Resnets) [9], equipped with different numbers of layers and kernels. We use three benchmark RGB image classification datasets, namely Cifar-10, Cifar-100 and Imagenet [18]. The Cifar-10 and Cifar-100 datasets consist of $5\times 10^{4}$ training and $10^{4}$ test images belonging to 10 and 100 classes, respectively. The Imagenet dataset consists of $10^{3}$ classes (approximately $1.2\times 10^{6}$ training and $5\times 10^{4}$ validation images).

We construct ensembles of PEMs using the sphere (Sp), the oblique (Ob) and the Stiefel (St) manifolds. We also use the kernels residing on the ambient Euclidean space of embedded kernel submanifolds (Euc.). In order to preserve the task structure (classification of RGB images), we employed PI for the layers $l=2,3,\ldots,L$ considering the RGB space of images, PO for $l=1,2,\ldots,L-1$ considering the number of classes learned at the top $L^{th}$ layer of a CNN, and PIO for $l=2,\ldots,L-1$ . Suppose that we have a set of $N_{l}$ kernels $\mathfrak{N}_{l}$ with $|\mathfrak{N}_{l}|=N_{l}$ and $|\mathcal{I}_{\mathcal{G}_{l}}|=N_{l}$ at the $l^{th}$ layer of a CNN. In the construction of ensembles, we employ PI, PO and PIO using a kernel set splitting (KSS) scheme. In KSS, we split the kernel set $\mathfrak{N}_{l}$ into $M$ subsets ${\mathfrak{N}_{l}^{m}\subset\mathfrak{N}_{l}}$ , $\forall m=1,2,\ldots,M$ , where kernels ${\omega\in\mathfrak{N}^{m}_{l}}$ belonging to $\mathfrak{N}_{l}^{m}$ reside on the $m^{th}$ PEM $\mathbb{M}_{G^{m}_{l}}$ identified by $\mathcal{I}_{G^{m}_{l}}$ which is determined according to PI, PO and PIO, $\forall m$ . For the sake of simplicity of the analyses, we split the kernel set into subsets with size $\frac{N_{l}}{M}$ in KSS, while the proposed schemes enable us to construct new kernel sets with varying size. Implementation details of G-SGD for different ensembles and Resnets, data pre-processing details of the benchmark datasets and additional results are given in the supp. mat.
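The equal-size kernel set splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: kernels are represented by integer indices, subsets are formed from contiguous index ranges, and each subset is tagged with a single label. The function name `kernel_set_splitting` and the labels are hypothetical, and the source does not specify how kernels are assigned to subsets.

```python
def kernel_set_splitting(kernel_ids, pem_labels):
    """Split a kernel set of size N_l into M = len(pem_labels) subsets of size N_l / M.

    kernel_ids : list of kernel indices at one layer
    pem_labels : one label per PEM (hypothetical names)
    Returns a dict mapping each label to its subset of kernel indices.
    """
    n_kernels = len(kernel_ids)
    n_pems = len(pem_labels)
    if n_kernels % n_pems != 0:
        raise ValueError("N_l must be divisible by M for equal-size subsets")
    size = n_kernels // n_pems
    # Contiguous assignment is an assumption; any partition into equal parts would do.
    return {
        label: kernel_ids[m * size:(m + 1) * size]
        for m, label in enumerate(pem_labels)
    }

# 24 kernels split over M = 4 PEMs gives 6 kernels per subset.
subsets = kernel_set_splitting(list(range(24)), ["Euc", "Sp", "Ob", "St"])
```

Using a kernel count that is a multiple of 24 keeps the subsets equal in size for $M=2,3,4$, which may be why the experiments increase the number of kernels to 24 and its multiples.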

4.1 Analysis of Classification Performance on Benchmark Datasets

We analyze classification performance of CNNs trained using G-SGD on the benchmark Cifar-10, Cifar-100 and Imagenet datasets. In order to construct ensembles of kernels belonging to Euc., Sp, St and Ob using KSS, we increase the number of kernels used in CNNs to 24 and its multiples (see the supp. mat.). We use other hyperparameters of CNNs as suggested in [9, 12, 21]. We mark the performance of our implementation of CNNs for baseline geometries (Euc., Sp, St and Ob) with $\dagger$ in the tables. For computation of $\mathfrak{g}(\omega_{G^{m}_{l}}^{t})$, we used

[TABLE]

as suggested in Corollary 3.4. Implementation details are given in the supp. mat.

We examine classification performance of Resnets with 44 layers (Resnet-44) and 18 layers (Resnet-18) on Cifar-10 with data augmentation (DA) and Imagenet in Table I and Table II, respectively. The results show that the performance of CNNs is boosted by employing ensembles of PEMs (denoted by PI, PO and PIO for PEMs) using G-SGD compared to the baseline Euc. We observe that PEMs of component submanifolds of identical geometry (denoted by PEMs of Sp/St/Ob), and their ensembles (denoted by PI, PO, PIO for PEMs of Sp/St/Ob) provide better performance than the component submanifolds alone (denoted by Sp/Ob/St) [21]. For instance, we obtain $28.64\%$, $28.72\%$ and $27.83\%$ error using PIO for PEMs of Sp, Ob and St in Table II, respectively, whereas the error obtained using Sp, Ob and St is $28.71\%$, $28.83\%$ and $28.02\%$, respectively.

In addition, we obtain $0.28\%$ and $2.06\%$ boosts in performance by ensembling St with Euc. ( $6.77\%$ and $28.25\%$ error using PIO for Euc.+St) on the Cifar-10 and Imagenet datasets in Table I and Table II, respectively. Moreover, we observe that ensembles constructed using Ob perform better with PI than with PO. For instance, PI for PEMs of Ob provides $6.81\%$ and $28.75\%$ while PO for PEMs of Ob provides $6.83\%$ and $28.81\%$ in Table I and Table II, respectively. We may associate this result with the observation that kernels belonging to Ob are used for feature selection and modeling of texture patterns with high performance [1, 21]. However, ensembles of St and Sp perform better for PO ( $6.59\%$ and $28.01\%$ in Table I and Table II) compared to PI ( $6.67\%$ and $28.64\%$ in Table I and Table II) on kernels employed on output channels.

It is also observed that PIO performs better than PI and PO in all the experiments. We observe $1.13\%$ and $3.24\%$ boosts by constructing an ensemble of four manifolds (Sp+Ob+St+Euc.) using the PIO scheme in Table I ( $5.92\%$ ) and Table II ( $27.07\%$ ), respectively. In other words, ensemble methods boost the performance of large-scale CNNs on large-scale datasets (e.g. Imagenet), which consist of a larger number of samples and classes, more than the performance of smaller CNNs on smaller datasets (e.g. Cifar-10). This result can be attributed to enhancement of sets of features learned using multiple constraints on kernels.
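The boosts quoted for Table I appear to be absolute differences in classification error (percentage points) relative to the Euc. $\dagger$ baseline of $7.05\%$ reported in Table I. Under that reading, which is an interpretation rather than a statement from the source, the figures are reproduced by

$$7.05-6.77=0.28,\qquad 7.05-5.92=1.13.$$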

We analyze this observation by examining the performance of larger CNNs consisting of 110 layers on Cifar-10 and Cifar-100 datasets with and without using DA in Table III. The results show that employment of PEMs can boost the performance of CNNs that use component submanifolds (e.g. PEMs of Sp, Ob and St) more for larger networks (Table III) compared to smaller networks (Table I and Table II). Moreover, employment of PIO for PEMs of Sp+Ob+St+Euc. boosts the performance of CNNs that use Euc. more for Cifar-100 (3.55% boost on average) compared to the performance obtained for Cifar-10 (1.58% boost on average). In addition, we observe that ensembles boost the performance of CNNs that use DA methods more than that of CNNs without DA.

Our method fundamentally differs from network ensembles. In order to analyze the results for network ensembles of CNNs, we employed an ensemble method [9] based on voting over the decisions of Resnet-44 on Cifar-10. When CNNs trained on individual Euc, Sp, Ob, and St are ensembled using voting, we obtain $7.02\%$ (Euc+Sp+Ob+St) and $6.85\%$ (Sp+Ob+St) errors (see Table I for comparison). In our analyses of ensembles (PI, PO and PIO), each PEM contains $\frac{N_{l}}{M}$ kernels, where $N_{l}$ is the number of kernels used at the $l^{th}$ layer, and $M$ is the number of PEMs. When each CNN in the ensemble is trained using an individual manifold which contains $\frac{1}{4}$ of the kernels (using $M=4$ as in our experiments), we obtain $11.02\%$ (Euc), $7.76\%$ (Sp), $7.30\%$ (Ob), $7.18\%$ (St), $9.44\%$ (Euc+Sp+Ob+St) and $7.05\%$ (Sp+Ob+St) errors. Thus, our proposed methods outperform ensembles constructed by voting. Additional results are given in the supplemental material.
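For contrast with PEM ensembles, the voting baseline combines the decisions of separately trained networks only at prediction time. The sketch below shows plain majority voting over predicted class labels. The source does not specify the exact voting rule, so hard majority voting, the function name `majority_vote` and the toy predictions are all assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions : list of label lists, one per model, all of equal length
    Ties are resolved in favor of the label predicted by the earliest
    model in the list (Counter keeps first-seen order among equal counts).
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Toy predictions of three hypothetical models on three test images.
model_preds = [
    [0, 1, 2],
    [0, 2, 2],
    [1, 1, 2],
]
ensemble_preds = majority_vote(model_preds)   # [0, 1, 2]
```

In this scheme the networks never interact during training, whereas kernels on PEMs are optimized jointly within a single network.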

5 Conclusion and Discussion

We introduced and elucidated a problem of training CNNs using multiple constraints employed on convolution kernels with convergence properties. Following our theoretical results, we proposed the G-SGD algorithm and adaptive step size estimation methods for optimization on ensembles of PEMs that are identified by the constraints. The experimental results show that our proposed methods can improve convergence properties and classification performance of CNNs. Overall, the results show that employing ensembles of PEMs using G-SGD boosts the performance of larger CNNs (e.g. RCD and RSD) on large-scale datasets (e.g. Imagenet) more than the performance of small and medium scale networks (e.g. Resnets with 16 and 44 layers) on smaller datasets (e.g. Cifar-10).

In future work, we plan to extend the proposed framework by development of new ensemble schemes to perform various tasks such as machine translation and video recognition using CNNs and Recurrent Neural Networks (RNNs). In addition, the proposed methods can be applied to other stochastic optimization methods such as Adam and trust region methods. We believe that our proposed framework will be useful for researchers to study geometric properties of parameter spaces of deep networks, and to improve our understanding of deep feature representations.

Bibliography

[1] P. A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In Proc. 31st IEEE Int. Conf. Acoust., Speech Signal Process, volume 5, pages 945–948, Toulouse, France, May 2006.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA, 2007.
[3] P. A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
[4] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In Proc. of the 33rd Int. Conf. on Mach. Learn. (ICML), pages 1120–1128, New York City, NY, USA, June 2016.
[5] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In Proc. of the 33rd Int. Conf. on Mach. Learn. (ICML), pages 1168–1176, New York City, NY, USA, June 2016.
[6] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control, 58(9):2217–2229, Sept 2013.
[7] M. Cho and J. Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems (NIPS), 2017.
[8] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 854–863, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
