Optimization on Product Submanifolds of Convolution Kernels
Mete Ozay, Takayuki Okatani

TL;DR
This paper develops a geometry-aware optimization method for CNNs that trains on ensembles of product submanifolds of kernels, improving convergence and classification performance across multiple datasets.
Contribution
It introduces a novel approach for training CNNs on ensembles of product submanifolds of kernels, with a geometry-aware SGD algorithm and convergence analysis.
Findings
G-SGD improves training loss and convergence.
Classification performance is boosted using ensembles of PEMs.
Geometry-aware step size methods enhance CNN training.
Table I: Classification error (%) of Resnet-44 on Cifar-10 with data augmentation (DA).

| Model | Class. Error (%) |
|---|---|
| Euc. [9] | 7.17 |
| Euc. [21] | 7.16 |
| Euc. | 7.05 |
| Sp/Ob/St [21] | 6.99/6.89/6.81 |
| Sp/Ob/St | 6.84/6.87/6.73 |
| PEMs of Sp/Ob/St | 6.81/6.85/6.70 |
| PI for PEMs of Sp/Ob/St | 6.82/6.81/6.70 |
| PI (Euc.+Sp/Euc.+St/Euc.+Ob) | 6.89/6.84/6.88 |
| PI (Sp+Ob/Sp+St/Ob+St) | 6.75/6.67/6.59 |
| PI (Sp+Ob+St/Sp+Ob+St+Euc.) | 6.31/6.34 |
| PO for PEMs of Sp/Ob/St | 6.77/6.83/6.65 |
| PO (Euc.+Sp/Euc.+St/Euc.+Ob) | 6.85/6.78/6.90 |
| PO (Sp+Ob/Sp+St/Ob+St) | 6.62/6.59/6.51 |
| PO (Sp+Ob+St/Sp+Ob+St+Euc.) | 6.35/6.22 |
| PIO for PEMs of Sp/Ob/St | 6.71/6.73/6.61 |
| PIO (Euc.+Sp/Euc.+St/Euc.+Ob) | 6.95/6.77/6.82 |
| PIO (Sp+Ob/Sp+St/Ob+St) | 6.21/6.19/6.25 |
| PIO (Sp+Ob+St/Sp+Ob+St+Euc.) | 5.95/5.92 |

Table II: Top-1 classification error (%) of Resnet-18 on Imagenet.

| Model | Top-1 Error (%) |
|---|---|
| Euc. [21] | 30.59 |
| Euc. | 30.31 |
| Sp/Ob/St [21] | 29.13/28.97/28.14 |
| Sp/Ob/St | 28.71/28.83/28.02 |
| PEMs of Sp/Ob/St | 28.70/28.77/28.00 |
| PI for PEMs of Sp/Ob/St | 28.69/28.75/27.91 |
| PI (Euc.+Sp/Euc.+St/Euc.+Ob) | 30.05/29.81/29.88 |
| PI (Sp+Ob/Sp+St/Ob+St) | 28.61/28.64/28.49 |
| PI (Sp+Ob+St/Sp+Ob+St+Euc.) | 27.63/27.45 |
| PO for PEMs of Sp/Ob/St | 28.67/28.81/27.86 |
| PO (Euc.+Sp/Euc.+St/Euc.+Ob) | 29.58/29.51/29.90 |
| PO (Sp+Ob/Sp+St/Ob+St) | 28.23/28.01/28.17 |
| PO (Sp+Ob+St/Sp+Ob+St+Euc.) | 27.81/27.51 |
| PIO for PEMs of Sp/Ob/St | 28.64/28.72/27.83 |
| PIO (Euc.+Sp/Euc.+St/Euc.+Ob) | 29.19/28.25/28.53 |
| PIO (Sp+Ob/Sp+St/Ob+St) | 28.14/27.66/27.90 |
| PIO (Sp+Ob+St/Sp+Ob+St+Euc.) | 27.11/27.07 |

Table III: Classification error (%) of 110-layer CNNs on Cifar-10 and Cifar-100 with and without data augmentation (DA). The rows following RCD [11] and RSD [11] report results for the respective architecture.

| Model | Cifar-10 with DA | Cifar-100 with DA | Cifar-10 without DA | Cifar-100 without DA |
|---|---|---|---|---|
| RCD [11] | 6.41 | 27.22 | 13.63 | 44.74 |
| Euc. | 6.30 | 27.01 | 13.57 | 44.65 |
| Sp/Ob/St [21] | 6.22/6.07/5.93 | 26.44/25.99/25.41 | 13.11/12.94/12.88 | 42.51/42.30/40.11 |
| Sp/Ob/St | 6.05/6.03/5.91 | 26.19/25.87/25.39 | 12.96/12.85/12.79 | 42.13/42.00/39.94 |
| PEMs of Sp/Ob/St | 6.00/6.01/5.86 | 25.93/25.74/25.18 | 12.74/12.77/12.74 | 42.02/42.88/39.90 |
| PIO for PEMs of Sp/Ob/St | 5.95/5.91/5.83 | 25.89/25.71/25.12 | 12.71/12.72/12.69 | 41.68/42.75/39.83 |
| PIO (Euc.+Sp/Euc.+St/Euc.+Ob) | 6.03/5.99/6.01 | 25.57/25.49/25.64 | 12.77/12.21/12.92 | 41.90/41.37/41.85 |
| PIO (Sp+Ob/Sp+St/Ob+St) | 5.97/5.86/5.46 | 24.71/24.96/24.76 | 11.47/11.65/11.51 | 41.49/40.53/40.34 |
| PIO (Sp+Ob+St/Sp+Ob+St+Euc.) | 5.25/5.17 | 23.96/23.79 | 11.29/11.15 | 39.53/39.35 |
| RSD [11] | 5.23 | 24.58 | 11.66 | 37.80 |
| Euc. | 5.17 | 24.39 | 11.40 | 37.55 |
| Sp/Ob/St [21] | 5.20/5.14/4.79 | 23.77/23.81/23.16 | 10.91/10.93/10.46 | 36.90/36.47/35.92 |
| Sp/Ob/St | 5.08/5.11/4.73 | 23.69/23.75/23.09 | 10.52/10.66/10.33 | 36.71/36.38/35.85 |
| PEMs of Sp/Ob/St | 5.05/5.08/4.69 | 23.51/23.60/23.85 | 10.41/10.54/10.25 | 36.40/36.11/35.53 |
| PIO for PEMs of Sp/Ob/St | 4.95/5.03/4.62 | 23.47/23.51/23.77 | 10.37/10.51/10.19 | 36.33/36.02/35.41 |
| PIO (Euc.+Sp/Euc.+St/Euc.+Ob) | 5.00/5.08/5.14 | 23.69/23.25/23.32 | 10.74/10.25/10.93 | 35.76/35.55/35.81 |
| PIO (Sp+Ob/Sp+St/Ob+St) | 4.70/4.58/4.90 | 22.84/22.91/22.80 | 10.13/10.24/10.06 | 35.66/35.01/35.35 |
| PIO (Sp+Ob+St/Sp+Ob+St+Euc.) | 4.29/4.31 | 22.19/22.03 | 9.52/9.56 | 34.49/34.25 |

Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi, Japan. {mozay,okatani}@vision.is.tohoku.ac.jp
Abstract
Recent advances in optimization methods used for training convolutional neural networks (CNNs) with kernels, which are normalized according to particular constraints, have shown remarkable success. This work introduces an approach for training CNNs using ensembles of joint spaces of kernels constructed using different constraints. For this purpose, we address a problem of optimization on ensembles of products of submanifolds (PEMs) of convolution kernels. To this end, we first propose three strategies to construct ensembles of PEMs in CNNs. Next, we expound their geometric properties (metric and curvature properties) in CNNs. We make use of our theoretical results by developing a geometry-aware SGD algorithm (G-SGD) for optimization on ensembles of PEMs to train CNNs. Moreover, we analyze convergence properties of G-SGD considering geometric properties of PEMs. In the experimental analyses, we employ G-SGD to train CNNs on Cifar-10, Cifar-100 and Imagenet datasets. The results show that geometric adaptive step size computation methods of G-SGD can improve training loss and convergence properties of CNNs. Moreover, we observe that classification performance of baseline CNNs can be boosted using G-SGD on ensembles of PEMs identified by multiple constraints.
1 Introduction
In recent works [4, 5, 8, 10, 13, 16, 17, 19, 22, 23], several methods have been suggested to train deep neural networks using kernels (weights) with various normalization constraints to boost their performance. Spaces of normalized kernels have been modeled as Riemannian manifolds (e.g. the Stiefel manifold), and stochastic optimization algorithms have been employed to train CNNs on kernel manifolds in [7, 14, 15, 21].
In this work, we suggest an approach for training CNNs using multiple constraints on kernels in order to learn a richer set of features compared to the features learned using single constraints. We address this problem by optimization on ensembles of products of different kernel submanifolds (PEMs) that are identified by different constraints on kernels. However, if we employ the aforementioned Riemannian SGD algorithms [6, 7, 21] on PEMs to train CNNs, then we observe early divergence as well as vanishing and exploding gradient problems. Therefore, we elucidate geometric properties of PEMs to assure convergence to local minima while training CNNs using our proposed geometry-aware stochastic gradient descent (G-SGD). Our contributions are summarized as follows:
1. We explicate the geometry of the space of convolution kernels defined by multiple constraints. For this purpose, we explore the relationship between geometric properties of PEMs, such as sectional curvature, geodesic distance, and gradients computed at PEMs, and those of component submanifolds of convolution kernels in CNNs (see Lemma 3.2 in Section 3).
2. We propose an SGD algorithm (G-SGD) for optimization on different ensembles of PEMs (Section 3) by generalizing the SGD methods employed on kernel submanifolds [14, 15, 21]. Next, we explore the effect of geometric properties of the PEMs on the convergence of the G-SGD using our theoretical results. Then, we employ the results for adaptive computation of the step size of the SGD (see Theorem 3.3 and Corollary 3.4). Moreover, we provide an example for computation of a step size function for optimization on PEMs identified by the sphere (Corollary 3.4). In addition, we propose three strategies in order to construct ensembles of identical and non-identical kernel spaces according to their employment on input and output channels in CNNs in Section 2. To the best of our knowledge, our proposed G-SGD is the first algorithm which performs optimization on different ensembles of PEMs to train CNNs with convergence properties.
3. We experimentally analyze convergence properties and classification performance of CNNs on benchmark image classification datasets such as Cifar-10/100 and Imagenet, using various manifold ensemble schemes (Section 4). In the results, we observe that G-SGD employed on ensembles of PEMs can boost baseline state-of-the-art performance of CNNs.
Proofs of the theorems, additional results, and implementation details of the algorithms and datasets are given in the supplemental material.
2 Construction of Ensembles of PEMs
Suppose that we are given a set of training samples of a random variable drawn from a distribution on a measurable space , where is a class label of the image . An -layer CNN consists of a set of tensors , where , and is a tensor composed of kernels (weight matrices) constructed at each layer , for each channel and each kernel (we use shorthand notation for matrix concatenation). At each convolution layer, we compute a feature representation by compositionally employing non-linear functions, and convolving an image with kernels by
[TABLE]
where is an image for , and . The channel of the data matrix is convolved with the kernel to obtain the feature map (we ignore the bias terms in the notation for simplicity). Given a batch of samples , we denote a value of a classification loss function for a kernel by , and the loss function of kernels utilized in the CNN by . Assuming that contains a single sample, an expected loss or cost function of the CNN is computed by
[TABLE]
The expected loss for is computed by
[TABLE]
For a finite set of samples , is approximated by an empirical loss , where is the size of (similarly, is approximated by the empirical loss for ). Then, feature representations are learned by solving
[TABLE]
using an SGD algorithm. In the SGD algorithms employed on kernel submanifolds [14, 15, 21], each kernel is assumed to reside on an embedded kernel submanifold at the layer of a CNN, such that . In this work, we propose a geometry-aware SGD algorithm (G-SGD), by generalizing the SGD algorithms [14, 15, 21] for optimization on ensembles of different products of the kernel submanifolds, which are defined next.
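Definition 2.1 (Products of embedded kernel submanifolds of convolution kernels (PEMs) and their ensemble).
Suppose that $\mathcal{G}_l = \{\mathbb{M}_\iota : \iota \in \mathcal{I}_{G_l}\}$ is an ensemble of Riemannian kernel submanifolds $\mathbb{M}_\iota$ of dimension $n_\iota$, which is identified by a set of indices $\mathcal{I}_{G_l}$. More concretely, $\mathcal{I}_{G_l}$ contains indices, each of which represents an identity number $\iota$ of a kernel $\omega_{\iota,l}$ that resides on a manifold $\mathbb{M}_\iota$ at the $l$-th layer. In addition, a subset $\mathcal{I}_{G^m_l} \subseteq \mathcal{I}_{G_l}$, $m = 1, \ldots, M$, is used to determine a subset of kernel submanifolds which will be aggregated to construct a PEM, and satisfies the following properties:
- Each subset of indices contains at least one kernel such that $\mathcal{I}_{G^m_l} \neq \emptyset$, for each $m$.
- The set of indices $\mathcal{I}_{G_l}$ is covered by the subsets such that $\mathcal{I}_{G_l} = \bigcup_{m=1}^{M} \mathcal{I}_{G^m_l}$.
- If kernels are not shared among PEMs such that ensembles are constructed using non-overlapping sets, then $\mathcal{I}_{G^m_l} \cap \mathcal{I}_{G^{m'}_l} = \emptyset$ for $m \neq m'$.
- If kernels are shared among PEMs such that ensembles are constructed using overlapping sets, then $\mathcal{I}_{G^m_l} \cap \mathcal{I}_{G^{m'}_l} \neq \emptyset$ for $m \neq m'$.

A product manifold of convolution kernels ($G^m_l$-PEM) constructed at the $l$-th layer of an $L$-layer CNN, denoted by $\mathbb{M}_{G^m_l}$, is a product of embedded kernel submanifolds belonging to $\mathcal{G}_l$ which is computed by

$$\mathbb{M}_{G^m_l} = \bigtimes\limits_{\iota \in \mathcal{I}_{G^m_l}} \mathbb{M}_\iota, \qquad (5)$$

where $\bigtimes$ is the topological Cartesian product, and therefore the topology of $\mathbb{M}_{G^m_l}$ is a product topology. Each $\mathbb{M}_\iota$ is called a component submanifold of $\mathbb{M}_{G^m_l}$. A kernel $\omega_{G^m_l} \in \mathbb{M}_{G^m_l}$ is then obtained by concatenating the kernels $\omega_{\iota,l} \in \mathbb{M}_\iota$, $\iota \in \mathcal{I}_{G^m_l}$, using $\omega_{G^m_l} = (\omega_{1,l}, \ldots, \omega_{|\mathcal{I}_{G^m_l}|,l})$, where $|\mathcal{I}_{G^m_l}|$ is the cardinality of $\mathcal{I}_{G^m_l}$. A $G_l$-PEM is called an ensemble of PEMs constructed using (5) for $m = 1, \ldots, M$.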
We compute a PEM using component submanifolds in (5) utilizing , and construct ensembles of PEMs using . Recall that, at each layer of an -layer CNN, we compute a convolution kernel , , , . We first choose subsets of indices of input channels and subsets of indices of output channels , such that and . Then, we propose three strategies for determination of index sets (see Figure 1):
1. PEMs for input channels (PI): For each input channel, we construct , where and the Cartesian product preserves the input channel index, .
2. PEMs for output channels (PO): For each output channel, we construct , where and the Cartesian product preserves the output channel index, .
3. PEMs for input and output channels (PIO): We construct , where and such that .
Example 2.2.
An illustration of employment of PI, PO and PIO at a layer of a CNN is given in Figure 1. Suppose that we have a kernel tensor where the numbers of input and output channels are 4 and 6, respectively. In total, we have 4 × 6 = 24 kernel matrices. An example of construction of an ensemble of PEMs is as follows.
1. PI: For each of 4 input channels, we split a set of 6 kernels associated with 6 output channels into two subsets of 3 kernels. Choosing the sphere (Sp) for the first subset, we construct a PEM as a product of 3 Sp using (5). That is, each of the 3 component manifolds of the PEM is a sphere. Similarly, choosing the Stiefel (St) for the second subset, we construct another PEM as a product of 3 St (each of the 3 component manifolds of the second PEM is a Stiefel manifold). Thus, at this layer, we construct an ensemble of 4 PEMs of 3 St and 4 PEMs of 3 Sp.
2. PO: For each of 6 output channels, we split a set of 4 kernels corresponding to the input channels into two subsets of 2 kernels. We choose the Sp for the first subset, and we construct a PEM as a product of 2 Sp using (5). We choose the St for the second subset, and we construct a PEM as a product of 2 St. Thereby, we have an ensemble consisting of 6 PEMs of St and 6 PEMs of Sp.
3. PIO: We split the set of 24 kernels into 10 subsets. For each of 6 output channels, we split the set of kernels corresponding to the input channels into 3 subsets. We choose the Sp for 2 subsets each containing 3 kernels, and 3 subsets each containing 2 kernels. We choose the St similarly for the remaining subsets. Then, our ensemble contains 5 PEMs of St and 5 PEMs of Sp.
Our framework can be used to model both overlapping and non-overlapping sets. If ensembles are constructed using overlapping sets, then kernels having different constraints can be applied to the same input or output channels. For example, kernels belonging to a PEM of 3 St and kernels belonging to a PEM of 3 Sp can be applied to the same output (input) channel for PI (PO) in the previous example (see Figure 1). More complicated configurations can be obtained using PIO. In the experiments, we selected non-overlapping sets for simplicity. We consider theoretical and experimental analyses of overlapping sets as a future work.
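The index sets of Example 2.2 can be sketched in code. The following is a minimal illustration, assuming kernels are addressed by hypothetical `(input_channel, output_channel)` pairs and that non-overlapping subsets are used; the function names are not from the paper, and the PIO variant is simplified to a flat partition of the 24 kernels rather than the per-channel splitting shown in Figure 1.

```python
from itertools import product

C_IN, C_OUT = 4, 6  # channel counts taken from Example 2.2


def pi_index_sets(c_in=C_IN, c_out=C_OUT):
    """PI: for every input channel, split its output-channel kernels in two halves."""
    half = c_out // 2
    sets = []
    for c in range(c_in):
        sets.append(("Sp", [(c, d) for d in range(half)]))
        sets.append(("St", [(c, d) for d in range(half, c_out)]))
    return sets


def po_index_sets(c_in=C_IN, c_out=C_OUT):
    """PO: for every output channel, split its input-channel kernels in two halves."""
    half = c_in // 2
    sets = []
    for d in range(c_out):
        sets.append(("Sp", [(c, d) for c in range(half)]))
        sets.append(("St", [(c, d) for c in range(half, c_in)]))
    return sets


def pio_index_sets(c_in=C_IN, c_out=C_OUT, sizes=(3, 3, 2, 2, 2)):
    """PIO (simplified assumption): flat partition of all kernels into
    subsets of the given sizes, first for Sp and then for St."""
    all_idx = list(product(range(c_in), range(c_out)))
    sets, pos = [], 0
    for name in ("Sp", "St"):
        for n in sizes:
            sets.append((name, all_idx[pos:pos + n]))
            pos += n
    return sets
```

Each returned pair holds a manifold label and the index subset of one PEM, so PI yields 8 PEMs of 3 kernels, PO yields 12 PEMs of 2 kernels, and PIO yields 10 PEMs, matching the counts in the example.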
3 Optimization on Ensembles of PEMs using Geometry-aware SGD in CNNs
If an SGD is employed on non-linear kernel submanifolds, then the gradient descent is generally performed in three steps: i) projection of gradients onto tangent spaces of the submanifolds, ii) movement of kernels on the tangent spaces in the gradient descent direction, and iii) projection of the moved kernels onto the submanifolds [21]. These steps are determined according to the geometric properties of the submanifolds, such as sectional curvature and metric properties. For example, the Euclidean space has zero sectional curvature, i.e. it is flat. Thereby, these steps can be performed in a single step if an SGD employs kernels residing on the Euclidean space. However, if kernels belong to the unit sphere, then the kernel space has constant positive curvature. Moreover, a different tangent space is computed at each kernel located on the sphere. Therefore, the nonlinearity of the operations and transformations applied to kernels, which is implied by the curvature and metric of the kernel spaces, has to be taken into account in the aforementioned three steps. In addition, martingale properties of stochastic processes defined by kernels are determined by geodesics, metrics, gradients projected onto tangent spaces and the injectivity radius of kernel spaces (see proofs of Theorem 3.3 and Corollary 3.4 in the supp. mat. for details).
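The three steps can be illustrated for a single kernel on the unit sphere. This is a minimal sketch, assuming a constant learning rate and the standard renormalization retraction; `sphere_sgd_step` is a hypothetical helper name, and the adaptive step size of G-SGD is not modeled here.

```python
import numpy as np


def sphere_sgd_step(w, euclid_grad, lr):
    """One Riemannian SGD step for a kernel w with unit Frobenius norm."""
    # i) Project the Euclidean gradient onto the tangent space at w
    #    by removing its component along w.
    rgrad = euclid_grad - np.sum(euclid_grad * w) * w
    # ii) Move on the tangent space in the descent direction.
    moved = w - lr * rgrad
    # iii) Map the moved kernel back onto the sphere (retraction by renormalization).
    return moved / np.linalg.norm(moved)
```

On the Euclidean space the projection and the retraction are identities, so the same update collapses to the single step `w - lr * euclid_grad`.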
Geometric properties of PEMs can be different from those of the component submanifolds of PEMs, even if they are constructed using identical submanifolds. For example, we observe locally varying curvatures when we construct PEMs of spheres (see Figure 2). Kernel spaces with more complicated geometric properties can be obtained using the proposed strategies (PI, PO, PIO), especially by constructing ensembles of PEMs of non-identical submanifolds (see Section 4 for details and examples). Thus, as the complexity of the geometry of kernel spaces increases, its effect on the performance and convergence of SGD increases.
In order to address these problems and consider geometric properties of kernel submanifolds for training of CNNs, we propose a geometry-aware SGD (G-SGD). We employ metric properties of PEMs to perform gradient descent steps of G-SGD, and use curvature properties of PEMs to explore convergence properties of G-SGD. We explore metric and curvature properties of PEMs in Lemma 3.2.
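Definition 3.1 (Sectional curvature of component submanifolds).
Let $\mathfrak{X}(\mathbb{M}_\iota)$ denote the set of smooth vector fields on $\mathbb{M}_\iota$. The sectional curvature of $\mathbb{M}_\iota$ associated with a two dimensional subspace $\mathfrak{T} \subset \mathcal{T}_{\omega_\iota}\mathbb{M}_\iota$ is defined by

$$\mathfrak{c}_\iota = \frac{\langle \mathcal{C}_\iota(X, Y)Y, X \rangle}{\langle X, X \rangle \langle Y, Y \rangle - \langle X, Y \rangle^{2}}, \qquad (6)$$

where $\mathcal{C}_\iota$ is the Riemannian curvature tensor (additional definitions are given in the supp. mat.), $\langle \cdot, \cdot \rangle$ is an inner product, and $X, Y \in \mathfrak{X}(\mathbb{M}_\iota)$ form a basis of $\mathfrak{T}$.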
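Lemma 3.2 (Metric and curvature properties of PEMs).
Suppose that $u_\iota$ and $v_\iota$ are tangent vectors belonging to the tangent space $\mathcal{T}_{\omega_\iota}\mathbb{M}_\iota$ computed at $\omega_\iota \in \mathbb{M}_\iota$, $\iota \in \mathcal{I}_{G^m_l}$. Then, tangent vectors $u_{G^m_l}$ and $v_{G^m_l}$ are computed at $\omega_{G^m_l} \in \mathbb{M}_{G^m_l}$ by concatenation as $u_{G^m_l} = (u_1, \ldots, u_{|\mathcal{I}_{G^m_l}|})$ and $v_{G^m_l} = (v_1, \ldots, v_{|\mathcal{I}_{G^m_l}|})$. If each kernel submanifold $\mathbb{M}_\iota$ is endowed with a Riemannian metric $g_\iota$, then a $G^m_l$-PEM is endowed with the metric $g_{G^m_l}$ computed by

$$g_{G^m_l}(u_{G^m_l}, v_{G^m_l}) = \sum_{\iota \in \mathcal{I}_{G^m_l}} g_\iota(u_\iota, v_\iota). \qquad (7)$$

In addition, suppose that $\mathcal{C}_\iota$ is the Riemannian curvature tensor field (endomorphism) [20] of $\mathbb{M}_\iota$, $\iota \in \mathcal{I}_{G^m_l}$, defined by

$$\mathcal{C}_\iota(X_\iota, Y_\iota) Z_\iota = \nabla_{X_\iota} \nabla_{Y_\iota} Z_\iota - \nabla_{Y_\iota} \nabla_{X_\iota} Z_\iota - \nabla_{[X_\iota, Y_\iota]} Z_\iota, \qquad (8)$$

where $X_\iota, Y_\iota, Z_\iota$ are vector fields on $\mathbb{M}_\iota$, $\nabla$ is the Levi-Civita connection and $[\cdot, \cdot]$ is the Lie bracket. Then, the Riemannian curvature tensor field of $\mathbb{M}_{G^m_l}$ is computed by

$$\mathcal{C}_{G^m_l}(X, Y) Z = \big( \mathcal{C}_1(X_1, Y_1) Z_1, \ldots, \mathcal{C}_{|\mathcal{I}_{G^m_l}|}(X_{|\mathcal{I}_{G^m_l}|}, Y_{|\mathcal{I}_{G^m_l}|}) Z_{|\mathcal{I}_{G^m_l}|} \big), \qquad (9)$$

where $X = (X_1, \ldots, X_{|\mathcal{I}_{G^m_l}|})$, and $Y$ and $Z$ are composed likewise. Moreover, $\mathbb{M}_{G^m_l}$ never has strictly positive sectional curvature in the metric (7). In addition, if $\mathbb{M}_{G^m_l}$ is compact, then it does not admit a metric with negative sectional curvature.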
We compute the metric of a $G^m_l$-PEM $\mathbb{M}_{G^m_l}$ using the metrics identified on the component manifolds $\mathbb{M}_\iota$ employing (7) given in Lemma 3.2. In addition, we use the Riemannian curvature and sectional curvature of $\mathbb{M}_{G^m_l}$ given in Lemma 3.2 to analyze convergence of our proposed G-SGD, and to compute the adaptive step size.
Note that some sectional curvatures vanish on $\mathbb{M}_{G^m_l}$ by the lemma. For instance, suppose that each $\mathbb{M}_\iota$ is a unit two-sphere $\mathbb{S}^2$ (see Figure 2.a). Then, $\mathbb{M}_{G^m_l}$ computed by (5) has unit curvature along the two-dimensional subspaces of its tangent spaces, called two-planes, that are tangent to a single sphere. On the other hand, $\mathbb{M}_{G^m_l}$ has zero curvature along all two-planes spanning exactly two distinct spheres. Therefore, learning rates need to be computed adaptively according to sectional curvatures at each layer of the CNN and at each epoch of the G-SGD for each kernel on each manifold.
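The example of a product of unit spheres can be checked numerically. The sketch below assumes the standard curvature tensor of the unit sphere, R(x, y)z = ⟨y, z⟩x − ⟨x, z⟩y, together with the componentwise product metric and curvature described in Lemma 3.2; the function names are hypothetical, and tangent vectors of the product are represented as tuples with one block per sphere.

```python
import numpy as np


def product_metric(u, v):
    """Product metric: sum of the component inner products."""
    return sum(float(np.dot(ui, vi)) for ui, vi in zip(u, v))


def sphere_curvature_term(x, y):
    """<R(x, y)y, x> on the unit sphere, with R(x, y)z = <y, z>x - <x, z>y."""
    return float(np.dot(x, x) * np.dot(y, y) - np.dot(x, y) ** 2)


def sectional_curvature(u, v):
    """Sectional curvature of a product of unit spheres along span{u, v}."""
    num = sum(sphere_curvature_term(ui, vi) for ui, vi in zip(u, v))
    den = product_metric(u, u) * product_metric(v, v) - product_metric(u, v) ** 2
    return num / den


# Tangent vectors at the north pole (0, 0, 1) of each of two unit two-spheres.
e1, e2, zero = np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.zeros(3)

k_single = sectional_curvature((e1, zero), (e2, zero))  # plane inside one sphere
k_mixed = sectional_curvature((e1, zero), (zero, e1))   # plane across two spheres
k_diag = sectional_curvature((e1, e1), (e2, e2))        # plane touching both spheres
```

Under these assumptions the three planes give curvatures 1, 0 and 0.5, which illustrates how a PEM of identical spheres has locally varying, and in some directions vanishing, sectional curvature.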
3.1 Optimization using G-SGD in CNNs
An algorithmic description of our proposed geometry-aware SGD (G-SGD) is given in Algorithm 12. At the initialization of the G-SGD, we identify the component embedded kernel submanifolds $\mathbb{M}_\iota$ according to the constraints that will be applied on the kernels $\omega_{\iota,l}$. For instance, we employ the normalization constraint $\|\omega_{\iota,l}\|_F = 1$ for kernels residing on the unit sphere, where $\|\cdot\|_F$ is the Frobenius norm [2]. In the experimental analyses, we use the oblique and the Stiefel manifolds as well as the sphere and the Euclidean space to identify component submanifolds.
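To make the constraints concrete, the following sketch maps an arbitrary kernel matrix onto the sphere, the oblique manifold and the Stiefel manifold. It assumes the usual matrix-manifold conventions (unit Frobenius norm, unit-norm columns, and orthonormal columns, respectively); the helper names are hypothetical and the maps are generic normalizations, not necessarily the retractions used in the authors' implementation.

```python
import numpy as np


def to_sphere(w):
    """Sphere: scale the kernel to unit Frobenius norm."""
    return w / np.linalg.norm(w)


def to_oblique(w):
    """Oblique manifold: scale every column to unit Euclidean norm."""
    return w / np.linalg.norm(w, axis=0, keepdims=True)


def to_stiefel(w):
    """Stiefel manifold: nearest matrix with orthonormal columns (polar factor via SVD)."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt
```

For the Euclidean space no map is needed, since kernels are left unconstrained.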
When we employ a G-SGD on a -PEM , each kernel is moved on the -PEM in the descent direction of gradient of loss at each step of the G-SGD. More precisely, direction and amount of movement of a kernel are determined at the step and the layer by the following steps of Algorithm 12:
1. Line 6: Using Lemma 3.2, the gradient , which is obtained using back-propagation from the upper layer, is projected onto the tangent space $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}$ at $\omega^{t}_{G^{m}_{l}}$, where $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}=\bigtimes\limits_{\iota\in\mathcal{I}_{G^{m}_{l}}}\mathcal{T}_{\omega^{t}_{\iota,l}}\mathbb{M}_{\iota}$.
2. Line 7: Movement of $\omega^{t}_{G^{m}_{l}}$ on $\mathcal{T}_{\omega^{t}_{G^{m}_{l}}}\mathbb{M}_{G^{m}_{l}}$ using computed by
[TABLE]
where is the learning rate that satisfies
[TABLE]
, , , is the geodesic distance between $\omega^{t}_{G^{m}_{l}}$ and a local minimum on $\mathbb{M}_{G^{m}_{l}}$, is the sectional curvature of $\mathbb{M}_{G^{m}_{l}}$, which can be computed using Lemma 3.2 by
[TABLE]
3. Line 8: Projection of the moved kernel at onto the manifold $\mathbb{M}_{G^{m}_{l}}$ using to compute , where is an exponential map, or a retraction which is an approximation of the exponential map [3].
The denominator used for computation of the step size in (10) is employed as a regularizer to control the change of gradient at each step of G-SGD. This property is examined in the experimental analyses for PEMs of different manifolds. For computation of the sectional curvature, we use (12) utilizing Lemma 3.2. Unlike related works, kernels residing on each PEM are moved and projected jointly on the PEMs in G-SGD, by which we can employ their interaction using the corresponding gradients considering the nonlinear geometry of manifolds. G-SGD can perform optimization on PEMs and their ensemble according to the index sets, recursively. Thereby, G-SGD can consider interactions between component manifolds as well as those between PEMs in an ensemble. SGD methods studied in the literature do not have assurance of convergence when they are applied to optimization on ensembles of PEMs. Employment of (10) and (11) at line 7, and retractions at line 8 are essential for assurance of convergence as explained next.
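The joint treatment of the kernels of a PEM can be sketched for a product of unit spheres. The code below is only an illustration of the three algorithm lines: the regularizer `max(1, joint_norm)` in the denominator is a stand-in assumption, since the step size functions (10) and (11) are not reproduced in this text, and `pem_step` is a hypothetical name.

```python
import numpy as np


def pem_step(kernels, grads, lr):
    """One joint step for kernels on a product of unit spheres (a PEM of Sp)."""
    # Line 6: the tangent space of the product is the product of tangent spaces,
    # so the gradient is projected block by block.
    rgrads = [g - np.sum(g * w) * w for w, g in zip(kernels, grads)]
    # Norm of the joint gradient in the product metric (sum of block inner products).
    joint_norm = np.sqrt(sum(np.sum(r * r) for r in rgrads))
    # Line 7: move on the tangent space; the denominator is a HYPOTHETICAL
    # regularizer that bounds the joint displacement by lr.
    step = lr / max(1.0, joint_norm)
    moved = [w - step * r for w, r in zip(kernels, rgrads)]
    # Line 8: retract every block back onto its sphere.
    return [m / np.linalg.norm(m) for m in moved]
```

Because the step is scaled by the norm of the joint gradient rather than by each block separately, a large gradient on one component manifold also shortens the movement on the others, which is one simple way to model the interaction between components.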
3.2 Convergence Properties of G-SGD
In some machine learning tasks, such as clustering [6, 24], the geodesic distance can be computed in closed form. However, a closed form solution may not be computed using CNNs due to the challenge of computation of local minima. Therefore, we provide an asymptotic convergence property for Algorithm 12 in the next theorem.
Theorem 3.3.
Suppose that there exists a local minimum , and such that , where is an exponential map or a twice continuously differentiable retraction, and is the inner product. Then, the loss function and the gradient converge almost surely (a.s.) by , and , for each .
Theorem 3.3 assures convergence of the G-SGD (Algorithm 12) to minima. For implementation of G-SGD, we use the result given in Lemma 3.2 for PEMs to employ sectional curvatures. Although sectional curvatures of non-identical embedded kernel submanifolds can be different [21], Lemma 3.2 assures existence of zero sectional curvature in PEMs along their tangent spaces. In the next corollary, we provide an example for computation of a step size function for component embedded kernel submanifolds determined by the sphere using the result given in Lemma 3.2, and explore its convergence property using Theorem 3.3.
Corollary 3.4.
Suppose that are identified by dimensional unit sphere , and , where is an upper bound on the sectional curvatures of at . If step size is computed using (10) with
[TABLE]
then , and , for each .
In the experimental analyses, we use different step size functions and analyze convergence properties and performance of CNNs trained using G-SGD by relaxing assumptions of Theorem 3.3 and Corollary 3.4 for different CNN architectures and benchmark image classification datasets.
4 Experimental Analyses
We examine the proposed G-SGD method for training of state-of-the-art CNNs, called Residual Networks (Resnets) [9], equipped with different numbers of layers and kernels. We use three benchmark RGB image classification datasets, namely Cifar-10, Cifar-100 and Imagenet [18]. The Cifar-10 and Cifar-100 datasets each consist of 50,000 training and 10,000 test images belonging to 10 and 100 classes, respectively. The Imagenet dataset consists of 1,000 classes (about 1.28 million training and 50,000 validation images).
We construct ensembles of PEMs using the sphere (Sp), the oblique (Ob) and the Stiefel (St) manifolds. We also use the kernels residing on the ambient Euclidean space of embedded kernel submanifolds (Euc.). In order to preserve the task structure (classification of RGB images), we employed PI for the layers considering the RGB space of images, PO for considering the number of classes learned at the top layer of a CNN, and PIO for . Suppose that we have a set of kernels with and at the layer of a CNN. In the construction of ensembles, we employ PI, PO and PIO using a kernel set splitting (KSS) scheme. In KSS, we split the kernel set into subsets , , where kernels belonging to reside on the PEM identified by which is determined according to PI, PO and PIO, . For the sake of simplicity of the analyses, we split the kernel set into subsets with size in KSS, while the proposed schemes enable us to construct new kernel sets with varying size. Implementation details of G-SGD for different ensembles and Resnets, data pre-processing details of the benchmark datasets and additional results are given in the supp. mat.
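A kernel set splitting step with equal-size subsets can be sketched as follows. This is a minimal illustration, assuming kernels are identified by integer indices and that contiguous blocks are assigned to the manifolds in the given order; `kernel_set_split` is a hypothetical name, and the actual assignment used in the experiments is described in the supp. mat.

```python
def kernel_set_split(num_kernels, manifolds):
    """Split kernel indices 0..num_kernels-1 into equal subsets, one per manifold."""
    m = len(manifolds)
    if num_kernels % m != 0:
        raise ValueError("number of kernels must be divisible by the number of manifolds")
    size = num_kernels // m
    # Assumption: contiguous blocks of indices, assigned in the order given.
    return {name: list(range(i * size, (i + 1) * size))
            for i, name in enumerate(manifolds)}


split = kernel_set_split(24, ["Euc.", "Sp", "Ob", "St"])
```

With 24 kernels, the subsets have equal size for ensembles of two, three or four manifolds, which is consistent with using 24 and its multiples as the number of kernels.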
4.1 Analysis of Classification Performance on Benchmark Datasets
We analyze classification performance of CNNs trained using G-SGD on the benchmark Cifar-10, Cifar-100 and Imagenet datasets. In order to construct ensembles of kernels belonging to Euc., Sp, St and Ob using KSS, we increase the number of kernels used in CNNs to 24 and its multiples (see the supp. mat.). We use other hyperparameters of CNNs as suggested in [9, 12, 21]. In the tables, rows for the baseline geometries (Euc., Sp, St and Ob) that carry no citation report the performance of our implementation of CNNs. For computation of , we used
[TABLE]
as suggested in Corollary 3.4. Implementation details are given in the supp. mat.
We examine classification performance of Resnets with 44 layers (Resnet-44) and 18 layers (Resnet-18) on Cifar-10 with data augmentation (DA) and Imagenet in Table I and Table II, respectively. The results show that the performance of CNNs is boosted by employing ensembles of PEMs (denoted by PI, PO and PIO for PEMs) using G-SGD compared to the employment of baseline Euc. We observe that PEMs of component submanifolds of identical geometry (denoted by PEMs of Sp/Ob/St), and their ensembles (denoted by PI, PO, PIO for PEMs of Sp/Ob/St) provide better performance compared to employment of component submanifolds (denoted by Sp/Ob/St) [21]. For instance, we obtain 28.64%, 28.72% and 27.83% error using PIO for PEMs of Sp, Ob and St in Table II, respectively. However, the error obtained using Sp, Ob and St is 28.71%, 28.83% and 28.02%, respectively.
In addition, we obtain 0.28% and 2.06% boost of the performance relative to Euc. (7.05% and 30.31%) by ensemble of the St with Euc. (6.77% and 28.25% using PIO for Euc.+St, respectively) for the experiments on the Cifar-10 and Imagenet datasets using the PIO scheme in Table I and Table II, respectively. Moreover, we observe that construction of ensembles using Ob performs better for PI compared to PO. For instance, we observe that PI for PEMs of Ob provides 6.81% and 28.75% while PO for PEMs of Ob provides 6.83% and 28.81% in Table I and Table II, respectively. We may associate this result with the observation that kernels belonging to Ob are used for feature selection and modeling of texture patterns with high performance [1, 21]. However, ensembles of St and Sp perform better for PO (6.59% and 28.01% in Table I and Table II) compared to PI (6.67% and 28.64% in Table I and Table II) on kernels employed on output channels.
It is also observed that PIO performs better than PI and PO in all the experiments. We observe 1.13% and 3.24% boost by construction of an ensemble of four manifolds (Sp+Ob+St+Euc.) using the PIO scheme in Table I (5.92%) and Table II (27.07%), respectively. In other words, ensemble methods boost the performance of large-scale CNNs more for large-scale datasets (e.g. Imagenet) consisting of a larger number of samples and classes compared to the performance of smaller CNNs employed on smaller datasets (e.g. Cifar-10). This result can be attributed to enhancement of sets of features learned using multiple constraints on kernels.
We analyze this observation by examining the performance of larger CNNs consisting of 110 layers on Cifar-10 and Cifar-100 datasets with and without using DA in Table III. The results show that employment of PEMs can boost the performance of CNNs that use component submanifolds (e.g. PEMs of Sp, Ob and St) more for larger networks (Table III) compared to smaller networks (Table I and Table II). Moreover, employment of PIO for PEMs of Sp+Ob+St+Euc. boosts the performance of CNNs that use Euc. more for Cifar-100 (3.55% boost on average) compared to the performance obtained for Cifar-10 (1.58% boost on average). In addition, we observe that ensembles boost the performance of CNNs that use DA methods more compared to the performance of CNNs without using DA.
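The average boosts can be recomputed from Table III. The sketch below assumes that each average is taken over the four Euc. versus PIO (Sp+Ob+St+Euc.) pairs of a dataset (RCD and RSD, with and without DA); this averaging scheme is an assumption, as it is not spelled out in the text, and `average_boost` is a hypothetical helper.

```python
# Error rates (%) copied from Table III, ordered as
# [RCD with DA, RCD without DA, RSD with DA, RSD without DA].
euc = {
    "cifar10": [6.30, 13.57, 5.17, 11.40],
    "cifar100": [27.01, 44.65, 24.39, 37.55],
}
pio_four = {  # PIO (Sp+Ob+St+Euc.)
    "cifar10": [5.17, 11.15, 4.31, 9.56],
    "cifar100": [23.79, 39.35, 22.03, 34.25],
}


def average_boost(baseline, ensemble):
    """Mean reduction in classification error (percentage points)."""
    return sum(b - e for b, e in zip(baseline, ensemble)) / len(baseline)


boost_c10 = average_boost(euc["cifar10"], pio_four["cifar10"])
boost_c100 = average_boost(euc["cifar100"], pio_four["cifar100"])
```

Under this assumption the Cifar-100 average is about 3.55, in agreement with the quoted figure, while the Cifar-10 average is about 1.56, close to the quoted 1.58; the small difference may stem from a different averaging choice.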
Our method fundamentally differs from network ensembles. In order to analyze the results for network ensembles of CNNs, we employed an ensemble method [9] by voting of decisions of Resnet 44 on Cifar 10. When CNNs trained on individual Euc, Sp, Ob, and St are ensembled using voting, we obtained (Euc+Sp+Ob+St) and (Sp+Ob+St) errors (see Table 1 for comparison). In our analyses of ensembles (PI, PO and PIO), each PEM contains kernels, where is the number of kernels used at the layer, and is the number of PEMs. When each CNN in the ensemble was trained using an individual manifold which contains of kernels (using as utilized in our experiments), then we obtained (Euc), (Sp), (Ob), (St), (Euc+Sp+Ob+St) and (Sp+Ob+St) errors. Thus, our proposed methods outperform ensembles constructed by voting. Additional results are given in the supplemental material.
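For contrast with the proposed kernel-level ensembles, a network ensemble by voting can be sketched as below. This is a generic majority-vote routine under the assumption that each trained CNN outputs one class label per sample; `majority_vote` is a hypothetical name, ties are broken in favor of the lowest class index, and the exact voting rule used in the comparison is not specified here.

```python
import numpy as np


def majority_vote(predictions, num_classes):
    """Combine per-model label predictions of shape (n_models, n_samples) by voting."""
    preds = np.asarray(predictions)
    n_samples = preds.shape[1]
    counts = np.zeros((num_classes, n_samples), dtype=int)
    for p in preds:
        # Each model adds one vote per sample to the class it predicts.
        counts[p, np.arange(n_samples)] += 1
    return counts.argmax(axis=0)


# Three hypothetical models voting on three samples.
votes = majority_vote([[0, 1, 2], [0, 1, 1], [1, 1, 2]], num_classes=3)
```

Here every ensemble member is a separately trained network, whereas PI, PO and PIO place kernels with different constraints inside a single network.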
5 Conclusion and Discussion
We introduced and elucidated a problem of training CNNs using multiple constraints employed on convolution kernels with convergence properties. Following our theoretical results, we proposed the G-SGD algorithm and adaptive step size estimation methods for optimization on ensembles of PEMs that are identified by the constraints. The experimental results show that our proposed methods can improve convergence properties and classification performance of CNNs. Overall, the results show that employment of ensembles of PEMs using G-SGD can boost the performance of larger CNNs (e.g. RCD and RSD) on large scale datasets (e.g. Imagenet) more compared to the performance of small and medium scale networks (e.g. Resnets with 16 and 44 layers) employed on smaller datasets (e.g. Cifar-10).
In future work, we plan to extend the proposed framework by development of new ensemble schemes to perform various tasks such as machine translation and video recognition using CNNs and Recurrent Neural Networks (RNNs). In addition, the proposed methods can be applied to other stochastic optimization methods such as Adam and trust region methods. We believe that our proposed framework will be useful for researchers to study geometric properties of parameter spaces of deep networks, and to improve our understanding of deep feature representations.
References
- [1] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In Proc. 31st IEEE Int. Conf. Acoust., Speech Signal Process., volume 5, pages 945–948, Toulouse, France, May 2006.
- [2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA, 2007.
- [3] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
- [4] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In Proc. of the 33rd Int. Conf. on Mach. Learn. (ICML), pages 1120–1128, New York City, NY, USA, June 2016.
- [5] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In Proc. of the 33rd Int. Conf. on Mach. Learn. (ICML), pages 1168–1176, New York City, NY, USA, June 2016.
- [6] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control, 58(9):2217–2229, Sept 2013.
- [7] M. Cho and J. Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems (NIPS), 2017.
- [8] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 854–863, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
