Doctor of Crosswise: Reducing Over-parametrization in Neural Networks

J. D. Curtó; I. C. Zarza; Kris Kitani; Irwin King; Michael R. Lyu

arXiv:1905.10324·cs.LG·April 20, 2020

TL;DR

This paper introduces a novel neural network architecture called Doctor of Crosswise that aims to reduce over-parametrization by leveraging learned weights for more efficient computation, with detailed formalism and theoretical insights.

Contribution

It presents a new architecture and formal framework to decrease over-parametrization in neural networks, enhancing computational efficiency.

Findings

01. Reduced over-parametrization demonstrated in experiments

02. Theoretical analysis confirms improved efficiency

03. Framework enables faster computation in deep learning models

Abstract

Dr. of Crosswise proposes a new architecture to reduce over-parametrization in Neural Networks. It introduces an operand for rapid computation in the framework of Deep Learning that leverages learned weights. The formalism is described in detail providing both an accurate elucidation of the mechanics and the theoretical implications.

Equations

$$G(x)=\sum_{k=1}^{N}a_{k}\exp\left(-|x-x_{k}|^{2}\right),\qquad x\in\mathbb{R}^{d}.$$

$$y=\max(Wx+b,0)$$

$$y=\max(\hat{W}x+b,0)$$

$$\hat{W}=\begin{pmatrix}W_{1}&0&\dots&0\\0&W_{2}&\ddots&\vdots\\\vdots&\ddots&\ddots&0\\0&\dots&0&W_{n}\end{pmatrix}.$$

$$y=\max(\{c,x\}_{\mathop{\rm c\&z}\nolimits}+b,0)$$

$$\{c,x\}_{\mathop{\rm c\&z}\nolimits}=\begin{pmatrix}c_{1}x_{1}\\c_{2}x_{2}\\\vdots\\c_{n}x_{n}\end{pmatrix}$$

$$(A\otimes B)(C\otimes D)=AC\otimes BD$$

$$(A\otimes B)^{\dagger}=A^{\dagger}\otimes B^{\dagger}$$

$$A\odot B\odot C=(A\odot B)\odot C=A\odot(B\odot C)$$

$$(A\odot B)^{T}(A\odot B)=A^{T}A*B^{T}B$$

$$(A\odot B)^{\dagger}=((A^{T}A)*(B^{T}B))^{\dagger}(A\odot B)^{T}$$

$$k(x,x^{\prime})=\langle\phi(x),\phi(x^{\prime})\rangle_{\mathbb{H}}.$$

$$k(x,x^{\prime})=\int\exp(i\langle w,x\rangle)\exp(-i\langle w,x^{\prime}\rangle)\,d\rho(w).$$

$$\hat{Z}:=\frac{1}{\sigma\sqrt{n}}CHG\Pi HB.$$

$$g(z)=(f*r)(z)=\int_{-\infty}^{\infty}f(\tau)r(z-\tau)\,d\tau.$$

$$g(z)=\lambda f(z)$$

Code & Models

Repositories

curto2/dr_of_crosswise

Taxonomy

Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Computational Physics and Python Applications

Methods: MCKERNEL

Full text

{curto,zarza,king,lyu}@cse.cuhk.edu.hk, [email protected]

decurto.tw dezarza.tw

Teaser figure. Dr. of Crosswise. Visual description of $\mathop{\rm c\&z}\nolimits$.

J. D. Curtó ∗,1,2,3,4, I. C. Zarza ∗,1,2,3,4, K. Kitani 2, I. King 1, and M. R. Lyu 1

1 The Chinese University of Hong Kong. 2 Carnegie Mellon.

3 Eidgenössische Technische Hochschule Zürich. 4 City University of Hong Kong.

∗ Both authors contributed equally.

Key words and phrases: Deep Learning, Over-parametrization.

1. Introduction

Re-thinking how Deep Networks operate at large scale using current techniques is difficult. We build on the work in Curtó et al. [2017b] to propose a fast variant named Dr. of Crosswise (code is available at https://www.github.com/curto2/dr_of_crosswise/). Our approach does not lose the ability to generalize, while it vastly reduces the number of parameters and improves training and testing speed.

Dr. of Crosswise replaces the usual matrix-vector products in Neural Network architectures with a simplified one-dimensional multiplication of tensors. Namely, it establishes a learning framework in which the learned weight matrices are diagonal, so that the number of optimized weights is considerably reduced. We introduce a new operation between vectors to allow for fast computation.

The structure of learning with diagonal matrices follows directly from the McKernel construction in Curtó et al. [2017b], based on Le et al. [2013]; Rahimi and Recht [2008, 2007], which can be understood as a Gaussian network Poggio et al. [2017]; Mhaskar et al. [2017] of the form

$$G(x)=\sum_{k=1}^{N}a_{k}\exp\left(-|x-x_{k}|^{2}\right),\qquad x\in\mathbb{R}^{d}.\tag{1}$$

We extend this idea to a more general framework, unconstrained in the sense that we no longer restrict ourselves to the Gaussian form of Equation 1 but allow any possible non-linearity (for instance ReLU). That is to say, Deep Learning.

2. Deep Learning

Successes in Neural Networks range across all domains, from Natural Language Processing Mikolov et al. [2013]; Pennington et al. [2014]; Devlin et al. [2018] to Computer Vision Goodfellow et al. [2014]; Long et al. [2014]; Ren et al. [2015]; Simonyan and Zisserman [2015]; Girshick [2015]; Radford et al. [2016]; Yu and Koltun [2016]; Salimans et al. [2016]; He et al. [2016, 2017]; Karras et al. [2018]; Curtó et al. [2017a], going through Automatic Structural Learning Cortes et al. [2017] or Data Augmentation Cubuk et al. [2018]. Techniques to improve aspects of current architectures have been widely explored Tremblay et al. [2018]; Acuna et al. [2018]; Coleman et al. [2017]. Significant progress has also been made by combining Deep Learning for extraction of features with Reinforcement Learning Duan et al. [2016]; Levine et al. [2016].

3. Doctor of Crosswise

Deep Learning commonly relies on the following formalism

$$y=\max(Wx+b,0)\tag{2}$$

where matrix $W$ and vector $b$ are the weights and biases learned by assignment of credit Bengio and Frasconi [1993]; Lecun et al. [1998], videlicet backpropagation.

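Dr. of Crosswise replaces matrix-vector products by

$$y=\max(\hat{W}x+b,0)\tag{3}$$

where $\hat{W}$ is a diagonal matrix

$$\hat{W}=\begin{pmatrix}W_{1}&0&\dots&0\\0&W_{2}&\ddots&\vdots\\\vdots&\ddots&\ddots&0\\0&\dots&0&W_{n}\end{pmatrix}.\tag{4}$$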

We can now factorize Equation 3 by the use of a new operand between vectors $\{\cdot,\cdot\}_{\mathop{\rm c\&z}\nolimits}$

$$y=\max(\{c,x\}_{\mathop{\rm c\&z}\nolimits}+b,0)\tag{5}$$

where $c$ is a vector that holds the diagonal of $\hat{W}$, $c=[W_{1}\dots W_{n}]$.

The operand defined by $\{\cdot,\cdot\}_{\mathop{\rm c\&z}\nolimits}$ is similar to a Hadamard product and performs the following operation. Given vectors $c$ and $x$ it computes

$$\{c,x\}_{\mathop{\rm c\&z}\nolimits}=\begin{pmatrix}c_{1}x_{1}\\c_{2}x_{2}\\\vdots\\c_{n}x_{n}\end{pmatrix}\tag{6}$$

where $c$ holds the non-zero elements of a diagonal matrix.

Note that Equation 5 is equivalent to Equation 3. In other words, the new operator $\mathop{\rm c\&z}\nolimits$ helps factorize products between a diagonal matrix and a vector into vector-vector form. All the notation is preserved except for a change in the definition of the product.
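The equivalence between the matrix-vector form and the vector-vector form can be checked on a toy example. The following is a minimal plain-Python sketch, not the authors' implementation; the helper names (`dense_relu_layer`, `cz_product`, `cz_relu_layer`, `diag`) and the numeric values are hypothetical and chosen only for illustration.

```python
def relu(v):
    """Element-wise max(v, 0)."""
    return [max(t, 0.0) for t in v]


def dense_relu_layer(W, x, b):
    """y = max(W x + b, 0) using a full matrix-vector product."""
    return relu([sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
                 for row, b_i in zip(W, b)])


def cz_product(c, x):
    """Component-wise product [c_1 x_1, ..., c_n x_n] of two equal-length vectors."""
    if len(c) != len(x):
        raise ValueError("c and x must have the same length")
    return [c_i * x_i for c_i, x_i in zip(c, x)]


def cz_relu_layer(c, x, b):
    """y = max({c, x}_{c&z} + b, 0) using only the diagonal entries c."""
    return relu([p + b_i for p, b_i in zip(cz_product(c, x), b)])


def diag(c):
    """Build the n x n diagonal matrix whose diagonal is c."""
    n = len(c)
    return [[c[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Illustrative values (assumptions, not taken from the paper).
c = [0.5, -2.0, 3.0]
x = [4.0, 1.0, -1.0]
b = [0.0, 1.0, 0.5]

y_matrix = dense_relu_layer(diag(c), x, b)  # stores n * n = 9 weights
y_vector = cz_relu_layer(c, x, b)           # stores n = 3 weights

print(y_matrix, y_vector)  # both print [2.0, 0.0, 0.0]
```

Both calls return the same activations, but the second never materializes the zeros of the diagonal matrix.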

We can understand this new operation between vectors as a component-wise product that scales each component by a given factor, as depicted in the teaser figure.

Furthermore, the factors that need to be learned reduce from $n\times n$ to $n$ for each given layer, a momentous reduction in learned parameters and computation time. Considering the compositionality of the problem, where architectures are built as a stack of many of these formalisms, Equation 2, the improvements are remarkable.

3.1. Extension to Higher-order

We now consider the setting where we have multiple input data and need to expand them to higher-dimensional spaces from layer to layer, as is normally done in Deep Learning. Several facts have to be taken into account. Given a matrix $W$ of dimension $M\times N$ and an input vector $x$ of dimension $N$, we generate an output vector with size proportional to $N$. $W$ is substituted by a set of diagonal matrices, whose number is chosen so that the total output size is the multiple of $N$ closest to $M$, see Figure 1. In this way, we need a product that operates element-wise between the elements of each diagonal matrix and the input vector. Considering a varying number of input vectors, this mathematical operation is very close to a Khatri-Rao product, which is the matching column-wise Kronecker product Kolda and Bader [2009].

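These products between matrices have useful properties:

$$(A\otimes B)(C\otimes D)=AC\otimes BD$$

$$(A\otimes B)^{\dagger}=A^{\dagger}\otimes B^{\dagger}$$

$$A\odot B\odot C=(A\odot B)\odot C=A\odot(B\odot C)$$

$$(A\odot B)^{T}(A\odot B)=A^{T}A*B^{T}B$$

$$(A\odot B)^{\dagger}=((A^{T}A)*(B^{T}B))^{\dagger}(A\odot B)^{T}$$

where $\odot$ denotes the Khatri-Rao product, $\otimes$ the Kronecker product and $*$ the Hadamard product. $A^{\dagger}$ is the Moore-Penrose pseudo-inverse of $A$.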
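Three of these identities (the mixed-product rule, associativity of the Khatri-Rao product, and the Gram identity) can be checked numerically. The sketch below uses plain Python with small integer matrices so that the comparisons are exact; the helper names and the test matrices are hypothetical, and the two pseudo-inverse identities are left out because they would need a numerical linear algebra routine.

```python
def matmul(A, B):
    """Ordinary matrix product of nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Element-wise product of two matrices of equal shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def kronecker(A, B):
    """Kronecker product: every entry of A scales a full copy of B."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def khatri_rao(A, B):
    """Column-wise Kronecker product; A and B need the same number of columns."""
    if len(A[0]) != len(B[0]):
        raise ValueError("A and B must have the same number of columns")
    return [[a * b for a, b in zip(ra, rb)] for ra in A for rb in B]


# Small integer matrices (illustrative values only).
A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [1, -1, 0]]
C = [[2, 0], [1, 1]]
D = [[1, 0], [0, 2], [1, 1]]
P = [[1, 2], [3, 4]]
Q = [[1, 0], [2, 1], [0, 3]]
R = [[1, 1], [2, -1]]

# (A ⊗ B)(C ⊗ D) = AC ⊗ BD
mixed_product_ok = (matmul(kronecker(A, B), kronecker(C, D))
                    == kronecker(matmul(A, C), matmul(B, D)))

# (P ⊙ Q) ⊙ R = P ⊙ (Q ⊙ R)
associativity_ok = (khatri_rao(khatri_rao(P, Q), R)
                    == khatri_rao(P, khatri_rao(Q, R)))

# (P ⊙ Q)^T (P ⊙ Q) = (P^T P) * (Q^T Q)
PQ = khatri_rao(P, Q)
gram_ok = (matmul(transpose(PQ), PQ)
           == hadamard(matmul(transpose(P), P), matmul(transpose(Q), Q)))

print(mixed_product_ok, associativity_ok, gram_ok)
```

The Gram identity is the one that matters computationally: it lets the small $K\times K$ matrix on the right-hand side stand in for a product involving the much taller Khatri-Rao matrix.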

Take special note of the fact that, for example, to transform an input vector of size $4$ into an output vector of size $8$, the standard formulation of Neural Networks requires learning $32$ weights. With Dr. of Crosswise, the number of learned parameters for the same input and output sizes reduces to $8$. More importantly, the number of multiplications and additions is also drastically lower.
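The counts in this example can be reproduced with a short sketch. It assumes one plausible reading of the construction, in which each diagonal matrix is stored as a vector and the per-block outputs are concatenated; the function name `crosswise_expand` and the weight values are hypothetical, and biases are ignored in the operation counts.

```python
def crosswise_expand(blocks, x):
    """Apply each diagonal block (stored as its diagonal) to x and concatenate.

    With K blocks of length N the output has size K * N.
    """
    out = []
    for c in blocks:
        if len(c) != len(x):
            raise ValueError("every block must have the same length as x")
        out.extend(c_i * x_i for c_i, x_i in zip(c, x))
    return out


n_in, n_out = 4, 8
x = [1.0, 2.0, 3.0, 4.0]

# Two diagonal blocks of length 4 give an output of size 8 (illustrative weights).
blocks = [
    [1.0, 0.5, -1.0, 2.0],
    [0.0, 1.0, 1.0, -0.5],
]
y = crosswise_expand(blocks, x)

# Dense layer: an n_out x n_in matrix.
dense_params = n_out * n_in          # 32 weights
dense_mults = n_out * n_in           # 32 multiplications
dense_adds = n_out * (n_in - 1)      # 24 additions

# Crosswise layer: one weight and one multiplication per output component.
crosswise_params = sum(len(c) for c in blocks)  # 8 weights
crosswise_mults = crosswise_params              # 8 multiplications
crosswise_adds = 0                              # no accumulation needed

print(y)  # [1.0, 1.0, -3.0, 8.0, 0.0, 2.0, 3.0, -2.0]
print(dense_params, crosswise_params)  # 32 8
```

Note that the two layers are not interchangeable: each output component of the crosswise layer depends on a single input component, which is exactly where the savings come from.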

4. Rationale

We start with an exordium on Kernel Methods Cortes and Vapnik [1995]; Vapnik and Izmailov [2018]. Let $X$ be a measure space with $k:L^{2}(X\times X)\to\mathbb{R}$. We name $k$ a kernel if, and only if, there is some feature map $\phi:X\to\mathbb{H}$ into a separable Hilbert space $\mathbb{H}$ such that

$$k(x,x^{\prime})=\langle\phi(x),\phi(x^{\prime})\rangle_{\mathbb{H}}.$$

In other words, $k$ is a kernel if, and only if, for some space $\mathbb{H}$ and map $\phi$ the following diagram commutes:

$$X\times X\ \xrightarrow{\ \phi\ }\ \mathbb{H}\times\mathbb{H}\ \xrightarrow{\ \langle\cdot,\cdot\rangle_{\mathbb{H}}\ }\ \mathbb{R},\qquad X\times X\ \xrightarrow{\ k\ }\ \mathbb{R}.$$

Random Kitchen Sinks Rahimi and Recht [2007] approximate this feature mapping $\phi$ by a Fourier expansion in the case of RBF kernels

$$k(x,x^{\prime})=\int\exp(i\langle w,x\rangle)\exp(-i\langle w,x^{\prime}\rangle)\,d\rho(w).$$
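The Fourier expansion can be approximated by Monte Carlo sampling of the frequencies $w$. The sketch below is a plain-Python illustration under stated assumptions: it uses the real-valued cosine variant of random Fourier features with a random phase, a Gaussian kernel $\exp(-|x-x^{\prime}|^{2}/(2\sigma^{2}))$ whose spectral measure $\rho$ is Gaussian with standard deviation $1/\sigma$, and hypothetical helper names, sample points and feature count.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-|x - y|^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sample_frequencies(dim, num_features, sigma, rng):
    """Draw frequencies w ~ N(0, I / sigma^2) and phases b ~ U[0, 2 pi]."""
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return ws, bs


def feature_map(x, ws, bs):
    """Random feature vector z(x) with E[<z(x), z(y)>] = k(x, y)."""
    scale = math.sqrt(2.0 / len(ws))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(ws, bs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 1.0
ws, bs = sample_frequencies(dim=3, num_features=4000, sigma=sigma, rng=rng)

x1 = [0.2, -0.1, 0.4]
x2 = [0.0, 0.3, 0.1]

exact = rbf_kernel(x1, x2, sigma)
approx = sum(a * b for a, b in zip(feature_map(x1, ws, bs),
                                   feature_map(x2, ws, bs)))

print(exact, approx)  # the two values should be close
```

The approximation error shrinks roughly as one over the square root of the number of features, which is why a fast way of applying the frequency matrix $W$ matters.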

Le et al. [2013]; Yang et al. [2014] present a fast approximation of the matrix $W$ . Recent works on the matter Wu et al. [2016]; Yang et al. [2015]; Moczulski et al. [2016]; Hong et al. [2017] use this technique to extract meaningful features.

Dr. of Crosswise is motivated by the McKernel construction in Curtó et al. [2017b], where Deep Learning and Kernel Methods are unified by extending the use of $\hat{Z}$, the approximation of $W$, to an SGD optimization setting

$$\hat{Z}:=\frac{1}{\sigma\sqrt{n}}CHG\Pi HB.$$

Here $C$, $G$ and $B$ are diagonal matrices, $\Pi$ is a random permutation matrix and $H$ is the Hadamard matrix. Whenever the number of rows in $W$ exceeds the dimensionality of the data, simply generate multiple instances of $\hat{Z}$, drawn i.i.d., until the required number of dimensions is obtained.
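Because every factor of $\hat{Z}$ is diagonal, a permutation or a Hadamard matrix, the product $\hat{Z}x$ never requires a dense matrix. The following plain-Python sketch applies the factors one at a time with a fast Walsh-Hadamard transform and compares against a dense reference. It is an illustration under assumptions, not the McKernel code: the scale factor $1/(\sigma\sqrt{n})$ and the choice of $B$ with random signs and $G$ with Gaussian entries follow the Fastfood recipe of Le et al. [2013], the values drawn for $C$ are placeholders, and all function names are hypothetical.

```python
import math
import random


def fwht(v):
    """Unnormalized Walsh-Hadamard transform H v in O(n log n); n must be a power of two."""
    out = list(v)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def hadamard_matvec_dense(v):
    """Reference O(n^2) product with H[i][j] = (-1)^popcount(i & j)."""
    n = len(v)
    return [sum(-v[j] if bin(i & j).count("1") % 2 else v[j] for j in range(n))
            for i in range(n)]


def apply_z_hat(x, C, G, perm, B, sigma, transform=fwht):
    """Compute (1 / (sigma sqrt(n))) C H G Pi H B x, one factor at a time."""
    n = len(x)
    v = [b_i * x_i for b_i, x_i in zip(B, x)]   # B x
    v = transform(v)                            # H B x
    v = [v[p] for p in perm]                    # Pi H B x
    v = [g_i * v_i for g_i, v_i in zip(G, v)]   # G Pi H B x
    v = transform(v)                            # H G Pi H B x
    scale = 1.0 / (sigma * math.sqrt(n))
    return [scale * c_i * v_i for c_i, v_i in zip(C, v)]


rng = random.Random(0)
n, sigma = 8, 1.0
B = [rng.choice((-1.0, 1.0)) for _ in range(n)]   # random signs
G = [rng.gauss(0.0, 1.0) for _ in range(n)]       # Gaussian diagonal
C = [rng.uniform(0.5, 1.5) for _ in range(n)]     # placeholder scaling
perm = list(range(n))
rng.shuffle(perm)                                 # random permutation Pi
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

fast = apply_z_hat(x, C, G, perm, B, sigma)
slow = apply_z_hat(x, C, G, perm, B, sigma, transform=hadamard_matvec_dense)
max_gap = max(abs(a - b) for a, b in zip(fast, slow))

print(max_gap)  # agreement up to floating-point round-off
```

Only the three diagonals and the permutation need to be stored, so memory is linear in $n$ and the cost of one product is dominated by the two transforms.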

The key idea behind it is that the FOURIER transform diagonalizes the integral operator.

This can best be seen as follows. Consider the convolution operator acting on the complex exponential $f(z)=\exp(ikz)$

$$g(z)=(f*r)(z)=\int_{-\infty}^{\infty}f(\tau)r(z-\tau)\,d\tau.$$

Then

$$g(z)=\lambda f(z)$$

where $\lambda=R(k)$ and $R$ is the Fourier transform of $r$.
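The intermediate steps can be written out explicitly. The short derivation below assumes the non-unitary convention $R(k)=\int_{-\infty}^{\infty}r(u)e^{-iku}\,du$ for the Fourier transform (the text does not state which convention it uses) and that $r$ is integrable.

```latex
\begin{aligned}
g(z) &= \int_{-\infty}^{\infty} e^{ik\tau}\, r(z-\tau)\, d\tau
      && \text{insert } f(\tau) = e^{ik\tau} \\
     &= \int_{-\infty}^{\infty} e^{ik(z-u)}\, r(u)\, du
      && \text{substitute } u = z - \tau \\
     &= e^{ikz} \int_{-\infty}^{\infty} r(u)\, e^{-iku}\, du \\
     &= R(k)\, f(z) = \lambda f(z).
\end{aligned}
```

Every complex exponential is therefore an eigenfunction of the convolution operator, with eigenvalue given by the Fourier transform of $r$ at the corresponding frequency. In the Fourier basis the operator acts as a multiplication, that is, as a diagonal matrix.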

We build on these ideas to pioneer a framework of Deep Learning whose formalism is built on diagonal matrices.

Bibliography

Acuna et al. [2018] D. Acuna, H. Ling, A. Kar, and S. Fidler. 2018. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. CVPR (2018).
Bengio and Frasconi [1993] Y. Bengio and P. Frasconi. 1993. Credit Assignment through Time: Alternatives to Backpropagation. NIPS (1993).
Coleman et al. [2017] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia. 2017. DAWNBench: An End-to-end Deep Learning Benchmark and Competition. NIPS (2017).
Cortes et al. [2017] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. 2017. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. ICML (2017).
Cortes and Vapnik [1995] C. Cortes and V. Vapnik. 1995. Support Vector Networks. Machine Learning (1995).
Cubuk et al. [2018] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. 2018. AutoAugment: Learning Augmentation Policies from Data. arXiv:1805.09501 (2018).
Curtó et al. [2017a] J. D. Curtó, I. C. Zarza, F. Torre, I. King, and M. R. Lyu. 2017a. High-resolution Deep Convolutional Generative Adversarial Networks. arXiv:1711.06491 (2017).