Doctor of Crosswise: Reducing Over-parametrization in Neural Networks
J. D. Curtó, I. C. Zarza, Kris Kitani, Irwin King, and Michael R. Lyu

TL;DR
This paper introduces a novel neural network architecture called Doctor of Crosswise that aims to reduce over-parametrization by leveraging learned weights for more efficient computation, with detailed formalism and theoretical insights.
Contribution
It presents a new architecture and formal framework to decrease over-parametrization in neural networks, enhancing computational efficiency.
Findings
Reduced over-parametrization demonstrated in experiments
Theoretical analysis confirms improved efficiency
Framework enables faster computation in deep learning models
Abstract
Dr. of Crosswise proposes a new architecture to reduce over-parametrization in Neural Networks. It introduces an operand for rapid computation in the framework of Deep Learning that leverages learned weights. The formalism is described in detail providing both an accurate elucidation of the mechanics and the theoretical implications.
{curto,zarza,king,lyu}@cse.cuhk.edu.hk, [email protected]
Teaser figure. Dr. of Crosswise. Visual description of the new operation between vectors.
J. D. Curtó*,1,2,3,4, I. C. Zarza*,1,2,3,4, K. Kitani2, I. King1, and M. R. Lyu1
1The Chinese University of Hong Kong. 2Carnegie Mellon.
3Eidgenössische Technische Hochschule Zürich. 4City University of Hong Kong.
*Both authors contributed equally.
Key words and phrases:
Deep Learning, Over-parametrization.
CCS Concepts: Computing methodologies → Neural Networks.
1. Introduction
Re-thinking how Deep Networks operate at large scale using current techniques is difficult. We build on the work in Curtó et al. [2017b] to propose a fast variant named Dr. of Crosswise (code is available at https://www.github.com/curto2/dr_of_crosswise/). Our approach does not lose the ability to generalize, while it vastly reduces the number of parameters and improves training and testing speed.
Dr. of Crosswise replaces the usual matrix-vector products in Neural Network architectures with a simplified one-dimensional multiplication of tensors. Namely, it establishes a learning framework in which the learned weight matrices are diagonal, so that the number of optimized weights is considerably reduced. We introduce a new operation between vectors to allow for fast computation.
The structure of learning with diagonal matrices follows directly from the McKernel construction in Curtó et al. [2017b], based on Le et al. [2013]; Rahimi and Recht [2008, 2007], which can be understood as a GAUSSIAN network Poggio et al. [2017]; Mhaskar et al. [2017] of the form
[Equation (1)]
We extend this idea to a more general framework, unconstrained in the sense that we no longer consider a Gaussian form, Equation 1, but any possible non-linearity (for instance ReLU). That is to say, Deep Learning.
2. Deep Learning
Successes in Neural Networks range across all domains, from Natural Language Processing Mikolov et al. [2013]; Pennington et al. [2014]; Devlin et al. [2018] to Computer Vision Goodfellow et al. [2014]; Long et al. [2014]; Ren et al. [2015]; Simonyan and Zisserman [2015]; Girshick [2015]; Radford et al. [2016]; Yu and Koltun [2016]; Salimans et al. [2016]; He et al. [2016, 2017]; Karras et al. [2018]; Curtó et al. [2017a], as well as Automatic Structural Learning Cortes et al. [2017] and Data Augmentation Cubuk et al. [2018]. Techniques to improve aspects of current architectures have been widely explored Tremblay et al. [2018]; Acuna et al. [2018]; Coleman et al. [2017]. Significant progress has also been made by combining Deep Learning for feature extraction with Reinforcement Learning Duan et al. [2016]; Levine et al. [2016].
3. Doctor of Crosswise
The standard approach in Deep Learning involves the following formalism
[Equation (2)]
where the matrix and the vector are the weights and biases learned by credit assignment Bengio and Frasconi [1993]; Lecun et al. [1998], namely backpropagation.
Dr. of Crosswise replaces the matrix-vector products by
[Equation (3)]
where the weight matrix has diagonal form
[Equation (4)]
We can now factorize Equation 3 by the use of a new operand between vectors
[Equation (5)]
where the weight vector holds the diagonal of the matrix in Equation 4.
The new operand is similar to a HADAMARD product and does the following operation. Given two vectors, it computes
[Equation (6)]
where the first vector holds the non-zero elements of a diagonal matrix.
Note that Equation 5 is equivalent to Equation 3. In other words, the new operator factorizes products between a diagonal matrix and a vector into vector-vector form. All the notation is preserved except for a change in the definition of the product.
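The equivalence between the diagonal matrix-vector product and the vector-vector form can be checked numerically. The following is a minimal sketch, not the authors' implementation: the names `d` (vector of diagonal weights), `x` (input), `b` (bias) and the functions `crosswise` and `diagonal_layer` are illustrative assumptions, and the layer is assumed to have the affine form "product plus bias".

```python
import numpy as np


def crosswise(d, x):
    """Component-wise product: scales each component of x by the matching entry of d.

    Hypothetical helper name; d plays the role of the diagonal of the weight matrix.
    """
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    if d.shape != x.shape:
        raise ValueError("d and x must have the same shape")
    return d * x


def diagonal_layer(d, x, b):
    """Affine layer with a diagonal weight matrix, written in vector-vector form."""
    return crosswise(d, x) + np.asarray(b, dtype=float)


rng = np.random.default_rng(0)
n = 5
d = rng.normal(size=n)  # diagonal weights (assumed notation)
x = rng.normal(size=n)  # input vector
b = rng.normal(size=n)  # bias vector

# Reference computation: build the full diagonal matrix and multiply.
dense = np.diag(d) @ x + b
fast = diagonal_layer(d, x, b)
print(np.allclose(dense, fast))
```

The dense version stores and multiplies an n-by-n matrix that is mostly zeros, while the vector form touches only the n entries that matter.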
We can understand this new operation between vectors as a component-wise product that scales each component by a given factor, as illustrated in the teaser figure.
Furthermore, the factors that need to be learned reduce from quadratic to linear in the layer dimension for each given layer, a momentous reduction in learned parameters and computation time. Considering the compositionality of the problem, where architectures are built as a stack of many of these formalisms, Equation 2, the improvements are considerable.
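To make the reduction concrete, the parameter counts of a stack of square layers can be compared directly. This is a back-of-the-envelope sketch under stated assumptions: every layer maps a vector of size `width` to a vector of the same size, each layer has a bias, and the function names are hypothetical.

```python
def dense_layer_params(width, bias=True):
    """Learned parameters of a square dense layer: full weight matrix plus optional bias."""
    return width * width + (width if bias else 0)


def diagonal_layer_params(width, bias=True):
    """Learned parameters of a square diagonal layer: one weight per component plus optional bias."""
    return width + (width if bias else 0)


def stack_params(width, depth, per_layer):
    """Total parameters of `depth` identical layers stacked on top of each other."""
    return depth * per_layer(width)


width, depth = 1024, 10
dense_total = stack_params(width, depth, dense_layer_params)
diagonal_total = stack_params(width, depth, diagonal_layer_params)
print(dense_total)                    # 10496000
print(diagonal_total)                 # 20480
print(dense_total / diagonal_total)   # 512.5
```

Because the saving applies to every layer, it grows linearly with depth and quadratically-versus-linearly with width.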
3.1. Extension to Higher-order
Consider now the setting where we have multiple input data and need to expand them to higher-dimensional spaces from layer to layer, as is normally done in Deep Learning. Several facts have to be taken into account. Given a matrix of dimension and input vector of dimension we will generate an output vector with size proportional to . will be substituted by a set of diagonal matrices whose cardinality will be the closest multiple of to , see Figure 1. In this way, we need a product that operates element-wise between the elements of each diagonal matrix and the input vector. Considering a varying number of input vectors, this mathematical operation is very close to a KHATRI RAO product, which is the matching columnwise KRONECKER product Kolda and Bader [2009].
These products between matrices have useful properties:
[Equation]
where the products involved are the KHATRI RAO product, the KRONECKER product and the HADAMARD product, together with the MOORE PENROSE pseudo-inverse.
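As an illustration of properties of this kind, the following sketch numerically checks three identities listed in Kolda and Bader [2009]: $(A \otimes B)(C \otimes D) = AC \otimes BD$, $(A \odot B)^{\mathsf T}(A \odot B) = A^{\mathsf T}A \ast B^{\mathsf T}B$ and $(A \odot B)^{\dagger} = ((A^{\mathsf T}A) \ast (B^{\mathsf T}B))^{\dagger}(A \odot B)^{\mathsf T}$, with $\odot$ the KHATRI RAO, $\otimes$ the KRONECKER and $\ast$ the HADAMARD product. Whether these are exactly the properties intended here is an assumption; the symbols and the helper `khatri_rao` are illustrative.

```python
import numpy as np


def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K), giving an (I*J) x K matrix."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.ndim != 2 or B.ndim != 2 or A.shape[1] != B.shape[1]:
        raise ValueError("A and B must be matrices with the same number of columns")
    # Entry [i, j, k] is A[i, k] * B[j, k]; merging (i, j) gives kron of the k-th columns.
    return np.einsum("ik,jk->ijk", A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])


rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(5, 4))
C = rng.normal(size=(4, 2))
D = rng.normal(size=(4, 6))
KR = khatri_rao(A, B)

# Mixed-product property of the Kronecker product.
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# Gram matrix of a Khatri-Rao product is a Hadamard product of Gram matrices.
gram = (A.T @ A) * (B.T @ B)
print(np.allclose(KR.T @ KR, gram))

# Pseudo-inverse of a Khatri-Rao product through the small K x K Gram matrix.
print(np.allclose(np.linalg.pinv(KR), np.linalg.pinv(gram) @ KR.T))
```

The last identity is the practically useful one: the pseudo-inverse only requires inverting a matrix whose size is the number of columns, not the much larger number of rows.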
Note, for example, that to transform an input vector of size , to an output vector of size , the standard formulation of Neural Networks has to learn weights. With Dr. of Crosswise, the number of learned parameters needed to do exactly the same operation reduces to . More importantly, the number of multiplications and additions is also drastically lower.
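One way to read this expansion is sketched below. It is an interpretation rather than the authors' code: the input of size `n` is scaled by each of `k` diagonal matrices and the results are concatenated, so the output has size `k * n`. The function name `crosswise_expand`, the array `D` (one row per diagonal) and the concatenation order are assumptions made for illustration.

```python
import numpy as np


def crosswise_expand(D, x):
    """Expand x (size n) to size k*n using k diagonal matrices stored as the rows of D."""
    D = np.asarray(D, dtype=float)
    x = np.asarray(x, dtype=float)
    if D.ndim != 2 or x.shape != (D.shape[1],):
        raise ValueError("D must have shape (k, n) and x must have shape (n,)")
    # Broadcasting multiplies every row of D with x component-wise;
    # flattening concatenates the k scaled copies into one vector.
    return (D * x).reshape(-1)


rng = np.random.default_rng(0)
n, k = 4, 3
D = rng.normal(size=(k, n))
x = rng.normal(size=n)

# Dense reference: stack the k diagonal matrices into one (k*n, n) weight matrix.
W = np.vstack([np.diag(row) for row in D])
print(np.allclose(W @ x, crosswise_expand(D, x)))
print("dense weights:", W.size, "crosswise weights:", D.size)
```

Under this reading, a dense layer with the same input and output sizes stores `k * n * n` weights, while the stacked diagonals store `k * n`, and the multiplication count drops by the same factor.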
4. Rationale
We start with an exordium on Kernel Methods Cortes and Vapnik [1995]; Vapnik and Izmailov [2018]. Let be a measure space with . We name a kernel if, and only if, there is some feature map into a separable HILBERT space such that
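$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathbb{H}}.$$
In other words, $k$ is a kernel if, and only if, for some space $\mathbb{H}$ and map $\phi$ the following diagram commutes:
$$
\begin{array}{ccc}
X \times X & \xrightarrow{\ \phi\ } & \mathbb{H} \times \mathbb{H} \\
 & \searrow{\scriptstyle k} & \downarrow{\scriptstyle \langle\cdot,\cdot\rangle_{\mathbb{H}}} \\
 & & \mathbb{R}.
\end{array}
$$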
Random Kitchen Sinks Rahimi and Recht [2007] approximate this feature mapping by a FOURIER expansion in the case of RBF kernels
[Equation]
Le et al. [2013]; Yang et al. [2014] present a fast approximation of the matrix in this expansion. Recent works on the matter Wu et al. [2016]; Yang et al. [2015]; Moczulski et al. [2016]; Hong et al. [2017] use this technique to extract meaningful features.
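The idea of approximating an RBF kernel with random FOURIER features can be sketched as follows. This uses the real-valued cosine variant of Rahimi and Recht [2007] with a dense Gaussian matrix, not the fast approximation of Le et al. [2013]; the names `random_fourier_features`, `Z`, `phase` and the bandwidth parameter `sigma` are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(x, y, sigma):
    """Exact Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))


def random_fourier_features(X, n_features, sigma, rng):
    """Map the rows of X to random features whose inner products approximate the RBF kernel."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the kernel: N(0, I / sigma^2).
    Z = rng.normal(scale=1.0 / sigma, size=(n_features, d))
    # Random phases make the real cosine features unbiased.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ Z.T + phase)


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
sigma = 1.5
features = random_fourier_features(X, n_features=20000, sigma=sigma, rng=rng)

exact = rbf_kernel(X[0], X[1], sigma)
approx = features[0] @ features[1]
print(abs(exact - approx) < 0.05)
```

The dense matrix `Z` costs one multiplication per feature and input dimension; this is the matrix that the fast structured approximations replace.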
Dr. of Crosswise is motivated by the McKernel construction in Curtó et al. [2017b], where Deep Learning and Kernel Methods are unified by extending the use of this approximation to an SGD optimization setting
[Equation]
Here the factors are diagonal matrices, a random permutation matrix and the Hadamard matrix. Whenever the number of rows in the approximated matrix exceeds the dimensionality of the data, simply generate multiple instances of the approximation, drawn i.i.d., until the required number of dimensions is obtained.
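A structured product of this kind can be sketched by following the Fastfood factorization of Le et al. [2013], $\frac{1}{\sigma\sqrt{n}}\, S H G \Pi H B$, where $S$, $G$, $B$ are diagonal, $\Pi$ is a permutation and $H$ is the Hadamard matrix. That the equation above has exactly this form is an assumption, and the names `structured_product`, `s`, `g`, `b`, `perm` are illustrative. For clarity the Hadamard matrix is applied as a dense product; a fast Walsh-Hadamard transform would be used in practice.

```python
import numpy as np


def hadamard_matrix(n):
    """Unnormalized Hadamard matrix of size n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    block = np.array([[1.0, 1.0], [1.0, -1.0]])
    while H.shape[0] < n:
        H = np.kron(block, H)
    return H


def structured_product(x, s, g, perm, b, sigma):
    """Apply (1 / (sigma sqrt(n))) S H G Pi H B to x, storing S, G, B as vectors."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    H = hadamard_matrix(n)
    v = b * x      # B x: component-wise scaling
    v = H @ v      # H B x
    v = v[perm]    # Pi H B x: permutation by indexing
    v = g * v      # G Pi H B x
    v = H @ v      # H G Pi H B x
    v = s * v      # S H G Pi H B x
    return v / (sigma * np.sqrt(n))


rng = np.random.default_rng(0)
n, sigma = 8, 1.0
x = rng.normal(size=n)
b = rng.choice([-1.0, 1.0], size=n)   # random signs
g = rng.normal(size=n)                # Gaussian diagonal
s = rng.uniform(0.5, 1.5, size=n)     # scaling diagonal (placeholder values)
perm = rng.permutation(n)

# Dense reference built from explicit matrices.
H = hadamard_matrix(n)
P = np.eye(n)[perm]
dense = np.diag(s) @ H @ np.diag(g) @ P @ H @ np.diag(b) / (sigma * np.sqrt(n))
print(np.allclose(dense @ x, structured_product(x, s, g, perm, b, sigma)))
```

Only three vectors and a permutation need to be stored, which is the property that carries over to learning with diagonal matrices.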
The key idea behind it is that the FOURIER transform diagonalizes the integral operator.
This can be best seen as follows. Consider the convolution operator acting on the complex exponential
[Equation]
Then
[Equation]
where the multiplying factor is the FOURIER transform of the convolution kernel.
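The discrete analogue of this statement is that complex exponentials are eigenvectors of circular convolution, with eigenvalues given by the discrete FOURIER transform of the convolution kernel. The sketch below checks this numerically; the names `h` (kernel), `e_k` (complex exponential of frequency index `k`) and `circular_convolve` are illustrative assumptions, and the discrete setting stands in for the continuous operator discussed above.

```python
import numpy as np


def circular_convolve(h, x):
    """Direct circular convolution: y[i] = sum_j h[(i - j) mod n] * x[j]."""
    h = np.asarray(h)
    x = np.asarray(x)
    n = len(h)
    if len(x) != n:
        raise ValueError("h and x must have the same length")
    return np.array([sum(h[(i - j) % n] * x[j] for j in range(n)) for i in range(n)])


rng = np.random.default_rng(0)
n = 8
h = rng.normal(size=n)
h_hat = np.fft.fft(h)  # discrete Fourier transform of the kernel

t = np.arange(n)
for k in range(n):
    e_k = np.exp(2j * np.pi * k * t / n)  # complex exponential of frequency index k
    # Convolving with h only rescales e_k by the k-th Fourier coefficient of h.
    assert np.allclose(circular_convolve(h, e_k), h_hat[k] * e_k)

# Consequence: convolution becomes a component-wise product in the Fourier domain.
x = rng.normal(size=n)
via_fft = np.fft.ifft(h_hat * np.fft.fft(x)).real
print(np.allclose(circular_convolve(h, x), via_fft))
```

In the FOURIER basis the operator acts as a diagonal matrix, which is the same structure that Dr. of Crosswise imposes on the learned weights.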
We build on these ideas to pioneer a framework of Deep Learning where the formalism relies on diagonal matrices.
References
- Acuna et al. [2018] D. Acuna, H. Ling, A. Kar, and S. Fidler. 2018. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. CVPR (2018).
- Bengio and Frasconi [1993] Y. Bengio and P. Frasconi. 1993. Credit Assignment through Time: Alternatives to Backpropagation. NIPS (1993).
- Coleman et al. [2017] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia. 2017. DAWNBench: An End-to-end Deep Learning Benchmark and Competition. NIPS (2017).
- Cortes et al. [2017] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. 2017. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. ICML (2017).
- Cortes and Vapnik [1995] C. Cortes and V. Vapnik. 1995. Support-Vector Networks. Machine Learning (1995).
- Cubuk et al. [2018] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. 2018. AutoAugment: Learning Augmentation Policies from Data. arXiv:1805.09501 (2018).
- Curtó et al. [2017a] J. D. Curtó, I. C. Zarza, F. Torre, I. King, and M. R. Lyu. 2017a. High-resolution Deep Convolutional Generative Adversarial Networks. arXiv:1711.06491 (2017).
