Maestro: Uncovering Low-Rank Structures via Trainable Decomposition

Samuel Horvath; Stefanos Laskaridis; Shashank Rajput; Hongyi Wang

arXiv:2308.14929 · cs.LG · June 17, 2024 · 1 citation

TL;DR

Maestro introduces a trainable low-rank decomposition framework for DNNs that integrates low-rank structures into training, enabling efficient model compression and flexible accuracy-latency trade-offs without heavy iterative decompositions.

Contribution

It proposes LoD, a novel importance sampling-based low-rank ordering method, allowing layer-wise rank selection and unifying low-rank approximation with training.
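
As a rough illustration of what integrating a low-rank structure into training can look like, the sketch below factorizes a layer's weight as `U @ V.T` and evaluates it with only the first `rank` rank-1 components, with the rank drawn anew at each step in the spirit of Ordered Dropout. The function names, the uniform sampling distribution, and the NumPy setting are assumptions made for illustration; the official repository is PyTorch-based and its actual interface is not reproduced here.

```python
import numpy as np


def sample_rank(max_rank, rng):
    """Draw a rank from {1, ..., max_rank}; uniform sampling is an assumption."""
    return int(rng.integers(1, max_rank + 1))


def low_rank_forward(x, U, V, rank):
    """Apply a factorized linear layer using only its first `rank` components.

    U has shape (out_dim, max_rank) and V has shape (in_dim, max_rank),
    so the effective weight is U[:, :rank] @ V[:, :rank].T.
    """
    return x @ V[:, :rank] @ U[:, :rank].T


rng = np.random.default_rng(0)
in_dim, out_dim, max_rank = 8, 6, 4
U = rng.standard_normal((out_dim, max_rank))
V = rng.standard_normal((in_dim, max_rank))
x = rng.standard_normal((5, in_dim))

r = sample_rank(max_rank, rng)       # rank used for this (hypothetical) training step
y = low_rank_forward(x, U, V, r)     # output computed from the leading r components only
```

Because every sampled rank keeps a prefix of the same columns, the leading components are used most often during training, which is what induces an ordering by importance in this kind of scheme.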

Findings

01. Enables extraction of lower-footprint models with preserved performance
02. Allows accuracy-latency trade-offs without retraining
03. Recovers SVD and PCA in special cases
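
For intuition about the SVD special case and the rank-based trade-off, the following sketch shows the classical property that an ordered decomposition is meant to mirror: truncating an SVD at rank r gives nested approximations whose error shrinks as r grows. This is plain NumPy SVD on a random matrix, not the paper's training procedure; `truncated_svd` is a helper name introduced here.

```python
import numpy as np


def truncated_svd(W, rank):
    """Rank-`rank` approximation of W built from its leading singular triplets."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
s = np.linalg.svd(W, compute_uv=False)

# Frobenius-norm error for each rank from 1 to 6 (the full rank of W here).
errors = [np.linalg.norm(W - truncated_svd(W, r)) for r in range(1, 7)]

# Eckart-Young: the rank-r error equals the norm of the discarded singular values.
expected = [np.sqrt(np.sum(s[r:] ** 2)) for r in range(1, 7)]
```

Picking a smaller rank at deployment time trades approximation error for fewer multiply-adds, and no refitting is needed because each lower-rank model is a prefix of the higher-rank one.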

Abstract

Deep Neural Networks (DNNs) have been a large driver for AI breakthroughs in recent years. However, these models have been getting increasingly large as they become more accurate and safe. This means that their training becomes increasingly costly and time-consuming and typically yields a single model to fit all targets. Various techniques have been proposed in the literature to mitigate this, including pruning, sparsification, or quantization of model weights and updates. While achieving high compression rates, they often incur significant computational overheads at training or lead to a non-negligible accuracy penalty. Alternatively, factorization methods have been leveraged for low-rank compression of DNNs. Similarly, such techniques (e.g., SVD) frequently rely on heavy iterative decompositions of layers and are potentially sub-optimal for non-linear models, such as DNNs. We take a…

Peer Reviews

Decision: ICML 2024 Poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- The paper extends the Ordered Dropout technique to handle non-uniformity in the search space by allowing different ranks per layer.
- It introduces a trainable aspect to the decomposition, which enables the model to reflect the data distribution.
- It provides a latency-accuracy trade-off mechanism for deploying the network on constrained devices.

Weaknesses

- The citation style does not seem correct. It should include the authors' names in place of numerical references.
- Why is the method named "Maestro"? The name is never introduced and seems odd.
- The proposed technique appears to be a logical improvement on Ordered Dropout. Its effectiveness, however, is primarily demonstrated through toy architectures and datasets, such as ResNet18 and CIFAR-10. For the method to gain practical and impactful validation, I recommend conducting additional expe

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

Unlike unstructured pruning methods, low-rank compression can preserve the dense structure of matrices, which can extract more performance from GPUs. For the training of transformers on the Multi30k dataset shown in Table 3, the proposed method is able to reduce the number of parameters by more than half compared to the baseline (Pufferfish), while also reducing the perplexity.

Weaknesses

Low-rank compression and Lasso have been around for a very long time, and the only novelty seems to be the use of ordered dropout. The improvement over existing methods is marginal for the experiments with CNNs. The proposed method is obviously very sensitive to the choice of the Lasso coefficient lambda, but there is no theory behind how it can be chosen effectively.
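
To make the sensitivity concern concrete, here is a minimal sketch of how a group-lasso coefficient can determine how many rank-1 components survive. It uses the standard group soft-thresholding (proximal) step with one group per factor column; treating columns as groups, the helper names, and the toy matrix are all assumptions, and this is not claimed to be the regularizer or pruning rule used in the paper.

```python
import numpy as np


def group_soft_threshold(M, lam):
    """Proximal step for a group-lasso penalty with one group per column of M."""
    norms = np.linalg.norm(M, axis=0)
    # Shrink each column's norm by lam; columns with norm <= lam become zero.
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return M * scale


def effective_rank(M, tol=1e-8):
    """Count the columns whose norm is still above `tol`."""
    return int(np.sum(np.linalg.norm(M, axis=0) > tol))


# Toy factor whose four columns have norms 4, 2, 1 and 0.5.
U = np.diag([4.0, 2.0, 1.0, 0.5])

lams = [0.1, 0.75, 3.0]
ranks = [effective_rank(group_soft_threshold(U, lam)) for lam in lams]
# ranks == [4, 3, 1]: a larger lambda zeroes out more components.
```

In this toy setting the retained rank moves from 4 to 1 as lambda goes from 0.1 to 3.0, which illustrates why the choice of coefficient matters.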

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

The novelty of the work lies in applying the existing Ordered Dropout technique from Federated Learning (FjORD) to optimally order the heterogeneous ranks of various layers in DNNs based on an importance criterion, which results in discovering layer-wise low-rank decompositions. In contrast to uniform dropout across the width in each layer (FjORD), MAESTRO independently decomposes each layer to uncover its optimal rank. The authors provide applications of MAESTRO to various layer types in CNNs, FC, an

Weaknesses

1. The paper suffers from typos. The authors are encouraged to review and proofread the draft.
   - Page 1: …*find progressively*…
   - Page 2: …*novelly fuse*…
   - Page 3: …*have been proposed*… (multiple instances)
   - Page 4: …*HMA*…
   - Page 5: …*orthoghonal*…
2. It is recommended that authors explore a better illustration for Figure 1. For instance, there is not much difference visually in Factorized mapping and Ordered Representation when printed in black/white. It might be helpful to provide a b

Code & Models

Repositories

samuelhorvath/maestro-lod
PyTorch · Official

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Neural Network Applications · Medical Image Segmentation Techniques

Methods: Dropout · Principal Components Analysis