Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces

Bert Moons; Parham Noorzad; Andrii Skliar; Giovanni Mariani; Dushyant Mehta; Chris Lott; Tijmen Blankevoort

arXiv:2012.08859·cs.LG·August 30, 2021

TL;DR

DONNA is a scalable, rapid neural architecture search pipeline that efficiently finds diverse, hardware-aware neural network models using a knowledge distillation-based accuracy predictor and evolutionary search, outperforming existing methods in speed and efficiency.

Contribution

We introduce DONNA, a novel NAS framework that combines knowledge distillation, evolutionary search, and rapid finetuning to enable scalable, diverse, and hardware-aware neural network design.

Findings

01

DONNA is up to 100x faster than MnasNet at finding architectures.

02

DONNA architectures are 20% faster than EfficientNet-B0 on an Nvidia V100 GPU.

03

DONNA achieves 10% faster inference with slightly higher accuracy than MobileNetV2-1.4x on a smartphone.

Abstract

Current state-of-the-art Neural Architecture Search (NAS) methods neither scale efficiently to multiple hardware platforms nor handle diverse architectural search spaces. To remedy this, we present DONNA (Distilling Optimal Neural Network Architectures), a novel pipeline for rapid, scalable, and diverse NAS that scales to many user scenarios. DONNA consists of three phases. First, an accuracy predictor is built using blockwise knowledge distillation from a reference model. This predictor enables searching across diverse networks with varying macro-architectural parameters such as layer types and attention mechanisms, as well as across micro-architectural parameters such as block repeats and expansion rates. Second, a rapid evolutionary search finds a set of Pareto-optimal architectures for any scenario using the accuracy predictor and on-device measurements. Third, optimal models are…
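
The second phase above selects Pareto-optimal architectures from predicted accuracy and measured on-device cost. The following is a minimal sketch of that selection step only, not DONNA's actual implementation: the `Candidate` type, the `pareto_front` function, and the latency and accuracy values are all hypothetical, and a real pipeline would obtain them from an accuracy predictor and device measurements inside an evolutionary loop.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    latency_ms: float          # hypothetical on-device measurement
    predicted_accuracy: float  # hypothetical accuracy-predictor output


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both objectives
    (lower latency is better, higher predicted accuracy is better)."""
    # Sort by latency ascending; at equal latency, most accurate first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.predicted_accuracy))
    front: List[Candidate] = []
    best_accuracy = float("-inf")
    for c in ordered:
        # A candidate survives only if it is strictly more accurate than
        # every candidate that is at least as fast.
        if c.predicted_accuracy > best_accuracy:
            front.append(c)
            best_accuracy = c.predicted_accuracy
    return front


# Made-up values for illustration only.
candidates = [
    Candidate("a", 10.0, 0.70),
    Candidate("b", 12.0, 0.68),  # slower and less accurate than "a"
    Candidate("c", 15.0, 0.75),
    Candidate("d", 15.0, 0.74),  # same latency as "c", less accurate
    Candidate("e", 20.0, 0.75),  # slower than "c", no more accurate
    Candidate("f", 25.0, 0.78),
]

print([c.name for c in pareto_front(candidates)])  # ['a', 'c', 'f']
```

Sorting first keeps the filter to a single pass. Exact duplicates on both objectives collapse to one entry, which is usually acceptable when the goal is a set of distinct trade-off points.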

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Machine Learning and ELM

Methods: Knowledge Distillation · Softmax · Sigmoid Activation · Dropout · Dense Connections · Squeeze-and-Excitation Block · Global Average Pooling · MnasNet · Pointwise Convolution