LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
Mojan Javaheripi, Gustavo H. de Rosa, Subhabrata Mukherjee, Shital Shah, Tomasz L. Religa, Caio C. T. Mendes, Sebastien Bubeck, Farinaz Koushanfar, Debadeepta Dey

TL;DR
LiteTransformerSearch (LTS) is a training-free neural architecture search method that uses the number of decoder parameters as a proxy for perplexity. It finds Transformer architectures that balance task performance against hardware constraints such as latency and peak memory utilization, and it can be applied across diverse devices.
Contribution
The paper presents LTS, a training-free NAS algorithm that exploits the high rank correlation between decoder parameter count and task performance. This enables rapid, device-specific Transformer architecture search without training any candidate model.
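To make the notion of rank correlation concrete, here is a minimal sketch of Spearman's rank correlation in plain Python. It is not the paper's evaluation code: the function names `ranks` and `spearman` are hypothetical, and the four sample values are placeholders invented only to exercise the function, not measurements from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and each contains at least
    two distinct values (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for illustration only, NOT results from the paper.
toy_decoder_params = [10, 20, 30, 40]
toy_perplexity = [40.0, 35.0, 36.0, 30.0]
rho = spearman(toy_decoder_params, toy_perplexity)
```

Because lower perplexity is better, a strongly negative `rho` here would mean that architectures with more decoder parameters tend to rank better, which is the kind of relationship a parameter-count proxy relies on.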
Findings
Searched models achieve perplexity comparable to baselines with faster runtime and lower peak memory utilization.
The search runs directly on target devices without GPUs, reducing its carbon footprint.
Searched models outperform the 350M-parameter OPT model in accuracy and efficiency.
Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since…
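The idea of ranking candidates by decoder parameter count and trading that off against a hardware cost can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `Config`, `decoder_params`, `pareto_frontier`, and `layer_count_latency` are hypothetical names, the parameter count uses a simplified formula (four attention projections plus a two-layer feed-forward block per layer, ignoring biases, layer norms, and embeddings), and the latency function is a stand-in for whatever measurement is taken on the target device.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Config:
    """Hypothetical candidate architecture."""
    d_model: int
    d_inner: Tuple[int, ...]  # one feed-forward width per decoder layer


def decoder_params(cfg: Config) -> int:
    """Simplified non-embedding parameter count (an assumption, see lead-in)."""
    attention = 4 * cfg.d_model * cfg.d_model  # Q, K, V, and output projections
    return sum(attention + 2 * cfg.d_model * width for width in cfg.d_inner)


def pareto_frontier(
    configs: Sequence[Config],
    measure_latency: Callable[[Config], float],
) -> List[Config]:
    """Keep configs not dominated in (more parameters, lower latency)."""
    scored = [(decoder_params(c), measure_latency(c), c) for c in configs]
    frontier = []
    for params, latency, cfg in scored:
        dominated = any(
            other_params >= params
            and other_latency <= latency
            and (other_params > params or other_latency < latency)
            for other_params, other_latency, _ in scored
        )
        if not dominated:
            frontier.append(cfg)
    return frontier


def layer_count_latency(cfg: Config) -> float:
    """Stand-in cost: a real search would time the model on the device."""
    return float(len(cfg.d_inner))


candidates = [
    Config(d_model=64, d_inner=(128, 128)),
    Config(d_model=64, d_inner=(256,)),
    Config(d_model=32, d_inner=(64, 64, 64, 64)),
]
frontier = pareto_frontier(candidates, layer_count_latency)
```

No candidate is trained at any point: the only per-candidate work is an arithmetic parameter count and one hardware measurement, which is why a search of this shape can plausibly run on the target device itself.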
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Discriminative Fine-Tuning · Weight Decay · Attention Dropout · GPT-2 · Linear Layer · Dense Connections · Residual Connection
