LLM Performance Predictors are good initializers for Architecture Search
Ganesh Jawahar, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Dujian Ding

TL;DR
This paper explores using Large Language Models to build performance predictors for neural network architectures, enabling faster, more efficient neural architecture search with minimal accuracy trade-offs.
Contribution
It introduces a novel method of constructing performance predictors with LLMs, distills them into lightweight regression models, and develops a hybrid NAS algorithm that cuts search hours by about 50%.
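The distillation step can be illustrated with a minimal sketch: a small regression model is fitted to the scores produced by a more expensive predictor. Everything here is an assumption made for illustration: `teacher_predict` is a hypothetical stand-in for the LLM-based predictor and returns made-up scores, the architecture is reduced to a `(layers, dim)` pair, and the student is a plain linear model, which is not necessarily the model family the paper uses.

```python
import random
import statistics

def teacher_predict(arch):
    """Hypothetical stand-in for an LLM-based predictor (made-up scores)."""
    layers, dim = arch
    return 20.0 + 0.5 * layers + 0.004 * dim

def make_scaler(archs):
    """Standardize each hyperparameter and prepend a bias feature."""
    cols = list(zip(*archs))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    def scale(arch):
        return [1.0] + [(v - m) / s for v, m, s in zip(arch, means, stds)]
    return scale

def fit_student(archs, targets, lr=0.1, epochs=500):
    """Fit a linear student to the teacher's scores by batch gradient descent on MSE."""
    scale = make_scaler(archs)
    rows = [scale(a) for a in archs]
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return lambda arch: sum(wi * xi for wi, xi in zip(w, scale(arch)))

# Label a pool of sampled architectures with the teacher, then fit the student.
rng = random.Random(0)
train_archs = [(rng.randint(1, 6), rng.choice([512, 640, 768, 1024])) for _ in range(200)]
train_targets = [teacher_predict(a) for a in train_archs]
student = fit_student(train_archs, train_targets)

# Mean absolute error between student and teacher on a few other architectures.
held_out = [(3, 640), (6, 1024), (1, 512)]
mae = sum(abs(student(a) - teacher_predict(a)) for a in held_out) / len(held_out)
```

Once fitted, the student answers queries without calling the teacher again, which is the cost argument behind distilling the predictor.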
Findings
LLM-based predictors achieve state-of-the-art accuracy in machine translation tasks.
Distilled predictors retain much of the original LLM predictor's performance.
Hybrid NAS reduces search hours by about 50% while maintaining or improving model quality.
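The hybrid idea in the last finding can be sketched as a search that scores candidates with a cheap predictor for the early iterations and switches to a costlier one afterwards. This is a toy illustration under stated assumptions, not the paper's algorithm: `hybrid_search`, `switch_fraction`, the `(layers, dim)` search space, and both scoring functions are hypothetical, and the selection and mutation loop is a generic evolutionary scheme.

```python
import random

def hybrid_search(cheap_score, costly_score, sample, mutate,
                  iterations=10, population=8, switch_fraction=0.5, seed=0):
    """Evolutionary-style search; returns (best_arch, number_of_costly_calls)."""
    rng = random.Random(seed)
    calls = {"costly": 0}

    def counted_costly(arch):
        # Count every call to the expensive predictor.
        calls["costly"] += 1
        return costly_score(arch)

    pop = [sample(rng) for _ in range(population)]
    switch_at = int(iterations * switch_fraction)
    for step in range(iterations):
        # Cheap predictor early, costly predictor after the switch point.
        score = counted_costly if step >= switch_at else cheap_score
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[: population // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - len(parents))]
        pop = parents + children
    best = max(pop, key=counted_costly)
    return best, calls["costly"]

# Toy search space and made-up scoring functions (assumptions).
def sample(rng):
    return (rng.randint(1, 6), rng.choice([512, 640, 768, 1024]))

def mutate(arch, rng):
    layers, dim = arch
    return (min(6, max(1, layers + rng.choice([-1, 1]))), dim)

def costly_score(arch):
    return 20.0 + 0.5 * arch[0] + 0.004 * arch[1]

def cheap_score(arch):
    return 20.0 + 0.4 * arch[0] + 0.004 * arch[1]

best_hybrid, hybrid_calls = hybrid_search(cheap_score, costly_score, sample, mutate)
best_full, full_calls = hybrid_search(cheap_score, costly_score, sample, mutate,
                                      switch_fraction=0.0)
```

In this toy setup the savings come entirely from the iterations where the costly predictor is never called; the actual reduction reported in the paper depends on its own predictors and switch schedule.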
Abstract
In this work, we utilize Large Language Models (LLMs) for a novel use case: constructing Performance Predictors (PP) that estimate the performance of specific deep neural network architectures on downstream tasks. We create PP prompts for LLMs, comprising (i) role descriptions, (ii) instructions for the LLM, (iii) hyperparameter definitions, and (iv) demonstrations presenting sample architectures with efficiency metrics and "training from scratch" performance. In machine translation (MT) tasks, GPT-4 with our PP prompts (LLM-PP) achieves a SoTA mean absolute error and a slight degradation in rank correlation coefficient compared to baseline predictors. Additionally, we demonstrate that predictions from LLM-PP can be distilled to a compact regression model (LLM-Distill-PP), which surprisingly retains much of the performance of LLM-PP. This presents a cost-effective alternative for…
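The four prompt components listed in the abstract can be assembled roughly as follows. This is a minimal sketch: the function name `build_pp_prompt`, the section labels, the field names (`arch`, `gflops`, `bleu`), and every numeric value are hypothetical placeholders, not the paper's actual prompt or data.

```python
def build_pp_prompt(role, instructions, hyperparameter_definitions,
                    demonstrations, query_arch):
    """Assemble a performance-predictor prompt from its four components."""
    lines = [role, "", "Instructions:"]
    lines += [f"- {item}" for item in instructions]
    lines += ["", "Hyperparameter definitions:"]
    lines += [f"- {name}: {meaning}"
              for name, meaning in hyperparameter_definitions.items()]
    lines += ["", "Demonstrations:"]
    for demo in demonstrations:
        lines.append(
            f"Architecture: {demo['arch']} | GFLOPs: {demo['gflops']} | BLEU: {demo['bleu']}"
        )
    # The query architecture comes last, with the score left for the LLM to fill in.
    lines += ["", f"Architecture: {query_arch} | BLEU:"]
    return "\n".join(lines)

# Placeholder inputs; the numbers below are invented for illustration only.
prompt = build_pp_prompt(
    role="You are a performance estimator for machine translation models.",
    instructions=["Predict the score of the final architecture.",
                  "Answer with a single number."],
    hyperparameter_definitions={
        "encoder_layers": "number of encoder layers",
        "embed_dim": "embedding dimension",
    },
    demonstrations=[
        {"arch": {"encoder_layers": 6, "embed_dim": 512}, "gflops": 1.0, "bleu": 10.0},
        {"arch": {"encoder_layers": 3, "embed_dim": 640}, "gflops": 2.0, "bleu": 11.0},
    ],
    query_arch={"encoder_layers": 4, "embed_dim": 512},
)
```

Ending the prompt on an open `BLEU:` field is one way to steer the model toward returning a bare number; other output formats would work equally well.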
Peer Reviews
Decision: Submitted to ICLR 2024
- Innovative use of LLMs for the purpose of performance prediction.
- The introduction of LLM-Distill-PP and the Hybrid-Search algorithm significantly reduces the latency in searching for architectures.
- Extensive experiments demonstrate the efficiency of the proposed methods, highlighting their practicality.

- The paper could benefit from a more in-depth exploration of the validation methods used. The explanations in Sections 3 and 4 do not clearly articulate the problem statement and baseline comparisons.
- While the concept of distillation is critical, the paper's narrative feels disjointed. The scientific discourse between Chapters 5 and 6 appears fragmented and could be more cohesively presented.
- The figures require refinement; the font aesthetics are lacking, particularly in Figure 2. Algorit
Using LLMs for performance prediction is interesting and fairly novel. Since LLMs are trained on the whole internet, with an emphasis on code, it is reasonable that an LLM would have an idea on the performance of architectures, especially well-known architectures. The authors use the LLM-based supernet at the start of training, and then replace with a supernet. This fits the intuition that LLMs are strongest at performance prediction early on, but are no match for computational-based methods a
Overall, I am concerned that the paper is a bit too narrow in a few parts. **Comparison to other methods.** The authors use three baselines, all of which are supernetwork-based performance predictors. The authors also make the statement, “The SOTA approach for building performance predictors (f_T ) is to train a weight-sharing supernet model on the task T.” It is highly unclear that this sentence is true. There are many different types of performance predictors, such as zero-cost proxies and le
This paper presents an interesting method to predict model performance on a common model architecture, such as the transformer-base encoder-decoder version, and on a common dataset like WMT'14.
The effectiveness of the proposed method largely depends on how much information GPT-4 has "memorized." Since GPT-4 is a language model, its impressive prediction performance on WMT'14 (or WMT'19), transformer-base, translation direction, and BLEU is primarily because **these elements are commonly used for machine translation**. The authors need to recognize the limitations when dealing with less conventional models, datasets, translation directions, metrics, and other tasks and discuss these in
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Machine Learning in Materials Science
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout · Layer Normalization
