From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT
Mahmoud Hanouneh, Radu Timofte, Dmitry Ignatov

TL;DR
This paper shows that fine-tuned LLMs can predict on which of two datasets a given neural network architecture achieves higher accuracy. Code-only prompts outperform metadata-based prompts, which suggests that the architecture source code itself carries rich predictive information.
Contribution
It introduces a novel classification task within NNGPT, showing that LLMs can reason about neural network performance from code alone, surpassing metadata-based approaches.
Findings
Code-only prompts achieve 80% accuracy in predicting which of two datasets an architecture performs better on.
Metadata prompts reach 70% accuracy and do best when dataset properties are distinctive.
Model capacity influences the effectiveness of architectural reasoning.
Abstract
Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched…
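To make the pairwise task concrete, here is a minimal sketch of how one code-only example and its label might be constructed, and how classification accuracy could be scored. It is an illustration under stated assumptions, not the NNGPT or LEMUR implementation: the names `PairExample`, `build_code_only_example`, and `classification_accuracy`, as well as the prompt wording, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairExample:
    prompt: str  # text shown to the LLM
    label: str   # name of the dataset with the higher accuracy


def build_code_only_example(model_code, dataset_a, acc_a, dataset_b, acc_b):
    """Build one pairwise example (hypothetical prompt format).

    The accuracies are used only to derive the label; they are never
    written into the prompt, so the model sees source code and dataset
    names alone.
    """
    if acc_a == acc_b:
        raise ValueError("tied accuracies give no well-defined label")
    prompt = (
        "Architecture source code:\n"
        f"{model_code}\n\n"
        "On which dataset does this model reach higher accuracy: "
        f"{dataset_a} or {dataset_b}?"
    )
    label = dataset_a if acc_a > acc_b else dataset_b
    return PairExample(prompt=prompt, label=label)


def classification_accuracy(predictions, labels):
    """Fraction of predicted dataset names that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    if not labels:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy usage with made-up values (not results from the paper).
example = build_code_only_example(
    "class Net(nn.Module): ...", "dataset_a", 0.91, "dataset_b", 0.47
)
score = classification_accuracy([example.label], ["dataset_a"])
```

Keeping the accuracies out of the prompt is the point of the design: if they leaked into the input, the task would collapse into the trivial comparison that the abstract's normalized-accuracy baseline represents.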
