SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang

TL;DR
This paper introduces SURGE, a comprehensive benchmark to evaluate large language models as surrogate code executors across diverse programming and computational tasks, revealing their potential and limitations.
Contribution
It systematically assesses LLMs' ability to predict code execution outcomes, providing new insights into their effectiveness as surrogate models for various complex programming tasks.
Findings
LLMs show promising capabilities in code prediction tasks.
LLM performance on code execution prediction varies with model scale.
Data efficiency differs across the benchmark's programming tasks.
Abstract
Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To investigate this systematically, we introduce SURGE, a comprehensive benchmark with problems covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of open-source and proprietary LLMs, we examine scaling laws, data efficiency,…
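To make "surrogate code execution" concrete, the following is a minimal sketch, not the SURGE evaluation harness. It assumes a surrogate is scored by exact match between its predicted output and the program's real standard output. The names `run_program`, `exact_match_accuracy`, `toy_predictor`, and `PROGRAMS` are hypothetical, and `toy_predictor` is a stand-in for an actual LLM call.

```python
import subprocess
import sys


def run_program(source: str, timeout: float = 5.0) -> str:
    """Execute Python source in a subprocess and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


def exact_match_accuracy(programs, predict) -> float:
    """Fraction of programs whose predicted output equals the real output."""
    if not programs:
        return 0.0
    hits = 0
    for source in programs:
        actual = run_program(source).strip()      # ground truth from real execution
        predicted = predict(source).strip()       # surrogate's guess, no execution
        hits += int(predicted == actual)
    return hits / len(programs)


def toy_predictor(source: str) -> str:
    # Hypothetical stand-in for an LLM call (assumption): always answers "4".
    return "4"


# Two tiny illustrative programs (not taken from the benchmark).
PROGRAMS = ["print(2 + 2)", "print(sum(range(5)))"]
```

Calling `exact_match_accuracy(PROGRAMS, toy_predictor)` scores the stand-in predictor against real execution. Exact match is only one possible metric; tasks with numeric or long outputs may call for a more tolerant comparison.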
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Model-Driven Software Engineering Techniques
