SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Bohan Lyu; Siqiao Huang; Zichen Liang; Qi-An Sun; Jiaming Zhang

arXiv:2502.11167 · cs.LG · September 30, 2025

TL;DR

This paper introduces SURGE, a comprehensive benchmark to evaluate large language models as surrogate code executors across diverse programming and computational tasks, revealing their potential and limitations.

Contribution

It systematically assesses LLMs' ability to predict code execution outcomes, providing new insights into their effectiveness as surrogate models for various complex programming tasks.

Findings

01. LLMs show promising capabilities in code prediction tasks.

02. Scaling laws influence LLM performance in code execution.

03. Data efficiency varies across different programming challenges.

Abstract

Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To investigate this question systematically, we introduce SURGE, a comprehensive benchmark with 1160 problems covering 8 key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of 21 open-source and proprietary LLMs, we examine scaling laws, data efficiency,…
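As a rough illustration of what "code execution prediction" means here, the Python sketch below compares a predictor's guessed output with the output obtained by actually running a program. It is not SURGE's evaluation harness: the names `run_program`, `exact_match`, `score`, and `constant_predictor`, the toy programs, and the exact-match metric are all assumptions made for illustration, and the predictor is a hypothetical stand-in for an LLM call.

```python
import contextlib
import io
from typing import Callable, Iterable


def run_program(source: str) -> str:
    """Actually execute a snippet and capture its stdout (the ground truth).

    Uses exec() in-process with no sandboxing, so only pass trusted toy code.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


def exact_match(predicted: str, actual: str) -> bool:
    """Compare outputs after trimming surrounding whitespace (an assumed metric)."""
    return predicted.strip() == actual.strip()


def score(programs: Iterable[str], predict: Callable[[str], str]) -> float:
    """Fraction of programs whose predicted output matches real execution."""
    programs = list(programs)
    hits = sum(exact_match(predict(src), run_program(src)) for src in programs)
    return hits / len(programs)


def constant_predictor(source: str) -> str:
    """Hypothetical stand-in for an LLM call: always guesses '3'."""
    return "3"


# Toy programs (assumed for illustration, not taken from the benchmark).
toy_programs = ["print(1 + 2)", "print('ab' * 2)"]
accuracy = score(toy_programs, constant_predictor)
```

In a real setup, `constant_predictor` would be replaced by a function that sends the source code to a model and returns the model's predicted output; a surrogate is useful to the extent that this prediction matches execution without the cost of running the program.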

Code & Models

Repositories

imbernoulli/surge (Official)

Videos

Surge: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Model-Driven Software Engineering Techniques