Code Simulation as a Proxy for High-order Tasks in Large Language Models

Emanuele La Malfa; Christoph Weinhuber; Orazio Torre; Fangru Lin; X. Angelo Huang; Samuele Marro; Anthony Cohn; Nigel Shadbolt; Michael Wooldridge

arXiv:2502.03568 · cs.LG · July 8, 2025

TL;DR

This paper investigates using synthetic programming tasks as a scalable proxy to evaluate and understand the reasoning capabilities of large language models, revealing their strengths and fragilities.

Contribution

The paper introduces synthetic datasets built from common programming constructs to assess LLM reasoning. These datasets offer a scalable alternative to handcrafted naturalistic tasks, and the paper analyzes where LLMs fail on them.

Findings

01. LLMs perform well on synthetic reasoning tasks, but that performance is fragile.

02. Performance degrades with memorization and appears to rely heavily on pattern recognition.

03. Synthetic data is, in many cases, a good proxy for naturalistic reasoning tasks and is much easier to collect at scale.

Abstract

Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. We collect pairs of naturalistic and synthetic reasoning tasks to assess the capabilities of Large Language Models (LLMs). While naturalistic tasks often require careful human handcrafting, we show that synthetic data is, in many cases, a good proxy that is much easier to collect at scale. We leverage common constructs in programming as the counterpart of the building blocks of naturalistic reasoning tasks, such as straight-line programs, code that contains critical paths, and approximate and redundant instructions. We also assess the capabilities of LLMs on sorting problems and repeated operations via sorting algorithms and nested loops. Our synthetic datasets further reveal that while the most powerful LLMs…
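
To make the idea of a synthetic code-simulation task concrete, here is a minimal sketch of how a straight-line program (assignments only, no branches or loops) could be generated together with its ground-truth final state. It is an illustration under stated assumptions, not the paper's dataset generator: the function name `make_straight_line_program`, the variable naming scheme, the operator set, and the prompt wording are all hypothetical.

```python
import random

# Hypothetical operator set; the paper's actual instruction set is not given here.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_straight_line_program(n_vars=3, n_steps=5, seed=0):
    """Return (source, final_state) for a random straight-line program.

    The program first initialises every variable with a small constant,
    then performs n_steps assignments of the form `t = a op b`.
    The final state is tracked while generating, so no model or
    external interpreter is needed to obtain the ground truth.
    """
    rng = random.Random(seed)
    names = [f"x{i}" for i in range(n_vars)]
    state = {}
    lines = []

    # Initialise every variable so later reads are always defined.
    for name in names:
        value = rng.randint(0, 9)
        state[name] = value
        lines.append(f"{name} = {value}")

    # Each step overwrites one variable with a binary operation on two others.
    for _ in range(n_steps):
        target = rng.choice(names)
        left = rng.choice(names)
        right = rng.choice(names)
        symbol = rng.choice(sorted(OPS))
        state[target] = OPS[symbol](state[left], state[right])
        lines.append(f"{target} = {left} {symbol} {right}")

    return "\n".join(lines), state


def is_correct(predicted_state, expected_state):
    """Exact-match scoring: every variable must have the right final value."""
    return predicted_state == expected_state


source, expected = make_straight_line_program(seed=42)
# Hypothetical prompt wording for querying a model.
prompt = (
    "Simulate this program step by step and report the final value "
    "of every variable:\n" + source
)
```

Because the expected state is computed during generation, tasks like this can be produced in any quantity and scored automatically, which is the scalability argument the abstract makes for synthetic data over handcrafted naturalistic tasks.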

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Model-Driven Software Engineering Techniques