Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation
Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati

TL;DR
This empirical study examines the reasoning errors that large language models make when simulating code execution, identifies their common failure modes, and shows that tool augmentation corrects a majority (58%) of computation errors.
Contribution
It provides the first detailed taxonomy of reasoning errors in LLMs and evaluates the effectiveness of tool-augmented reasoning in correcting these errors.
Findings
Models achieve 85-98% accuracy on code reasoning tasks.
Nine categories of inference errors are identified.
Tool augmentation corrects 58% of computation errors.
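As a rough illustration of what correcting a computation error with a tool could look like, the sketch below re-evaluates an arithmetic step from a reasoning trace with a small, restricted evaluator and compares it to the value the model claimed. The helper names (`safe_eval`, `check_step`) and the example step are hypothetical and are not taken from the paper; the authors' actual tool setup is not described in this summary.

```python
import ast
import operator

# Arithmetic operators the checker is allowed to evaluate (assumption: a
# small whitelist is enough for illustrating the idea).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}


def safe_eval(expr):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


def check_step(expr, claimed):
    """Return (is_correct, actual_value) for one claimed arithmetic step."""
    actual = safe_eval(expr)
    return actual == claimed, actual


# Hypothetical trace step: the model wrote "17 * 23 = 381".
ok, actual = check_step("17 * 23", 381)
print(ok, actual)
```

Here the tool, not the model, is the source of truth for the arithmetic, so a miscalculated intermediate value can be replaced before it propagates to the predicted output.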
Abstract
Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language models (LLMs) can accurately predict program outputs, most prior work has focused on output accuracy and performance, treating reasoning as a black box. As a result, little is known about the structure or failure modes of their reasoning traces. To address this gap, we conduct the first empirical study on runtime behavior inference with reasoning LLMs, aiming to uncover and characterize errors in their reasoning traces. We curate a benchmark from HumanEval Plus and LiveCodeBench, containing 427 code snippets. For each snippet, we test three input types: regular, edge, and invalid. Twelve input values are selected per snippet, each paired with its…
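To make the three input types concrete, here is a minimal sketch of how ground truth for one snippet might be recorded by actually executing it. The function `mean`, the chosen inputs, and the helper `ground_truth` are illustrative assumptions; the paper's own snippets and its exact definitions of regular, edge, and invalid inputs are not shown in this excerpt.

```python
def mean(values):
    """Hypothetical snippet under test: arithmetic mean of a list."""
    return sum(values) / len(values)


def ground_truth(func, arg):
    """Run func(arg) and record either its result or the exception type."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", type(exc).__name__)


# One assumed example per input type.
cases = {
    "regular": [1, 2, 3],   # typical input
    "edge": [7],            # boundary case: single element
    "invalid": "abc",       # wrong type: summing characters fails
}

results = {kind: ground_truth(mean, arg) for kind, arg in cases.items()}
for kind, outcome in results.items():
    print(kind, outcome)
```

A model's predicted output for each input can then be compared against the recorded outcome, including whether it anticipates the exception raised on the invalid input.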
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
