The Path Not Taken: Duality in Reasoning about Program Execution
Eshgin Hasanov, Md Mahadi Hassan Sibat, Santu Karmaker, Aashish Yadavally

TL;DR
This paper introduces DexBench, a benchmark with dual reasoning tasks to evaluate large language models' understanding of program execution, focusing on causal comprehension beyond surface patterns.
Contribution
It proposes a novel duality-based evaluation framework and benchmark to better assess models' dynamic code reasoning capabilities.
Findings
Dual-path reasoning correlates with better code understanding.
The 13 LLMs evaluated vary in performance on the dual tasks.
DexBench effectively discriminates models' causal reasoning abilities.
Abstract
Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and…
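To make the two tasks concrete, here is a minimal sketch on a toy program. It is an illustration only and not DexBench's actual instance format or evaluation harness: the program `classify` and the helper names `predict_behavior` and `check_mutation` are hypothetical, and "behavior" is simplified to the returned label, which identifies the execution path taken.

```python
def classify(n: int) -> str:
    """Hypothetical toy program under analysis, with three execution paths."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"


def predict_behavior(program, given_input):
    """Task (i): ground truth for 'what does the program do on this input?'

    A model's prediction would be compared against this value.
    """
    return program(given_input)


def check_mutation(program, mutated_input, target_behavior):
    """Task (ii): check that a proposed mutated input reaches the target behavior.

    A model is given the original input and a behavioral objective, and
    proposes a mutated input; running the program verifies the proposal.
    """
    return program(mutated_input) == target_behavior


# Task (i): predict the observed behavior for a given input.
original_input = 4
observed = predict_behavior(classify, original_input)

# Task (ii): mutate the input so that execution takes the "odd" path instead.
proposed_mutation = 7
reaches_target = check_mutation(classify, proposed_mutation, "odd")

print(observed, reaches_target)
```

In this sketch the two tasks share one program and one starting input, so answering both correctly requires knowing which condition steers execution down each path, not just recalling an output seen before.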
