STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning
Kaiwen Yan, Yuhang Chang, Zirui Guo, Yaling Mou, Jiang Ming, Jingwei Sun

TL;DR
STEPWISE-CODEX-Bench is a new benchmark that evaluates complex multi-function understanding and detailed execution reasoning in large language models, revealing their limitations in dynamic, multi-step code comprehension.
Contribution
The paper introduces SX-Bench, a benchmark for assessing multi-function and fine-grained reasoning in code understanding, with an automated pipeline for generation and validation.
Findings
Even state-of-the-art models fall well short of perfect accuracy on the complex tasks.
SX-Bench effectively discriminates between model reasoning capabilities.
SX-Bench reveals bottlenecks in current models' understanding of dynamic execution.
Abstract
In recent years, large language models (LLMs) have made significant progress in code intelligence, yet systematically evaluating their code understanding and reasoning abilities remains challenging. Mainstream benchmarks such as HumanEval and MBPP primarily assess functional correctness, while reasoning benchmarks like CRUXEVAL are limited to single-function, low-complexity scenarios. As a result, advanced models achieve nearly saturated scores, limiting their discriminative power. To address this, we present STEPWISE-CODEX-Bench (SX-Bench), a novel benchmark designed for complex multi-function understanding and fine-grained execution reasoning. SX-Bench features tasks involving collaboration among multiple sub-functions (e.g., chained calls, nested loops), shifting evaluation towards overall control and data flow modeling. It defines "computation steps" as the minimal execution unit…
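To make the idea of multi-function tasks with countable "computation steps" concrete, here is a minimal illustrative sketch in Python. It is not taken from the benchmark. The function names (`scale`, `pairwise_sum`, `run_task`), the `StepCounter` helper, and the rule of counting one step per arithmetic update inside a loop body are all assumptions made for illustration. The truncated abstract does not give SX-Bench's actual step definition, so it may well differ.

```python
from dataclasses import dataclass


@dataclass
class StepCounter:
    """Hypothetical helper: counts 'computation steps' as defined in this sketch."""
    steps: int = 0

    def tick(self) -> None:
        self.steps += 1


def scale(values, factor, counter):
    """Sub-function 1: multiply each element; one step per multiplication."""
    out = []
    for v in values:
        out.append(v * factor)
        counter.tick()
    return out


def pairwise_sum(values, counter):
    """Sub-function 2: nested loop over unordered pairs; one step per pair."""
    total = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            total += values[i] + values[j]
            counter.tick()
    return total


def run_task(values, factor):
    """Chained call: the output of scale() feeds pairwise_sum().

    Returns both the final result and the total step count, so a model
    could be asked to predict either one.
    """
    counter = StepCounter()
    scaled = scale(values, factor, counter)
    result = pairwise_sum(scaled, counter)
    return result, counter.steps


if __name__ == "__main__":
    print(run_task([1, 2, 3], 2))
```

In a task of this shape, answering correctly requires tracing data flow across both functions and tracking how many loop iterations actually execute, rather than recognizing what a single function does.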
