
TL;DR
This paper investigates Code World Models (CWMs) and identifies two dominant failure modes: token-budget exhaustion on programs with long execution traces, and errors concentrated in string-valued state. It also shows that improving action accuracy can improve long-horizon state tracking.
Contribution
It provides a detailed analysis of CWMs' failure regimes and demonstrates that correct action generation significantly improves long-term state propagation.
Findings
Token budget exhaustion limits long execution traces.
String state limitations stem from subword tokenization.
Correct action replacement improves long-horizon accuracy.
Abstract
Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study…
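To make the token-budget point concrete, here is a minimal sketch of what recording runtime state after every executed line looks like, and how quickly the record grows. It is not the paper's trace format. The names `trace_states`, `running_sum`, and `trace_length` are hypothetical, and the character count of the serialized trace is only a crude stand-in for a token count.

```python
import sys


def trace_states(func, *args):
    """Run func(*args), recording a snapshot of its locals at every line event.

    Illustrative only: a real CWM trace format may differ substantially.
    """
    records = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the function under study, not anything it calls.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Snapshot taken just before the line at this offset executes.
            offset = frame.f_lineno - code.co_firstlineno
            records.append((offset, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, records


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


def trace_length(n):
    """Size of the serialized trace in characters (a rough proxy for tokens)."""
    _, records = trace_states(running_sum, n)
    return sum(len(repr(record)) for record in records)


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, trace_length(n))
```

Even for this three-line loop, the serialized trace grows with the number of iterations, because every executed line contributes a full state snapshot. A fixed generation budget therefore caps how many steps of execution can be represented at all.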
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Parallel Computing and Optimization Techniques
