The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
Subramanyam Sahoo

TL;DR
This paper introduces CTVP, a novel verification framework for detecting malicious behavior in code-generating language models by analyzing consistency in predicted execution traces across program transformations, supported by theoretical bounds.
Contribution
It proposes a new semantic orbit analysis method and the ARQ metric for provably detecting backdoors in untrusted code models, grounded in information theory.
Findings
Verification cost relative to baseline generation, as quantified by ARQ, grows exponentially with orbit size
Information-theoretic bounds indicate that adversaries cannot game the verification protocol
High false positive rates observed in initial tests
Abstract
Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing…
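The consistency check described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the orbit is built with a single transformation (variable renaming), `predict_trace` is a hypothetical callable standing in for the untrusted model's trace prediction, traces are compared by exact equality, and the threshold is arbitrary. The code under test is never executed; only predicted traces are compared.

```python
import ast
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


class _RenameVars(ast.NodeTransformer):
    """Rename identifiers using a fixed mapping (assumed semantics-preserving)."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def semantic_orbit(source: str, renamings: List[Dict[str, str]]) -> List[str]:
    """Return the original program plus one renamed variant per mapping.

    The caller is assumed to supply mappings that avoid builtins and
    name collisions, so each variant behaves like the original.
    """
    variants = [source]
    for mapping in renamings:
        tree = _RenameVars(mapping).visit(ast.parse(source))
        variants.append(ast.unparse(tree))  # requires Python 3.9+
    return variants


def consistency_score(traces: Sequence[Tuple[str, ...]]) -> float:
    """Fraction of trace pairs that agree exactly (1.0 means fully consistent)."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def verify(
    source: str,
    renamings: List[Dict[str, str]],
    predict_trace: Callable[[str], Sequence[str]],  # hypothetical model call
    threshold: float = 1.0,  # illustrative value, not taken from the paper
) -> Tuple[bool, float]:
    """Accept the program only if predicted traces agree across the orbit."""
    orbit = semantic_orbit(source, renamings)
    traces = [tuple(predict_trace(program)) for program in orbit]
    score = consistency_score(traces)
    return score >= threshold, score
```

In this sketch, a predictor whose output depends on surface features of the program (for example, the presence of a particular identifier) produces disagreeing traces across the orbit and lowers the score, while a predictor that tracks semantics only yields a score of 1.0. The sketch does not model ARQ or the theoretical bounds.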
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Software Engineering Research
