TL;DR
This paper introduces RTCE, a benchmark for evaluating bidirectional code reasoning in LLMs, revealing current models' limitations in internal coherence and invertibility.
Contribution
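The paper presents RoundTripCodeEval (RTCE), a benchmark that scores round-trip code reasoning by execution-free exact match across four lossless compression algorithms, and shows that current LLMs lack the internal consistency and algorithmic understanding needed for reliable bidirectional reasoning.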
Findings
Models often pass individual forward and backward tasks but fail the combined round-trip.
Supervised fine-tuning on execution traces and iterative self-reflection yield only modest improvements; neither closes the gap in bidirectional reasoning.
Failures occur even on simple bijections such as run-length encoding (RLE), indicating fundamental limitations.
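To make the RLE finding concrete, here is a minimal sketch of what a bijective encode/decode pair and a round-trip check could look like. The function names and the pair-list output format are illustrative assumptions, not the benchmark's actual task definition.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Forward direction: collapse each run of equal characters into (char, count)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Backward direction: expand each (char, count) pair back into a run."""
    return "".join(ch * count for ch, count in pairs)


def round_trips(text: str) -> bool:
    """True when decoding the encoding recovers the input exactly."""
    return rle_decode(rle_encode(text)) == text


sample = "aaabccdddd"
encoded = rle_encode(sample)  # [('a', 3), ('b', 1), ('c', 2), ('d', 4)]
assert round_trips(sample)
```

For real code the round-trip property holds by construction. The benchmark asks whether a model that predicts both directions without executing anything preserves the same property.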
Abstract
LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round-trip consistency through execution-free, exact-match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and iterative self-reflection. All approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces findings invisible to existing benchmarks: models frequently pass individual forward and backward tasks yet fail the combined round-trip, exposing mutually inconsistent internal…
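The gap between per-direction accuracy and round-trip accuracy can be tallied as in the sketch below. The record layout (`ItemResult`) and the metric names are hypothetical and chosen only for illustration; the paper's own scoring protocol may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """Exact-match outcomes for one benchmark item (hypothetical layout)."""
    forward_ok: bool     # predicted encoding equals the reference encoding
    backward_ok: bool    # predicted decoding equals the reference decoding
    round_trip_ok: bool  # the combined round-trip recovers the original input


def exact_match(predicted: str, expected: str) -> bool:
    """Execution-free scoring: strict string equality, no partial credit."""
    return predicted == expected


def summarize(results: list[ItemResult]) -> dict[str, float]:
    """Return per-direction accuracy, round-trip accuracy, and the hidden-failure rate."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    # Items that pass both single-direction tasks yet fail the round-trip.
    hidden = sum(
        r.forward_ok and r.backward_ok and not r.round_trip_ok for r in results
    )
    return {
        "forward": sum(r.forward_ok for r in results) / n,
        "backward": sum(r.backward_ok for r in results) / n,
        "round_trip": sum(r.round_trip_ok for r in results) / n,
        "hidden_failures": hidden / n,
    }
```

A nonzero `hidden_failures` value corresponds to the case the abstract highlights: an item that looks solved under forward-only and backward-only evaluation but fails once the two directions are combined.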
