From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
Jinhua Wang, Biswa Sengupta

TL;DR
This paper presents a benchmark-driven methodology for using LLMs to translate a large production codebase from Rust to Python. The resulting Python agent reaches near-parity with the Rust original on public agent benchmarks and extends it with additional features.
Contribution
The paper introduces a benchmark-driven, iterative translation process for large codebases that keeps the port synchronized with the evolving source and supports feature extension in the target language.
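As a rough illustration of what a benchmark-driven refinement loop could look like, here is a minimal sketch. It is not the authors' implementation: the function name `refine_until_parity`, the callbacks `run_benchmark` and `apply_fix`, and the stopping rule are all hypothetical, chosen only to show the idea of using a benchmark pass rate as the objective for iterative fixes.

```python
from typing import Callable, Dict, List


def refine_until_parity(
    run_benchmark: Callable[[], Dict[str, bool]],  # hypothetical: task id -> resolved?
    apply_fix: Callable[[List[str]], None],        # hypothetical: repair step for failing tasks
    target_pass_rate: float,
    max_rounds: int = 10,
) -> float:
    """Run the benchmark, stop once the target pass rate is met, otherwise fix and retry."""
    pass_rate = 0.0
    for _ in range(max_rounds):
        results = run_benchmark()
        pass_rate = sum(results.values()) / len(results)
        if pass_rate >= target_pass_rate:
            break
        # Hand the failing task ids to the repair step (an LLM call in the paper's setting).
        failing = sorted(task for task, ok in results.items() if not ok)
        apply_fix(failing)
    return pass_rate
```

In this sketch the target pass rate would plausibly be set from the source implementation's score on the same benchmark, so that the loop stops when the port reaches parity.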
Findings
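The Python port achieves near-parity with the Rust original on real-world agentic tasks: 73.8% versus 70.0% on SWE-bench Verified, and 42.5% versus 47.5% on Terminal-Bench.
Benchmark-driven debugging is more effective than static testing methods.
The Python version is 15.9x smaller in code size, with minimal performance loss.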
Abstract
Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static…
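The headline figures quoted in the abstract can be sanity-checked with a few lines of Python. This sketch uses only numbers stated above. The LOC counts are the rounded values (648K and 41K), so the size ratio it computes is approximate and lands slightly below the reported 15.9x, presumably because the reported figure uses unrounded counts (an assumption). The variable names are illustrative.

```python
def pass_rate(resolved: int, total: int) -> float:
    """Percentage of benchmark tasks resolved."""
    return 100 * resolved / total


# SWE-bench Verified subset: resolved tasks out of 80, as quoted in the abstract.
python_swe = pass_rate(59, 80)
rust_swe = pass_rate(56, 80)

# Terminal-Bench scores are quoted directly as percentages.
python_tb, rust_tb = 42.5, 47.5

# Rounded LOC counts from the abstract; the ratio is therefore approximate.
loc_ratio = 648_000 / 41_000

print(f"SWE-bench Verified: Python {python_swe:.2f}% vs Rust {rust_swe:.2f}%")
print(f"Terminal-Bench gap (Python minus Rust): {python_tb - rust_tb:+.1f} points")
print(f"Approximate code-size ratio: {loc_ratio:.1f}x")
```

On these numbers the Python port leads by 3 tasks out of 80 on SWE-bench Verified and trails by 5 percentage points on Terminal-Bench, which is the sense in which the abstract speaks of near-parity.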
