TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao

TL;DR
This paper introduces TRACE, a benchmark for evaluating execution efficiency in LLM-based code translation, revealing that correctness does not imply efficiency and that inefficiencies are widespread across models and languages.
Contribution
The paper presents TRACE, the first benchmark explicitly designed to assess efficiency in LLM-translated code, and provides a comprehensive evaluation of 28 models highlighting efficiency issues.
Findings
Correctness does not reliably indicate efficiency.
23.5% of correct translations are inefficient.
Inference-time prompt strategies only modestly improve efficiency.
Abstract
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of…
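As a rough illustration of why stress tests matter here, the sketch below compares two functionally equivalent Python routines that pass the same small test but diverge sharply on a large input. It is not taken from TRACE: the function names (`dedupe_quadratic`, `dedupe_linear`, `time_call`), the task, and the input size are all hypothetical choices made for this example, and the benchmark's actual tasks and measurement harness may look quite different.

```python
import time


def dedupe_quadratic(items):
    """Order-preserving dedupe using list membership checks (quadratic worst case)."""
    seen = []
    for x in items:
        if x not in seen:  # linear scan of a list on every iteration
            seen.append(x)
    return seen


def dedupe_linear(items):
    """Same behaviour, but membership checks go through a set (linear on average)."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:  # average constant-time hash lookup
            seen.add(x)
            out.append(x)
    return out


def time_call(fn, arg):
    """Return (result, elapsed_seconds) for a single call; a hypothetical helper."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


# A small-scale test: both versions pass, so correctness alone cannot separate them.
small_input = [3, 1, 3, 2, 1]
assert dedupe_quadratic(small_input) == dedupe_linear(small_input) == [3, 1, 2]

# A stress input (all-unique values is the worst case for the list-based version).
stress_input = list(range(3000))
slow_result, slow_seconds = time_call(dedupe_quadratic, stress_input)
fast_result, fast_seconds = time_call(dedupe_linear, stress_input)

# Still functionally identical; only the running time differs.
assert slow_result == fast_result
print(f"quadratic: {slow_seconds:.4f}s, linear: {fast_seconds:.4f}s")
```

Both versions would count as correct translations under a small test suite; only the large input exposes the gap, which is the kind of degradation the abstract says small-scale tests tend to miss.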
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Software System Performance and Reliability
