TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao

TL;DR
This paper introduces TRACE, a benchmark for evaluating execution efficiency in LLM-translated code, revealing that correctness does not imply efficiency and that inefficiencies are widespread across models.
Contribution
The paper presents TRACE, the first benchmark to explicitly assess efficiency in LLM-based code translation, and reports insights from an extensive evaluation of 28 representative LLMs.
Findings
23.5% of correct translations are inefficient
Inefficiency is patterned, spanning algorithmic faults and resource issues
Inference-time prompt strategies only modestly improve efficiency
Abstract
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of…
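As an illustration of why stress tests matter, the following minimal Python sketch compares two functionally equivalent implementations that agree on a small test but differ sharply in running time on a larger input. It is not TRACE's actual harness: the task (duplicate detection), the function names, and the input size of 2,000 elements are all assumptions chosen for illustration.

```python
import time


def has_duplicate_quadratic(values):
    """Hypothetical 'translated' version: correct, but O(n^2) comparisons."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicate_linear(values):
    """Hypothetical reference version: O(n) expected time using a set."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def time_call(func, arg):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(arg)
    return result, time.perf_counter() - start


# Small-scale test: both versions agree, so a correctness check passes.
small_input = [3, 1, 4, 1, 5]
small_agree = has_duplicate_quadratic(small_input) == has_duplicate_linear(small_input)

# Stress test (assumed size): no duplicates, so the quadratic version
# must examine every pair before returning False.
stress_input = list(range(2000))
slow_result, slow_time = time_call(has_duplicate_quadratic, stress_input)
fast_result, fast_time = time_call(has_duplicate_linear, stress_input)

print(f"agree on small input: {small_agree}")
print(f"quadratic: {slow_time:.4f}s, linear: {fast_time:.4f}s")
```

Both versions would pass a small functional test, yet only the stress input exposes the gap in running time, which is the kind of degradation the abstract says small-scale tests tend to miss.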
