Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
Fazle Rabbi, Soumit Kanti Saha, Jinqiu Yang

TL;DR
This paper reveals that many failures in LLM-based code translation are due to evaluation errors rather than actual translation mistakes, emphasizing the need for better evaluation standards.
Contribution
It conducts a large-scale empirical study across multiple languages and benchmarks to identify and categorize false negatives caused by evaluation setup issues.
Findings
Many reported failures are evaluation-induced errors, not incorrect logic.
False negatives stem from improper compilation flags, missing library links, and unconfigured runtime environments.
The study highlights the need for transparent, configuration-aware evaluation standards.
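As a rough illustration of the distinction drawn above, the sketch below flags a failed translation as a candidate evaluation-induced failure when its compiler or runtime error output matches a known setup problem. The function name `classify_failure`, the category labels, and the patterns are assumptions made for this example. They are not the paper's taxonomy or tooling, and a match only marks a candidate: the translation still has to be rerun under a corrected configuration to confirm that its logic is right.

```python
import re

# Hypothetical patterns, chosen for illustration only (not the paper's taxonomy).
# Each maps an assumed category label to text that commonly appears in
# compiler, linker, or interpreter error output for that kind of setup problem.
PIPELINE_PATTERNS = {
    # Linker error typical of a missing -l flag (e.g. -lm for the C math library).
    "missing_library_link": re.compile(r"undefined reference to"),
    # Python interpreter cannot find a package that was never installed.
    "missing_module": re.compile(r"ModuleNotFoundError|No module named"),
    # GCC diagnostics that mention the language-standard flag.
    "language_standard_flag": re.compile(r"-std="),
}


def classify_failure(stderr: str) -> str:
    """Return an assumed category label for a failed build or run.

    A label other than "needs_manual_review" marks the failure as a
    candidate false negative, not as a confirmed one.
    """
    for label, pattern in PIPELINE_PATTERNS.items():
        if pattern.search(stderr):
            return label
    return "needs_manual_review"


if __name__ == "__main__":
    samples = [
        "main.c:(.text+0x1f): undefined reference to `sqrt'",
        "ModuleNotFoundError: No module named 'numpy'",
        "error: expected ';' before 'return'",
    ]
    for text in samples:
        print(classify_failure(text), "<-", text)
```

Pattern matching on error text is a deliberately simple choice here. It is easy to audit, but it will miss setup problems that surface only as wrong output or timeouts, which is one reason unmatched failures fall through to manual review rather than being counted as genuine translation errors.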
Abstract
Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures…
