Beyond BLEU: A Semantic Evaluation Method for Code Translation
Julius Näumann, Sven Keidel, Amir Molzam Sharifloo, Mira Mezini

TL;DR
This paper introduces a semantic evaluation method for code translation that assesses correctness based on execution outcomes, revealing limitations of traditional metrics like BLEU.
Contribution
It proposes a new semantic correctness score for code translation evaluation, applying compiler testing principles to better measure functional accuracy.
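As a rough illustration of the score described above (the proportion of translations whose execution outcomes are correct), here is a minimal sketch. It assumes each program can be modeled as a Python callable and that "correct" means matching the reference output on a fixed set of test inputs. The names `outcomes_match` and `semantic_correctness_score` are hypothetical, and this is not the paper's actual harness, which works on compiled binaries and decompiler output.

```python
from typing import Callable, Sequence, Tuple

Program = Callable[[int], int]


def outcomes_match(reference: Program, candidate: Program,
                   inputs: Sequence[int]) -> bool:
    """Return True if candidate yields the reference's output on every input."""
    for value in inputs:
        expected = reference(value)
        try:
            observed = candidate(value)
        except Exception:
            # Assumption: a translation that crashes counts as incorrect.
            return False
        if observed != expected:
            return False
    return True


def semantic_correctness_score(pairs: Sequence[Tuple[Program, Program]],
                               inputs: Sequence[int]) -> float:
    """Fraction of (reference, translation) pairs with matching outcomes."""
    if not pairs:
        raise ValueError("need at least one (reference, translation) pair")
    correct = sum(1 for ref, cand in pairs if outcomes_match(ref, cand, inputs))
    return correct / len(pairs)


# Toy stand-ins for an original program and two candidate translations.
def reference_abs(x: int) -> int:
    return abs(x)


def good_translation(x: int) -> int:
    return x if x >= 0 else -x


def bad_translation(x: int) -> int:
    return x  # wrong for negative inputs


score = semantic_correctness_score(
    [(reference_abs, good_translation), (reference_abs, bad_translation)],
    inputs=[-2, 0, 3],
)
```

In this toy setup one of the two translations agrees with its reference on all inputs, so the score is 0.5. The bad translation may still look textually close to the reference, which is the kind of gap a string-similarity metric cannot see.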
Findings
LLM-based decompilers outperform heuristic methods in semantic correctness.
BLEU scores show low correlation with actual semantic correctness.
Traditional syntactic metrics are inadequate for evaluating code translation quality.
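One way to check how well a similarity metric tracks semantic correctness is to correlate per-sample metric values with pass/fail outcomes. The sketch below computes a plain Pearson coefficient. The summary does not say which correlation statistic the paper uses, so this choice is an assumption, and the sample values are invented purely for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-translation data, NOT taken from the paper:
# a similarity score in [0, 1] and whether the translation executed correctly.
similarity_scores = [0.91, 0.35, 0.88, 0.42, 0.67]
executed_correctly = [0.0, 1.0, 1.0, 0.0, 1.0]

r = pearson(similarity_scores, executed_correctly)
```

A coefficient near zero would indicate that the similarity score carries little information about whether a translation actually behaves correctly.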
Abstract
Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores…
