Beyond BLEU: A Semantic Evaluation Method for Code Translation

Julius Näumann; Sven Keidel; Amir Molzam Sharifloo; Mira Mezini

arXiv:2605.05282·cs.PL·May 8, 2026

TL;DR

This paper introduces a semantic evaluation method for code translation that assesses correctness based on execution outcomes, revealing limitations of traditional metrics like BLEU.

Contribution

The paper proposes a semantic correctness score for evaluating code translation, applying established compiler testing methodology to measure whether translated programs behave correctly when executed.

Findings

01. LLM-based decompilers outperform heuristic methods in semantic correctness.

02. BLEU scores show low correlation with actual semantic correctness.

03. Traditional syntactic metrics are inadequate for evaluating code translation quality.

Abstract

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores…
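The semantic correctness score described in the abstract, the proportion of translations that produce correct execution outcomes, can be illustrated with a minimal Python sketch. Everything here beyond that definition is an assumption made for illustration: the `ExecutionOutcome` record, comparing outcomes by exit code and standard output, and using `None` for a translation that could not be built or run are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExecutionOutcome:
    """Hypothetical record of one program run (an assumption, not from the paper)."""
    exit_code: int
    stdout: str


def semantic_correctness_score(
    reference: Sequence[ExecutionOutcome],
    translated: Sequence[Optional[ExecutionOutcome]],
) -> float:
    """Return the fraction of translations whose outcome matches the reference.

    A `None` entry stands for a translation that failed to compile or run;
    it is counted as incorrect (an assumption of this sketch).
    """
    if len(reference) != len(translated):
        raise ValueError("reference and translated must have the same length")
    if not reference:
        raise ValueError("at least one program is required")
    correct = sum(
        1
        for ref, out in zip(reference, translated)
        if out is not None and out == ref
    )
    return correct / len(reference)


# Four programs: two translations match, one failed to run, one prints
# different output.
reference_runs = [
    ExecutionOutcome(0, "a"),
    ExecutionOutcome(0, "b"),
    ExecutionOutcome(1, ""),
    ExecutionOutcome(0, "d"),
]
translated_runs = [
    ExecutionOutcome(0, "a"),
    None,
    ExecutionOutcome(1, ""),
    ExecutionOutcome(0, "x"),
]
score = semantic_correctness_score(reference_runs, translated_runs)
```

Unlike BLEU, a score of this kind ignores how closely the translated text resembles a reference and depends only on observed behavior, so two textually different translations that run identically receive the same credit.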
