Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

Yanzhi Tian; Cunxiang Wang; Zeming Liu; Heyan Huang; Wenbo Yu; Dawei Song; Jie Tang; Yuhang Guo

arXiv:2601.07338 · cs.CL · April 17, 2026

TL;DR

This paper introduces MENT, a meta-evaluation dataset for non-literal translation with 7,530 human-annotated quality scores, and proposes RATE, a reflective agentic framework that addresses the limitations of traditional MT metrics and LLM-as-a-Judge evaluation.

Contribution

The paper presents MENT, a curated dataset for non-literal translation evaluation, and introduces RATE, a reflective agentic framework that enhances evaluation reliability.

Findings

01. Traditional MT metrics are inaccurate for non-literal translations.
02. LLM-based evaluation methods face knowledge cutoff and inconsistency issues.
03. RATE improves correlation with human judgments by at least 3.2 points.

Abstract

Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are now applied to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, which makes MT metrics inaccurate. To systematically investigate the reliability of MT metrics, we first curate MENT, a meta-evaluation dataset focused on non-literal translations. MENT covers four non-literal translation domains and pairs source sentences with translations from diverse MT systems, with 7,530 human-annotated scores of translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation…
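
Meta-evaluation of the kind described in the abstract generally means checking how well a metric's scores agree with human scores on the same translations. The sketch below illustrates that idea with two common agreement statistics, Pearson correlation and Kendall's tau-a. It is not the paper's evaluation code: this page does not state which statistic MENT uses, the function names are hypothetical, and the scores are made-up toy values.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data (assumption): human quality scores for five translations,
# plus scores from two hypothetical automatic metrics.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [0.2, 0.4, 0.6, 0.8, 1.0]  # ranks translations like the humans do
metric_b = [0.9, 0.1, 0.5, 0.3, 0.7]  # ranking unrelated to the human one

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(pearson(human, scores), 3), round(kendall_tau_a(human, scores), 3))
```

A metric that tracks human judgment scores near 1 on both statistics, while one that ranks translations unlike the annotators scores near 0 or below.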

Code & Models

Repositories

BITHLP/RATE
github

Datasets

yztian/MENT
dataset· 39 dl