LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods
Hyang Cui

TL;DR
This paper introduces a generation-based approach to machine translation quality estimation: an LLM generates a reference translation, and sentence-embedding similarity to that reference serves as the quality score. The approach outperforms direct LLM scoring and supports hybrid evaluation strategies.
Contribution
It proposes a generation-based evaluation paradigm in which decoder-only LLMs generate high-quality references and sentence embeddings provide the semantic similarity score, improving MTQE accuracy.
Findings
Generation-based evaluation outperforms intra-LLM direct scoring baselines.
The method surpasses external non-LLM reference-free metrics in correlation with human judgments.
Evaluation across 8 LLMs and 8 language pairs confirms its effectiveness.
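As a rough illustration of what "correlation with human judgments" means at the segment level, the sketch below computes Kendall's tau-b between metric scores and human scores for the same segments. The choice of Kendall's tau-b is an assumption made for illustration; this summary does not state which correlation coefficient the paper reports, and the function name `kendall_tau_b` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between two equally long score lists (handles ties)."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_metric_only = tied_human_only = 0

    # Compare every pair of segments once.
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        dm, dh = m1 - m2, h1 - h2
        if dm == 0 and dh == 0:
            continue  # tied in both lists: contributes to neither factor
        if dm == 0:
            tied_metric_only += 1
        elif dh == 0:
            tied_human_only += 1
        elif (dm > 0) == (dh > 0):
            concordant += 1
        else:
            discordant += 1

    # Pairs not tied in the metric, and pairs not tied in the human scores.
    untied_metric = concordant + discordant + tied_human_only
    untied_human = concordant + discordant + tied_metric_only
    denominator = sqrt(untied_metric * untied_human)
    if denominator == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return (concordant - discordant) / denominator
```

A value near 1 means the metric ranks segments in nearly the same order as the human annotators; a value near 0 means the two rankings are largely unrelated.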
Abstract
Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
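The two-step pipeline described in the abstract (generate a reference, then score semantic similarity) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_reference` stands for any call to a decoder-only LLM, `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in for a sentence encoder: lowercase bag-of-words counts.

    A real setup would call a sentence-embedding model here instead.
    """
    return dict(Counter(re.findall(r"\w+", sentence.lower())))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def generation_based_score(
    source: str,
    hypothesis: str,
    generate_reference: Callable[[str], str],
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    """Score a translation without a human reference.

    Step 1: ask the LLM (assumed callable) to translate the source itself.
    Step 2: compare that generated reference with the hypothesis in
            embedding space and return the similarity as the quality score.
    """
    reference = generate_reference(source)
    return cosine(embed(reference), embed(hypothesis))


def fake_llm(source: str) -> str:
    """Hypothetical LLM stub that returns a canned translation."""
    canned = {"Die Katze sitzt auf der Matte.": "The cat sits on the mat."}
    return canned[source]


source_sentence = "Die Katze sitzt auf der Matte."
good = generation_based_score(source_sentence, "The cat sat on the mat.", fake_llm)
bad = generation_based_score(source_sentence, "A dog runs outside.", fake_llm)
```

In this toy setup `good` comes out higher than `bad` because the first hypothesis shares most of its words with the generated reference. With a real sentence encoder the comparison would capture paraphrases as well, which is the point of scoring in embedding space rather than by word overlap.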
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
