LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods

Hyang Cui

arXiv:2505.16129·cs.CL·May 23, 2025

TL;DR

This paper introduces a generation-based approach to machine translation quality estimation (MTQE) with large language models. The approach outperforms both direct LLM scoring and external non-LLM reference-free metrics, and it supports hybrid evaluation strategies.

Contribution

The paper proposes a generation-based evaluation paradigm in which decoder-only LLMs generate high-quality references and sentence embeddings then provide semantic similarity scoring, improving on direct-scoring MTQE baselines.

Findings

01 Generation-based evaluation outperforms direct scoring methods.

02 The method surpasses external non-LLM reference-free metrics in correlation with human judgments.

03 Evaluation across 8 LLMs and 8 language pairs confirms its effectiveness.
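To make "correlation with human judgments" concrete, here is a minimal sketch of a segment-level rank correlation between metric scores and human scores. This page does not say which correlation statistic the paper reports, so the use of Kendall's tau (the simple tau-a variant, where ties count as neither concordant nor discordant) is an assumption, and the function name `kendall_tau_a` is illustrative only.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric_scores: Sequence[float],
                  human_scores: Sequence[float]) -> float:
    """Kendall's tau-a between two equally long score lists.

    Assumption: tau-a is used only as an illustration; the statistic
    actually reported in the paper is not stated on this page.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two segments")

    concordant = 0
    discordant = 0
    # Compare every pair of segments: do the metric and the humans
    # rank the two segments in the same order?
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        sign = (m1 - m2) * (h1 - h2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in one list; it counts for neither side.

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs
```

A value near 1 means the metric ranks segments the way human annotators do; a value near 0 means the two rankings are essentially unrelated.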

Abstract

Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
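The two-step pipeline described in the abstract (generate a reference with an LLM, then score semantic similarity with sentence embeddings) can be sketched roughly as below. This is not the authors' implementation: `generation_based_score`, `toy_embed`, and `stub_generate_reference` are hypothetical names, the bag-of-words embedding is a stand-in for a real sentence-embedding model, the stub replaces a real decoder-only LLM call, and the choice of cosine similarity between the generated reference and the translation under evaluation is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Embedding = Dict[str, float]


def toy_embed(text: str) -> Embedding:
    """Stand-in for a sentence-embedding model (assumption):
    a sparse bag-of-words count vector over lowercased tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def generation_based_score(
    source: str,
    hypothesis: str,
    generate_reference: Callable[[str], str],
    embed: Callable[[str], Embedding] = toy_embed,
) -> float:
    """Step 1: ask a generator (an LLM in the paper) to translate the source.
    Step 2: score the hypothesis by its similarity to that generated reference."""
    reference = generate_reference(source)
    return cosine(embed(reference), embed(hypothesis))


def stub_generate_reference(source: str) -> str:
    """Hypothetical stand-in for an LLM translation call."""
    canned = {"Der Hund schläft.": "The dog is sleeping."}
    return canned[source]
```

With the stub above, `generation_based_score("Der Hund schläft.", "The dog is sleeping.", stub_generate_reference)` scores higher than the same call with the hypothesis `"The cat is eating."`. In practice the stub and the toy embedding would each be swapped for a real model; the function signature is designed so that either can be replaced without touching the other.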

Code & Models

Repositories

cuiniki/llms-are-not-scorers
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification