Learning Evaluation Models from Large Language Models for Sequence Generation

Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, and Jingbo Zhu

arXiv:2308.04386 · cs.CL · June 27, 2025 · 1 citation

TL;DR

This paper introduces CSEM, a novel method for training sequence evaluation models using large language models to generate labeled data, eliminating the need for human annotations and improving evaluation accuracy across various scenarios.

Contribution

The paper proposes CSEM, a three-stage training approach that leverages large language models for data generation, supporting diverse evaluation types and enhancing sequence quality assessment.

Findings

01 CSEM effectively trains evaluation models without human-labeled data.

02 Metrics developed via CSEM outperform traditional metrics in sequence quality.

03 CSEM improves evaluation accuracy in reinforcement learning and reranking tasks.

Abstract

Automatic evaluation of sequence generation, traditionally reliant on metrics like BLEU and ROUGE, often fails to capture the semantic accuracy of generated text sequences because these metrics emphasize n-gram overlap. A promising solution to this problem is to develop model-based metrics, such as BLEURT and COMET. However, these approaches are typically hindered by the scarcity of labeled evaluation data, which is necessary to train the evaluation models. In this work, we address this challenge by proposing the Customized Sequence Evaluation Metric (CSEM), a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development, thereby eliminating the need for human-labeled data. Additionally, we expand the scope of CSEM to support various evaluation types, including single-aspect, multi-aspect, reference-free, and…
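
The abstract's point about n-gram overlap can be illustrated with a small sketch. The code below is not BLEU, ROUGE, or anything from the CSEM repository. It is a simplified clipped n-gram precision written for illustration, and the function names and example sentences are assumptions made up here. It compares a paraphrase (similar meaning, different words) and a scrambled sentence (different meaning, same words) against one reference.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped so a candidate n-gram is credited at most as many
    times as it appears in the reference. Hypothetical helper, whitespace
    tokenization only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Illustrative sentences (assumptions, not taken from the paper).
reference = "the cat sat on the mat"
paraphrase = "a feline was sitting on the rug"  # similar meaning, different words
scrambled = "the mat sat on the cat"            # different meaning, same words

for name, cand in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

Under this toy measure the scrambled sentence scores far higher than the paraphrase, even though only the paraphrase preserves the meaning of the reference. This is the kind of gap that model-based metrics are meant to close.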

Code & Models

Repositories

wangclnlp/csem
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques