GPTScore: Evaluate as You Desire
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu

TL;DR
GPTScore introduces a flexible, instruction-based evaluation framework leveraging pre-trained models' emergent abilities to assess generated texts across multiple facets without requiring annotated data.
Contribution
The paper presents GPTScore, a novel evaluation method that uses pre-trained models' zero-shot capabilities for customizable, multi-dimensional text quality assessment.
Findings
Effective evaluation across four text generation tasks
Ability to customize evaluation criteria via natural language instructions
No need for annotated datasets for different evaluation aspects
Abstract
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., FLAN-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Computational Physics and Python Applications
