PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

Christoph Leiter; Steffen Eger

arXiv:2406.18528·cs.CL·November 19, 2024

TL;DR

This paper runs a large-scale evaluation of prompts for open-source LLM-based metrics on machine translation and summarization, analyzing prompt stability, variability, and best practices across more than 6.6 million evaluations.

Contribution

It introduces PrExMe, a large-scale exploration of more than 720 prompt templates for open-source LLM-based evaluation metrics, yielding insights into prompt stability and the effects of different prompting strategies.

Findings

01

Prompt stability varies across models and prompts.

02

Small prompt modifications can significantly alter rankings.

03

Some models prefer textual labels while others favor numeric scores.

Abstract

Large language models (LLMs) have revolutionized NLP research. Notably, in-context learning enables their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale Prompt Exploration for Metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) benchmarks recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable. For instance, some LLMs show idiosyncratic preferences and prefer to grade generated texts with textual labels while others prefer to return…

Code & Models

Repositories

gringham/prexme (Official)

Videos

PrExMe: Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling