PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter, Steffen Eger

TL;DR
This paper presents a large-scale evaluation of prompts for open-source LLM-based machine translation and summarization metrics, analyzing prompt stability, variability, and best practices across more than 6.6 million evaluations.
Contribution
It introduces PrExMe, a large-scale exploration of more than 720 prompt templates for open-source LLM-based evaluation metrics, revealing insights into prompt stability and the effects of different prompting strategies.
Findings
Prompt stability varies across models and prompts.
Small prompt modifications can significantly alter rankings.
Some models prefer textual labels while others favor numeric scores.
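To make the findings above more concrete, the following is a minimal sketch of how a grid of prompt templates could be assembled and how replies in different output formats (textual labels versus numeric scores) could be mapped onto a common scale. It is not the authors' implementation: the base prompts, the output-format instructions, the label-to-score mapping, and the function names `build_templates` and `parse_output` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical building blocks; the paper's actual templates are not reproduced here.
BASE_PROMPTS = {
    "zero_shot": "Judge the quality of the following {task}.",
    "cot": "Judge the quality of the following {task}. Think step by step before answering.",
}
OUTPUT_FORMATS = {
    "numeric_0_100": "Return a score from 0 to 100.",
    "labels": "Return one of: bad, neutral, good.",
}
# Assumed mapping from textual labels to a common [0, 1] scale.
LABEL_TO_SCORE = {"bad": 0.0, "neutral": 0.5, "good": 1.0}


def build_templates(task):
    """Combine every base prompt with every output-format instruction."""
    templates = {}
    for (base_name, base_text), (fmt_name, fmt_text) in product(
        BASE_PROMPTS.items(), OUTPUT_FORMATS.items()
    ):
        templates[(base_name, fmt_name)] = base_text.format(task=task) + " " + fmt_text
    return templates


def parse_output(raw, format_name):
    """Map a raw model reply to a float in [0, 1], or None if it cannot be parsed."""
    text = raw.strip().lower()
    if format_name == "labels":
        return LABEL_TO_SCORE.get(text)
    try:
        value = float(text)
    except ValueError:
        return None
    # Reject numbers outside the requested 0-100 range.
    return value / 100.0 if 0.0 <= value <= 100.0 else None


if __name__ == "__main__":
    for key, template in build_templates("translation").items():
        print(key, "->", template)
```

Because each template is the product of independent components, changing a single component (for example the requested output format) yields a new template whose scores can be compared against the others under otherwise identical conditions.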
Abstract
Large language models (LLMs) have revolutionized NLP research. Notably, in-context learning enables their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale Prompt Exploration for Metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) benchmarks recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable. For instance, some LLMs show idiosyncratic preferences and favor to grade generated texts with textual labels while others prefer to return…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
