*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

TL;DR
This paper introduces *-PLUIE, a personalized, efficient LLM-based evaluation metric for generated text that aligns well with human judgments and reduces computational costs.
Contribution
It develops task-specific prompting variants of ParaPLUIE, enhancing alignment with human ratings and improving evaluation efficiency.
Findings
*-PLUIE achieves higher correlation with human judgments.
It maintains low computational cost compared to existing LLM-judge methods.
Personalized prompts improve evaluation accuracy.
Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
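The idea of scoring confidence over "Yes/No" answers without generating text can be illustrated with a minimal sketch. It assumes the judge model exposes next-token logits at the answer position and that the score is the difference between the log-probabilities of "Yes" and "No". The abstract does not give the exact ParaPLUIE formula, so this scoring rule, the function names, and the toy logits are all illustrative assumptions rather than the authors' implementation.

```python
import math


def log_softmax(logits):
    """Convert a {token: logit} mapping into {token: log-probability}."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def yes_no_score(next_token_logits, yes="Yes", no="No"):
    """Hypothetical confidence score: log p(Yes) - log p(No).

    Positive values mean the judge leans towards "Yes", negative values
    towards "No". A single forward pass is enough and no text is decoded,
    so there is no generated answer to post-process.
    """
    logp = log_softmax(next_token_logits)
    return logp[yes] - logp[no]


# Toy logits standing in for a real model's output at the answer position.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
score = yes_no_score(toy_logits)
print(f"score = {score:.3f}")
```

Under this assumption, a task-specific variant would change only the prompt that produces the logits, not the scoring step.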
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Text Readability and Simplification
