*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle; Léane Jourdan; Daisy Munson; Pierre Alain; Jonathan Chevelu; Arnaud Delhay; Damien Lolive

arXiv:2602.15778·cs.CL·February 18, 2026

TL;DR

This paper introduces *-PLUIE, a personalized, efficient LLM-based evaluation metric for generated text that aligns well with human judgments and reduces computational costs.

Contribution

It develops task-specific prompting variants of ParaPLUIE, enhancing alignment with human ratings and improving evaluation efficiency.

Findings

01

*-PLUIE achieves higher correlation with human judgments.

02

It maintains low computational cost compared to existing LLM-judge methods.

03

Personalized prompts improve evaluation accuracy.

Abstract

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
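The idea of scoring "Yes/No" answers without generating text can be illustrated with a minimal sketch. The abstract does not give the exact ParaPLUIE formula, so the log-odds below are an assumption about one plausible way to turn next-token log-probabilities into a confidence score. The function names and the example log-probabilities are hypothetical, and no real model is called.

```python
import math


def yes_no_score(next_token_logprobs, yes="Yes", no="No"):
    """Log-odds of the `yes` answer over the `no` answer.

    `next_token_logprobs` maps candidate answer tokens to the
    log-probability a model assigns them right after the judge prompt.
    Assumption: this is an illustrative score, not the paper's formula.
    """
    return next_token_logprobs[yes] - next_token_logprobs[no]


def yes_probability(score):
    """Convert the log-odds into a probability over the two answers only."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical log-probabilities for the token following the prompt.
logprobs = {"Yes": math.log(0.6), "No": math.log(0.2), "Maybe": math.log(0.2)}

score = yes_no_score(logprobs)       # positive means "Yes" is favoured
confidence = yes_probability(score)  # renormalised over Yes/No only
print(f"score={score:.4f}, P(Yes | Yes or No)={confidence:.2f}")
```

Because the score is read from a single forward pass over the prompt, there is no generated answer to parse, which is consistent with the abstract's claim that no post-processing is needed.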

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Text Readability and Simplification