PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Minki Hong; Eunsoo Lee; Sohyun Park; Jihie Kim

arXiv:2603.10477·cs.CL·April 9, 2026

TL;DR

PEEM is a rubric-based framework that jointly evaluates prompts and LLM responses along 9 axes, pairing 1-5 scores with natural-language rationales that explain why a prompt succeeds or fails and can guide prompt rewriting.

Contribution

A unified 9-axis rubric (3 prompt criteria, 6 response criteria) applied by an LLM-based evaluator, which assesses prompts and responses jointly and returns criterion-specific rationales as actionable feedback.

Findings

01

PEEM's accuracy axis strongly aligns with conventional accuracy metrics while preserving model rankings.

02

The framework captures diverse linguistic failure modes.

03

Prompt rewriting with PEEM scores improves downstream accuracy by up to 11.7 points.

Abstract

Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate…
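The rubric described in the abstract can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the axis names come from the abstract, but the identifiers (RUBRIC, AxisJudgment, validate_evaluation, weakest_axes) and the input dictionary layout are hypothetical, and picking out the lowest-scoring axes is only one assumed way to turn the scores into a diagnostic.

```python
from dataclasses import dataclass

# Axis names follow the abstract; the identifiers and grouping keys are illustrative.
RUBRIC = {
    "prompt": ("clarity_structure", "linguistic_quality", "fairness"),
    "response": ("accuracy", "coherence", "relevance",
                 "objectivity", "clarity", "conciseness"),
}


@dataclass(frozen=True)
class AxisJudgment:
    score: int      # 1-5 Likert score
    rationale: str  # criterion-specific explanation from the evaluator


def validate_evaluation(raw: dict) -> dict:
    """Check a raw evaluator output against the 9-axis rubric.

    Assumed (hypothetical) input layout:
    {"prompt": {axis: {"score": int, "rationale": str}}, "response": {...}}
    """
    validated = {}
    for group, axes in RUBRIC.items():
        entries = raw.get(group, {})
        missing = [axis for axis in axes if axis not in entries]
        if missing:
            raise ValueError(f"{group}: missing axes {missing}")
        validated[group] = {}
        for axis in axes:
            score = entries[axis]["score"]
            rationale = entries[axis]["rationale"].strip()
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if not is_int or not 1 <= score <= 5:
                raise ValueError(f"{group}.{axis}: score must be an integer in 1-5")
            if not rationale:
                raise ValueError(f"{group}.{axis}: rationale is empty")
            validated[group][axis] = AxisJudgment(score, rationale)
    return validated


def weakest_axes(evaluation: dict) -> list:
    """Return the (group, axis) pairs that share the lowest score."""
    flat = [(group, axis, judgment.score)
            for group, axes in evaluation.items()
            for axis, judgment in axes.items()]
    lowest = min(score for _, _, score in flat)
    return [(group, axis) for group, axis, score in flat if score == lowest]
```

Under these assumptions, a caller would pass the parsed evaluator output to validate_evaluation and then read the rationales attached to the axes returned by weakest_axes to decide what to change in the prompt.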
