PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim

TL;DR
PEEM introduces a comprehensive, interpretable evaluation framework for prompts and responses in large language models, enhancing diagnostic insights and guiding prompt optimization.
Contribution
It proposes a unified rubric with LLM-based evaluators for joint prompt and response assessment, improving interpretability and actionable feedback.
Findings
PEEM scores strongly align with traditional accuracy metrics.
The framework captures diverse linguistic failure modes.
Prompt rewriting with PEEM scores improves downstream accuracy by up to 11.7 points.
Abstract
Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
