When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications
Daniel Commey

TL;DR
This paper introduces an evaluation-driven workflow and a tiered evaluation suite for improving and diagnosing large language model applications, emphasizing careful prompt iteration over universal prompts.
Contribution
It presents a structured, repeatable evaluation process and the Minimum Viable Evaluation Suite (MVES) for LLM applications, addressing the challenges of stochastic outputs and sensitivity to prompt changes.
Findings
Generic prompt templates can reduce extraction pass rates and RAG compliance.
Evaluation-driven prompt iteration improves instruction-following.
Careful claim calibration matters more than universal prompt recipes.
Abstract
Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction…
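The Define, Test, Diagnose, Fix loop described in the abstract can be sketched as a small regression harness. This is an illustrative sketch only, not the paper's implementation: the `Case` type, the helper names, the two prompt templates, and `stub_model` (a deterministic stand-in for a real model call) are all hypothetical, and the stub's behavior is contrived to show the mechanics, not to reproduce the reported results.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """Define: one test case with an automated check on the raw output."""
    name: str
    input_text: str
    check: Callable[[str], bool]


def json_with_keys(keys: List[str]) -> Callable[[str], bool]:
    """Build a check that passes only if the output is a JSON object with all keys."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check


def run_suite(model: Callable[[str], str], template: str,
              cases: List[Case]) -> Dict[str, bool]:
    """Test: run every case through the model and record pass/fail."""
    return {c.name: c.check(model(template.format(input=c.input_text)))
            for c in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Diagnose: cases that passed under the baseline but fail under the candidate."""
    return sorted(n for n, ok in baseline.items() if ok and not candidate.get(n, False))


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (assumption: no real model is used).

    It emits bare JSON only when the prompt asks for it, and otherwise wraps
    the JSON in prose, which breaks strict parsing.
    """
    payload = json.dumps({"name": "Ada", "year": 1843})
    if "Return only JSON" in prompt:
        return payload
    return "Sure! Here is the result: " + payload


# Hypothetical prompt variants: a task-specific baseline and a generic "improved" one.
BASELINE = "Return only JSON with keys name and year.\nText: {input}"
CANDIDATE = "You are a helpful expert. Explain your answer carefully.\nText: {input}"

CASES = [
    Case("extract_basic", "Ada wrote her notes in 1843.", json_with_keys(["name", "year"])),
    Case("extract_noisy", "In 1843 (approx.), Ada published.", json_with_keys(["name", "year"])),
]

baseline_results = run_suite(stub_model, BASELINE, CASES)
candidate_results = run_suite(stub_model, CANDIDATE, CASES)
print("baseline pass rate:", pass_rate(baseline_results))
print("candidate pass rate:", pass_rate(candidate_results))
print("regressions:", regressions(baseline_results, candidate_results))
```

The Fix step is then a prompt change that targets the listed regressions, followed by a rerun of the same suite. With a real stochastic model, each case would likely need several samples per prompt before two pass rates can be compared.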
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
