When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Daniel Commey

arXiv:2601.22025·cs.CL·January 30, 2026

TL;DR

This paper introduces an evaluation-driven workflow and a tiered evaluation suite for diagnosing and improving large language model applications, and argues for evaluation-driven prompt iteration rather than universal prompt recipes.

Contribution

It presents a structured, repeatable evaluation process and the Minimum Viable Evaluation Suite (MVES) for LLMs, addressing challenges of stochastic outputs and prompt sensitivity.

Findings

01. Generic prompt templates can reduce extraction pass rates and RAG compliance.

02. Evaluation-driven prompt iteration improves instruction-following.

03. Careful claim calibration is preferable to universal prompt recipes.

Abstract

Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction…
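The Test step of the loop above can be illustrated with a small automated check that compares two prompt variants on the same suite. This is a minimal sketch, not the paper's harness: `stub_model`, `CASES`, the required keys, and both prompt strings are hypothetical stand-ins, and the stub is deliberately deterministic so that the example runs without a local model. In practice the `model` callable would wrap a real inference call, and each case would be sampled several times because outputs are stochastic.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """Hypothetical test case: input text plus keys the JSON output must contain."""
    text: str
    required_keys: List[str]


def is_valid_extraction(output: str, required_keys: List[str]) -> bool:
    """Automated check: output must parse as a JSON object with every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def pass_rate(model: Callable[[str, str], str], prompt: str, cases: List[Case]) -> float:
    """Fraction of cases whose output passes the automated check."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    passed = sum(
        is_valid_extraction(model(prompt, c.text), c.required_keys) for c in cases
    )
    return passed / len(cases)


def compare_prompts(
    model: Callable[[str, str], str], baseline: str, candidate: str, cases: List[Case]
) -> Dict[str, object]:
    """Run both prompts on the same suite and flag a regression."""
    base = pass_rate(model, baseline, cases)
    cand = pass_rate(model, candidate, cases)
    return {"baseline": base, "candidate": cand, "regressed": cand < base}


def stub_model(prompt: str, text: str) -> str:
    """Toy stand-in for an LLM (assumption): wraps the JSON in prose when asked to explain."""
    payload = json.dumps({"name": text.split()[0], "length": len(text)})
    if "explain" in prompt.lower():
        return "Sure! Here is the result: " + payload
    return payload


CASES = [
    Case("Alice ordered two books", ["name", "length"]),
    Case("Bob cancelled", ["name", "length"]),
]

report = compare_prompts(
    stub_model,
    "Return only JSON.",
    "Think carefully, explain your reasoning, then return JSON.",
    CASES,
)
```

With this toy stub, the candidate prompt fails the strict JSON check on every case while the baseline passes, which mirrors the kind of trade-off the abstract describes without reproducing any of its measurements.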

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research