Evaluating Ill-Defined Tasks in Large Language Models
Yi Zhou, Basel Shbita

TL;DR
This paper critically examines the challenges of evaluating large language models on ill-defined tasks, highlighting the limitations of current benchmarks and offering insights toward more robust evaluation methods.
Contribution
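The paper analyzes two case studies, Complex Instruction Following (CIF) and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), to identify recurring issues in evaluating ill-defined tasks, and points to the need for more interpretable and reliable evaluation approaches.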
Findings
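Current evaluations often conflate distinct failure modes, and their scores are unstable under instruction rephrasing and LLM-based judging.
Existing benchmarks offer limited coverage of real-world instruction complexity, and their metrics are inconsistent and not comparable.
Multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores.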
Abstract
Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are…
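The point about conflated failure modes can be illustrated with a minimal sketch. The facet names below are hypothetical and chosen only for illustration; this summary does not list the paper's actual NL2Mermaid criteria, and the unweighted mean is an assumed aggregation, not the authors' metric. The sketch shows two model outputs that receive the same aggregate score while failing for different reasons.

```python
from statistics import mean

# Hypothetical facets for a generated Mermaid sequence diagram.
# These names are assumptions for illustration, not the paper's criteria.
FACETS = ("syntax_valid", "participants_correct", "message_order_correct")


def aggregate(scores: dict[str, float]) -> float:
    """Collapse per-facet scores into one number (unweighted mean)."""
    return mean(scores[facet] for facet in FACETS)


def failing_facets(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the facets whose score falls below the threshold."""
    return [facet for facet in FACETS if scores[facet] < threshold]


# Two illustrative outputs: one does not parse, the other orders messages wrongly.
output_a = {"syntax_valid": 0.0, "participants_correct": 1.0, "message_order_correct": 1.0}
output_b = {"syntax_valid": 1.0, "participants_correct": 1.0, "message_order_correct": 0.0}

for name, scores in (("A", output_a), ("B", output_b)):
    print(name, round(aggregate(scores), 3), failing_facets(scores))
```

Both outputs share one aggregate score, so a leaderboard built on that number alone could not separate a syntax failure from an ordering failure; the per-facet view can.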
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education
