TaskEval: Synthesised Evaluation for Foundation-Model Tasks
Dilani Widanapathiranage, Scott Barnett, Stefanus Kurniawan, Wannita Takerngsaksiri

TL;DR
TaskEval introduces a novel, task-agnostic approach to synthesise custom evaluators for foundation-model tasks, enabling automated, human-in-the-loop assessment without relying on existing datasets or metrics.
Contribution
The paper proposes a meta-model, an interaction protocol, and an eval synthesiser that together create task-specific evaluators, addressing the challenge of evaluating FM outputs in diverse applications.
Findings
Achieves 93% and 90% accuracy in eval quality on the two FM tasks studied.
Demonstrates effectiveness on chart data extraction and document QA tasks.
Provides a flexible framework for automating FM output evaluation.
Abstract
Hallucinations are a key concern when creating applications that rely on foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval…
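The idea of synthesising an evaluator from a task description can be illustrated with a minimal sketch. Nothing here is taken from the paper: the `TaskDescription` fields, the `EVAL_REGISTRY` mapping, the two toy evals, and the `synthesise_evaluator` function are all hypothetical names chosen for illustration, and the human-feedback step is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TaskDescription:
    """Hypothetical stand-in for a meta-model instance describing one FM task."""
    name: str
    input_kind: str   # assumed property, e.g. "image" or "document"
    output_kind: str  # assumed property, e.g. "table" or "text"


# An eval maps (model_output, reference) to a score in [0, 1].
Eval = Callable[[str, str], float]


def exact_match(output: str, reference: str) -> float:
    """Score 1.0 only when output and reference agree after trimming whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def token_overlap(output: str, reference: str) -> float:
    """Fraction of distinct reference tokens that also appear in the output."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(out_tokens & ref_tokens) / len(ref_tokens) if ref_tokens else 0.0


# Hypothetical registry: which evals are considered for each output kind.
EVAL_REGISTRY: Dict[str, List[Eval]] = {
    "table": [exact_match],
    "text": [exact_match, token_overlap],
}


def synthesise_evaluator(task: TaskDescription) -> Callable[[str, str], Dict[str, float]]:
    """Build an evaluator that runs every registered eval for the task's output kind."""
    selected = EVAL_REGISTRY.get(task.output_kind, [])

    def evaluator(output: str, reference: str) -> Dict[str, float]:
        return {fn.__name__: fn(output, reference) for fn in selected}

    return evaluator


if __name__ == "__main__":
    qa_task = TaskDescription("document_qa", "document", "text")
    evaluate = synthesise_evaluator(qa_task)
    print(evaluate("Paris is the capital", "Paris is the capital of France"))
```

In this sketch the task description alone decides which evals run, so the same code path covers both a text-producing and a table-producing task. A fuller version would presumably route low-scoring or ambiguous outputs to a reviewer, which is where an interaction protocol would apply.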
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Mental Health via Writing
