TaskEval: Synthesised Evaluation for Foundation-Model Tasks

Dilani Widanapathiranage; Scott Barnett; Stefanus Kurniawan; Wannita Takerngsaksiri

arXiv:2512.04442·cs.AI·December 8, 2025

Open Access

TL;DR

TaskEval introduces a novel, task-agnostic approach to synthesise custom evaluators for foundation-model tasks, enabling automated, human-in-the-loop assessment without relying on existing datasets or metrics.

Contribution

The paper proposes a task-agnostic meta-model, an interaction protocol, and an eval synthesiser that together produce task-specific evaluators, addressing the challenge of evaluating FM outputs across diverse applications.
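To make the idea of synthesising an evaluator from a task description more concrete, here is a minimal sketch. It is not the paper's implementation: `TaskSpec`, `Check`, `synthesise_evaluator`, and the two example checks are hypothetical names chosen for illustration, and the rule that any failed check is routed to a human reviewer is an assumption about how automation and human feedback might be combined.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical names throughout: TaskSpec, Check, and synthesise_evaluator
# are illustrative only and are not taken from the TaskEval paper.
Check = Callable[[str, str], bool]  # (task_input, fm_output) -> passed?


@dataclass
class TaskSpec:
    """Assumed task description: a task name plus named output checks."""
    name: str
    checks: Dict[str, Check] = field(default_factory=dict)


def synthesise_evaluator(spec: TaskSpec) -> Callable[[str, str], dict]:
    """Build an evaluator function from a task description."""
    def evaluate(task_input: str, fm_output: str) -> dict:
        # Run every named check against this input/output pair.
        results = {label: check(task_input, fm_output)
                   for label, check in spec.checks.items()}
        failed = [label for label, ok in results.items() if not ok]
        # Assumption: any failed check sends the item to a human reviewer.
        return {"task": spec.name, "results": results,
                "failed": failed, "needs_human_review": bool(failed)}
    return evaluate


# Toy document-QA task with two deliberately simple checks.
doc_qa = TaskSpec(
    name="document_qa",
    checks={
        "non_empty": lambda _inp, out: bool(out.strip()),
        "grounded": lambda inp, out: out.strip().lower() in inp.lower(),
    },
)
evaluate = synthesise_evaluator(doc_qa)
report = evaluate("Invoice total: 42 USD", "42 USD")
```

In this toy setup an answer that appears verbatim in the source document passes both checks, while an answer such as "57 USD" fails the `grounded` check and is flagged for review. A real evaluator would need far richer checks than substring matching.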

Findings

01 Reported eval accuracy of 93% and 90% on two FM tasks.

02 Demonstrated the approach on chart data extraction and document question answering.

03 Provides a flexible framework for automating the evaluation of FM outputs.

Abstract

Hallucinations are a key concern when creating applications that rely on foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval…

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Mental Health via Writing