Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Penny Chong; Harshavardhan Abichandani; Jiyuan Shen; Atin Ghosh; Min Pyae Moe; Yifan Mai; Daniel Dahlmeier

arXiv:2603.15483·cs.AI·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper presents the TED framework for comprehensive agent evaluation, incorporating user roles, conversation quality, and automated error diagnosis to improve understanding and performance of conversational agents.

Contribution

The TED framework introduces a unified, user-aware evaluation approach with automated error analysis, advancing beyond traditional correctness metrics.

Findings

01

Reveals new insights into agent performance across models and user expertise levels.

02

Achieves 8-10% improvements on proposed metrics after error-based refinements.

03

Automates error diagnosis to provide actionable feedback for agent enhancement.

Abstract

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role nor expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 10 · Confidence 4

Strengths

1) The general method is clearly described, and the authors appear to take a principled approach to designing their metrics. 2) It is interesting to see metrics like meanProg give more detail on LLM performance in agent settings, and uncovering those errors is important. 3) It was good to see the authors use the discovered errors to improve the agents. This leaves future work on how to better incorporate the learnings from their error analysis.

Weaknesses

1) Both datasets are in the task-oriented domain. Given that these metrics seem to be domain-agnostic, it would be good to see how they translate to other domains. 2) I understand that these metrics can give more insight into an agent, but it is still unclear to me whether they correlate with user experience. Most work on evaluation metrics shows some correlation between the two.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper tackles a timely and important challenge in evaluating conversational or task-oriented agents, where the stochastic and open-ended nature of responses makes it difficult to define reliable ground-truths. I particularly appreciate the exploration of subgoal decomposition and the use of grading notes, which offer a valuable and interpretable approach to fine-grained agent evaluation.

Weaknesses

My main concerns lie in the soundness and clarity of the paper. In particular, Section 3.2—one of the core components—raises questions about methodological soundness, which in turn affects my confidence in the validity of the experimental design. Beyond this, the motivations behind several key design choices are insufficiently explained, and multiple mathematical notations are either ambiguous or undefined. Further details are provided in the Questions/Concerns section.

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. The general sentiment that superficial metrics are insufficient to allow practitioners to build better agents seems well justified. 2. The proposed framework seems sensible, and follows common practices in building agent benchmarks (i.e. look at the data).

Weaknesses

1. The writing of this paper could be clearer; parts of the paper are difficult to follow. In particular, the results section could benefit from clearer structuring, e.g. numbering and clearly separating individual insights/observations. 2. The motivation for the combined three-part approach is unclear to me. Whilst each part on its own seems sensible enough, it's unclear to me why the parts need to be introduced together. It's not clear that one part really depends on the other. This disconnec

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Multimodal Machine Learning Applications