Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Emi Soroka; Tanmay Chopra; Krish Desai; Sanjay Lall

arXiv:2511.03047 · cs.LG · November 6, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces unsupervised metrics for evaluating multi-turn, objective-driven interactions involving large language models, addressing challenges of data complexity, lack of labels, and unreliable human judgments.

Contribution

It presents the first set of unsupervised evaluation metrics that leverage statistical properties and fine-tuned LLMs to assess goal achievement without human annotations.

Findings

01

Metrics effectively label user goals and measure goal completion.

02

The approach adapts to distributional shifts in interaction data.

03

Validation shows reliable evaluation across open-domain and task-specific interactions.

Abstract

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Novelty and Practical Impact: The core idea of moving beyond LLM judges and human references to unsupervised, statistically-grounded metrics is highly novel and addresses a significant pain point in real-world AI system development. The potential for online monitoring and resource-saving interventions is a compelling practical contribution.

Holistic Evaluation Framework: The paper doesn't propose a single metric but a suite of three complementary metrics that address different aspects of an inte

Weaknesses

Reliance on Key Assumptions: The methodology rests on two strong assumptions that may not always hold in practice: a single user goal per interaction and that failures are "rare." The performance degradation on Code-Feedback, where follow-up questions violate the "single well-defined end" assumption, highlights this fragility. The applicability in noisier, real-world environments with frequent failures is not fully established.

Limited Statistical Rigor: While the results are promising, the stat

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper addresses key shortcomings in current LLM evaluation methods.
- It presents extensive experiments demonstrating the effectiveness of the proposed approaches.

Weaknesses

- The overall paper structure could be improved. A concise introduction is fine, but it should still highlight the main points to orient the reader.
- The selected attributes require a clear rationale for why they are important for evaluating LLM responses.
- The metrics used to measure them also need clearer explanation and justification: how well do these metrics reflect the degree or quality of the attributes?
- The title is potentially misleading. What exactly is meant by “unsupervised”? Is

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The core ideas (unsupervised goal clustering via LLM+k-means hybridization, completion detection via distributional fine-tuning, and uncertainty quantification via response trees) are highly original.
2. The methods address a real and pressing need in enterprise AI development. The ability to perform evaluation without labels or human judges is a substantial practical advance.

Weaknesses

1. The method assumes "a majority of interactions are complete" to train the completion detector, which fundamentally undermines its unsupervised nature and creates a validation problem.
2. The pipeline likely depends on embedding model, initial k, prompt phrasing for merges/labels, and decoding hyperparameters. Robustness to these choices is not fully demonstrated.
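
Reviewer 03 describes the goal-labeling step as a hybrid of an LLM and k-means, and flags its dependence on the embedding model and the initial k. The clustering half of that idea can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D vectors are hypothetical stand-ins for embeddings of user turns, the farthest-point initialization is an assumption chosen to keep the toy deterministic, and the LLM merge and labeling steps are omitted.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all chosen seeds."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(tuple(nxt))
    return centroids


def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) over equal-length tuples of floats."""
    centroids = farthest_point_init(points, k)
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical 2-D "embeddings" of the opening user turn in six interactions.
toy_embeddings = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),  # e.g. password-reset requests
    (5.0, 5.1), (5.1, 4.9), (4.9, 5.0),    # e.g. order-cancellation requests
]
labels, centroids = kmeans(toy_embeddings, k=2)
```

In this toy, interactions that share a cluster index would be treated as sharing a user goal. Changing `k` or the seeding changes the grouping, which is the sensitivity the reviewer asks the authors to quantify.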

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI