AutoLibra: Agent Metric Induction from Open-Ended Human Feedback

Hao Zhu; Phil Cuvin; Xinkai Yu; Charlotte Ka Yee Yan; Jason Zhang; Diyi Yang

arXiv:2505.02820 · cs.AI · October 31, 2025

TL;DR

AutoLibra is a framework that transforms open-ended human feedback into concrete, fine-grained evaluation metrics for agents, enabling better assessment and improvement of agent behaviors beyond coarse success measures.

Contribution

AutoLibra introduces a novel method for grounding, clustering, and defining agent behavior metrics from open-ended feedback, enhancing evaluation and optimization of language agents.

Findings

01

AutoLibra induces more concrete evaluation metrics than previous benchmarks.

02

AutoLibra can identify new metrics for analyzing agent behaviors.

03

AutoLibra supports agent self-regulation and iterative prompt improvement.

Abstract

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation that transforms open-ended human feedback (*e.g.*, "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own") into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and…
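As a minimal sketch of how two such meta-metrics might be computed, assume that feedback has already been split into aspects, that an LLM judge has detected traits (metric-tagged behaviors) in each trajectory, and that a matching step has paired aspects with traits. Here coverage is taken to be the fraction of feedback aspects matched to at least one trait, and redundancy the fraction of detected traits matched to no aspect. These definitions and every name below (`Match`, `coverage`, `redundancy`) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record pairing one feedback aspect with one detected trait."""
    aspect_id: str
    trait_id: str


def coverage(aspect_ids, matches):
    """Fraction of feedback aspects matched to at least one trait (assumed definition)."""
    aspects = set(aspect_ids)
    if not aspects:
        return 0.0
    covered = {m.aspect_id for m in matches if m.aspect_id in aspects}
    return len(covered) / len(aspects)


def redundancy(trait_ids, matches):
    """Fraction of detected traits matched to no feedback aspect (assumed definition)."""
    traits = set(trait_ids)
    if not traits:
        return 0.0
    matched = {m.trait_id for m in matches if m.trait_id in traits}
    return 1.0 - len(matched) / len(traits)


# Toy data: four feedback aspects, three detected traits, three matches.
aspects = ["a1", "a2", "a3", "a4"]
traits = ["t1", "t2", "t3"]
matches = [Match("a1", "t1"), Match("a2", "t1"), Match("a3", "t2")]

cov = coverage(aspects, matches)    # 3 of 4 aspects are matched
red = redundancy(traits, matches)   # 1 of 3 traits is unmatched
print(f"coverage={cov:.2f} redundancy={red:.2f}")
```

Under these assumed definitions, a good metric set pushes coverage up while keeping redundancy low, so the two quantities can be compared across candidate metric sets of different sizes.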

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. Timely problem & clear formulation. Moving beyond task-success rates to behavioral evaluation is important for LLM agents interacting with humans. The coverage/redundancy meta-metrics give a principled way to select metric sets that actually reflect what people say they want.
2. Interpretability & reusability. Metric definitions with examples are human-legible and reusable across tasks, which is valuable for human–AI interaction workflows (e.g., aligning internal diagnostics with user-visib

Weaknesses

1. Judge-dependence & circularity. Every stage (grounding, clustering, judging) leans on LLMs. Without strong cross-judge and human-only checks, it risks metric drift or Goodhart effects (agents optimize to a judge’s quirks rather than human satisfaction).
2. Sensitivity of the meta-metrics. Coverage and redundancy depend on how “aspects” are extracted and granularized; small parsing or clustering changes could alter the frontier. The paper would be stronger with sensitivity analyses (judge mod

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- Produces fine-grained, actionable behavioral metrics (e.g., Access Barrier Handling, Error Recovery and Adjustment, Navigation Accuracy).
- Discovers failure modes that were not captured in pre-existing benchmark taxonomies (e.g., WebVoyager: Query/Search Strategy Efficiency (approx. 7%), Final Output Quality (approx. 18%)).
- Demonstrates self-regulated improvement: optimizing induced metrics yields ~20% success gain on Baba-is-AI without directly optimizing success rate.
- Step-wise human v

Weaknesses

In general there are not too many strong weaknesses:
- LLM dependence and limited visibility in clustering: Please report clustering stability: fix the optimal number of metrics (N), run at least three different random seeds, and quantify how similar the resulting metric sets are (for example, by matching clusters and comparing overlap, or by providing a small human-judged semantic comparison across samples).
- Generalization not demonstrated across domains: A small cross-dataset test would

Reviewer 03 · Rating 6 · Confidence 4
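The stability check requested in the first point above (matching clusters across seeds and comparing overlap) could be sketched as follows. This is only one possible reading of the request: it assumes each run yields a mapping from metric name to the set of behavior IDs clustered under it, matches each cluster to its best counterpart by Jaccard overlap, and averages those scores. The function names and toy data are hypothetical and do not come from the paper or its repository.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of behavior IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_best_match_overlap(run_a, run_b):
    """Average, over clusters in run_a, of the best Jaccard overlap found in run_b.

    Note: this score is asymmetric; average both directions for a symmetric value.
    """
    if not run_a or not run_b:
        return 0.0
    best = [
        max(jaccard(members, other) for other in run_b.values())
        for members in run_a.values()
    ]
    return sum(best) / len(best)


# Toy data: two seeds with the same number of metrics but different names and members.
seed_1 = {"error recovery": {1, 2, 3}, "navigation": {4, 5}}
seed_2 = {"recovery": {1, 2}, "nav accuracy": {4, 5, 6}}

score = mean_best_match_overlap(seed_1, seed_2)
print(f"stability={score:.3f}")
```

A score near 1 would indicate that the two seeds group behaviors in nearly the same way regardless of how the metrics are named; a human-judged semantic comparison, as the reviewer also suggests, would complement this purely set-based measure.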

Strengths

- The loop to extract metrics and automatically measure coverage is novel (from what I have seen).
- These metrics should help the autograders score new cases (positive and negative examples are a nice touch).
- Validated the approach on a 20% held-out set.
- Metrics as a function of observed trajectories is a very cool idea.

Weaknesses

- Only 118 trajectories were human-labelled, with each trajectory taking only 5 minutes. This is quite a small sample size IMO. It would be good to see this method applied to and validated against a larger set.
- This method relies heavily on LLM performance. The paper should dedicate more ablation studies and effort to current LLM proficiency at this task.
- Given the importance of LLM performance at this task I think a method such as the one detailed here https://arxiv.org/abs/2507.03772, whic

Code & Models

Repositories

open-social-world/autolibra
Official

Datasets

open-social-world/autolibra
dataset · 64k downloads

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training