AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang

TL;DR
AutoLibra is a framework that transforms open-ended human feedback into concrete, fine-grained evaluation metrics for agents, enabling better assessment and improvement of agent behaviors beyond coarse success measures.
Contribution
AutoLibra introduces a novel method for grounding, clustering, and defining agent behavior metrics from open-ended feedback, enhancing evaluation and optimization of language agents.
Findings
AutoLibra induces more concrete evaluation metrics than previous benchmarks.
AutoLibra can identify new metrics for analyzing agent behaviors.
AutoLibra supports agent self-regulation and iterative prompt improvement.
Abstract
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation that transforms open-ended human feedback, *e.g.*, "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and…
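The abstract is cut off before the two meta metrics are defined, so the following is only a minimal sketch under assumed definitions, not the paper's implementation. It assumes that "coverage" is the fraction of feedback aspects matched by at least one behavior detected under the induced metrics, and that "redundancy" is the fraction of detected behaviors that match no feedback aspect. The function name `coverage_and_redundancy` and the index-pair representation of matches are hypothetical.

```python
from typing import Iterable, Tuple


def coverage_and_redundancy(
    n_aspects: int,
    n_traits: int,
    matches: Iterable[Tuple[int, int]],
) -> Tuple[float, float]:
    """Compute (coverage, redundancy) under ASSUMED definitions.

    n_aspects: number of behaviors grounded from human feedback.
    n_traits:  number of behaviors detected by the metric-based judge.
    matches:   (aspect_index, trait_index) pairs judged to be the same behavior.
    """
    matched_aspects = set()
    matched_traits = set()
    for aspect, trait in matches:
        if not (0 <= aspect < n_aspects and 0 <= trait < n_traits):
            raise ValueError(f"match ({aspect}, {trait}) is out of range")
        matched_aspects.add(aspect)
        matched_traits.add(trait)

    # Assumed: coverage = share of feedback aspects explained by some metric.
    coverage = len(matched_aspects) / n_aspects if n_aspects else 0.0
    # Assumed: redundancy = share of detected traits no feedback asked about.
    redundancy = (n_traits - len(matched_traits)) / n_traits if n_traits else 0.0
    return coverage, redundancy


# Toy example: 4 feedback aspects, 5 detected traits, 3 matches.
cov, red = coverage_and_redundancy(4, 5, [(0, 0), (1, 0), (2, 3)])
print(cov, red)
```

Under these assumptions, a good metric set pushes coverage up while keeping redundancy low; how matches are produced (for example, by an LLM judge) is outside this sketch.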
Peer Reviews
Decision: ICLR 2026 Poster
1. Timely problem & clear formulation. Moving beyond task-success rates to behavioral evaluation is important for LLM agents interacting with humans. The coverage/redundancy meta-metrics give a principled way to select metric sets that actually reflect what people say they want.
2. Interpretability & reusability. Metric definitions with examples are human-legible and reusable across tasks, which is valuable for human–AI interaction workflows (e.g., aligning internal diagnostics with user-visib
1. Judge-dependence & circularity. Every stage (grounding, clustering, judging) leans on LLMs. Without strong cross-judge and human-only checks, it risks metric drift or Goodhart effects (agents optimize to a judge’s quirks rather than human satisfaction).
2. Sensitivity of the meta-metrics. Coverage and redundancy depend on how “aspects” are extracted and granularized; small parsing or clustering changes could alter the frontier. The paper would be stronger with sensitivity analyses (judge mod
- Produces fine-grained, actionable behavioral metrics (e.g., Access Barrier Handling, Error Recovery and Adjustment, Navigation Accuracy).
- Discovers failure modes that were not captured in pre-existing benchmark taxonomies (e.g., WebVoyager: Query/Search Strategy Efficiency (approx. 7%), Final Output Quality (approx. 18%)).
- Demonstrates self-regulated improvement: optimizing induced metrics yields ~20% success gain on Baba-is-AI without directly optimizing success rate.
- Step-wise human v
In general there are not too many strong weaknesses:
- LLM dependence and limited visibility in clustering: Please report clustering stability: fix the optimal number of metrics (N), run at least three different random seeds, and quantify how similar the resulting metric sets are (for example, by matching clusters and comparing overlap, or by providing a small human-judged semantic comparison across samples).
- Generalization not demonstrated across domains: A small cross-dataset test would
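The stability check this reviewer asks for (matching clusters across seeds and comparing overlap) could be sketched roughly as follows. This is an illustration under assumptions, not something from the paper: each clustering run is represented as a list of sets of behavior IDs, clusters are paired greedily by Jaccard overlap (a simple stand-in for an optimal assignment), and the helper names `jaccard` and `clustering_similarity` are hypothetical.

```python
from typing import Hashable, List, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard overlap of two clusters; two empty clusters count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def clustering_similarity(
    run_a: List[Set[Hashable]], run_b: List[Set[Hashable]]
) -> float:
    """Greedy one-to-one cluster matching, averaged over the larger run.

    Unmatched clusters contribute 0, so runs that split or merge
    clusters differently score below 1.0.
    """
    scored = sorted(
        (
            (jaccard(a, b), i, j)
            for i, a in enumerate(run_a)
            for j, b in enumerate(run_b)
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = 0.0
    for score, i, j in scored:
        if i in used_a or j in used_b:
            continue  # each cluster may be matched at most once
        used_a.add(i)
        used_b.add(j)
        total += score
    denom = max(len(run_a), len(run_b))
    return total / denom if denom else 1.0


# Toy example: seed B splits one of seed A's clusters in two.
seed_a = [{1, 2, 3}, {4, 5}]
seed_b = [{4, 5}, {1, 2}, {3}]
print(clustering_similarity(seed_a, seed_b))
```

Averaging this score over all pairs of seeds would give one stability number per value of N; an optimal (Hungarian) matching could replace the greedy pairing if exactness matters.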
- The loop to extract metrics and automatically measure coverage is novel (from what I have seen).
- These metrics should help the autograders score new cases (positive and negative examples are a nice touch).
- Validated approach on a 20% held-out set.
- Metrics as a function of observed trajectories is a very cool idea.
- Only 118 trajectories were human-labelled, with each trajectory taking only 5 minutes. This is quite a small sample size IMO. It would be good to see this method applied to and validated against a larger set.
- This method relies heavily on LLM performance. The paper should dedicate more ablation studies and effort to current LLM proficiency at this task.
- given the importance of LLM performance at this task I think a method such as the one detailed here https://arxiv.org/abs/2507.03772, whic
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
