TL;DR
AdaRubric is a task-adaptive evaluation framework for LLM agents. It generates task-specific rubrics from task descriptions, scores trajectories step by step with confidence-weighted, per-dimension scores, and turns those scores into dense reward signals for preference learning.
Contribution
Introduces adaptive, task-specific rubric generation in place of the fixed rubric used by standard LLM-as-Judge evaluation, improving evaluation accuracy and reward learning for LLM agents across diverse tasks and modalities.
Findings
Achieves a Pearson correlation of r = 0.79 with human judgments, outperforming baselines.
Improves task success rates by 6.8-8.5% over the best baseline.
Generalizes zero-shot to unseen domains and multimodal agents.
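For readers unfamiliar with the headline metric, the sketch below shows how a Pearson correlation between automatic scores and human ratings is typically computed. The function name `pearson_r` and the sample values are hypothetical illustrations, not data or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: automatic evaluator scores vs. human ratings
# for the same five trajectories (made up for illustration only).
judge_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
human_ratings = [5.0, 3.0, 2.5, 3.5, 1.0]
r = pearson_r(judge_scores, human_ratings)
```

A value near 1 means the automatic scores rank and space trajectories much as human raters do; a value near 0 means little linear agreement.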
Abstract
Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over…
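The abstract does not give the scoring or filtering formulas, so the following is only a minimal sketch of one plausible reading: per-dimension scores are aggregated as a confidence-weighted mean over steps, and a dimension-aware filter rejects a trajectory when any single dimension falls below a floor, even if the overall average looks good. All names (`StepScore`, `dimension_scores`, `passes_dimension_aware_filter`), the 1-5 score scale, and the thresholds are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    dimension: str     # rubric dimension, e.g. "Correctness"
    score: float       # assumed 1-5 scale (hypothetical)
    confidence: float  # evaluator confidence in [0, 1]


def dimension_scores(steps):
    """Confidence-weighted mean score per rubric dimension."""
    totals = {}
    for s in steps:
        num, den = totals.get(s.dimension, (0.0, 0.0))
        totals[s.dimension] = (num + s.confidence * s.score, den + s.confidence)
    # A dimension with zero total confidence gets 0.0 rather than dividing by zero.
    return {d: (num / den if den > 0 else 0.0) for d, (num, den) in totals.items()}


def passes_dimension_aware_filter(dim_scores, overall_min, per_dim_min):
    """Keep a trajectory only if the average AND every dimension clear their floors."""
    if not dim_scores:
        return False
    overall = sum(dim_scores.values()) / len(dim_scores)
    return overall >= overall_min and all(v >= per_dim_min for v in dim_scores.values())


# Hypothetical trajectory: strong Correctness, weak Error Handling.
steps = [
    StepScore("Correctness", 5.0, 1.0),
    StepScore("Correctness", 1.0, 0.0),   # zero confidence, so it carries no weight
    StepScore("Error Handling", 2.0, 0.5),
    StepScore("Error Handling", 2.0, 0.5),
]
scores = dimension_scores(steps)
kept = passes_dimension_aware_filter(scores, overall_min=3.0, per_dim_min=2.5)
```

In this example the overall average clears the overall threshold, yet the trajectory is rejected because Error Handling is below the per-dimension floor. That is the kind of dimension-level masking a filter based only on the aggregate score would miss.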
