AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

Liang Ding

arXiv:2603.21362·cs.AI·May 12, 2026

TL;DR

AdaRubric is a task-adaptive evaluation framework for LLM agents that generates specific rubrics from task descriptions, evaluates trajectories with confidence-weighted scores, and improves reward learning and generalization.

Contribution

The paper introduces an adaptive rubric-generation method that improves evaluation accuracy and reward learning for LLM agents across diverse tasks and modalities.

Findings

01

Achieves a Pearson correlation of r = 0.79 with human judgments, outperforming baselines.

02

Improves task success rates by 6.8-8.5% over the best baseline.

03

Generalizes zero-shot to unseen domains and multimodal agents.
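For readers unfamiliar with the human-correlation metric in the first finding, the following is a minimal sketch of how a Pearson r between evaluator scores and human ratings can be computed. The function name `pearson_r` and the toy numbers are illustrative assumptions, not data or code from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance numerator and the two standard-deviation numerators;
    # the 1/n factors cancel in the ratio, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Made-up toy scores for five trajectories (NOT results from the paper).
evaluator_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
human_ratings = [5.0, 3.0, 2.5, 3.5, 1.0]
print(f"r = {pearson_r(evaluator_scores, human_ratings):.2f}")
```

A value near 1 means the automatic evaluator ranks and spaces trajectories much as human raters do; a value near 0 means the two sets of scores are essentially unrelated.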

Abstract

Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over…
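The abstract's pipeline (per-dimension, confidence-weighted scoring, a dimension-aware filter, and preference-pair construction) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `StepScore` schema, the 1 to 5 score scale, the unweighted mean over dimensions, the `floor` and `min_margin` thresholds, and all function names are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StepScore:
    """One judge rating for one step on one rubric dimension (hypothetical schema)."""
    dimension: str
    score: float       # assumed scale: 1 (poor) to 5 (excellent)
    confidence: float  # assumed judge confidence in [0, 1]


def dimension_scores(steps: List[StepScore]) -> Dict[str, float]:
    """Confidence-weighted mean score per dimension across all steps."""
    weighted: Dict[str, float] = {}
    total_conf: Dict[str, float] = {}
    for s in steps:
        weighted[s.dimension] = weighted.get(s.dimension, 0.0) + s.confidence * s.score
        total_conf[s.dimension] = total_conf.get(s.dimension, 0.0) + s.confidence
    # Dimensions with zero total confidence are skipped rather than guessed.
    return {d: weighted[d] / total_conf[d] for d in weighted if total_conf[d] > 0}


def overall_score(dim_scores: Dict[str, float]) -> float:
    """Unweighted mean over dimensions (an assumption; a rubric could weight them)."""
    if not dim_scores:
        raise ValueError("no scored dimensions")
    return sum(dim_scores.values()) / len(dim_scores)


def passes_dimension_filter(dim_scores: Dict[str, float], floor: float = 3.0) -> bool:
    """Require every dimension to reach the floor, so a high average
    cannot mask a single weak dimension."""
    return bool(dim_scores) and all(v >= floor for v in dim_scores.values())


def preference_pair(
    candidates: Dict[str, List[StepScore]],
    floor: float = 3.0,
    min_margin: float = 0.5,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) trajectory names, or None if no clean pair exists."""
    scored = {name: dimension_scores(steps) for name, steps in candidates.items()}
    scored = {name: ds for name, ds in scored.items() if ds}
    overall = {name: overall_score(ds) for name, ds in scored.items()}
    eligible = [n for n in scored if passes_dimension_filter(scored[n], floor)]
    if not eligible:
        return None
    chosen = max(eligible, key=lambda n: overall[n])
    rejected = min(overall, key=lambda n: overall[n])
    if chosen == rejected or overall[chosen] - overall[rejected] < min_margin:
        return None
    return chosen, rejected


# Toy trajectories for a code-debugging task (made-up numbers).
traj_a = [
    StepScore("Correctness", 5, 0.9),
    StepScore("Correctness", 4, 0.6),
    StepScore("Error Handling", 4, 0.8),
]
traj_b = [
    StepScore("Correctness", 5, 0.9),
    StepScore("Error Handling", 2, 0.9),
]
print(preference_pair({"a": traj_a, "b": traj_b}))
```

In this toy case trajectory b averages about 3.5, which would clear a threshold of 3.0 on the aggregate score alone, but its Error Handling score of 2 fails the per-dimension floor. That is the kind of masking a dimension-level filter is meant to catch; the paper's actual filter definition and guarantees may differ from this simplified check.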

Code & Models

Repositories

alphadl/AdaRubrics (GitHub)
