JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang

TL;DR
JADE is a two-layer evaluation framework, modeled on how human experts assess work, for open-ended professional tasks. It balances stability and flexibility, which leads to better detection of agent failures and closer alignment with expert standards.
Contribution
JADE introduces a novel two-layer evaluation method combining stable skill-based criteria with dynamic, claim-level assessment for open-ended tasks.
Findings
Improves evaluation stability over LLM-only methods
Reveals critical failure modes missed by holistic evaluators
Successfully transfers to medical domain benchmarks
Abstract
Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and…
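The evidence-dependency gating described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Claim` structure, the `gate_claims` function, and the rule that a conclusion is invalidated whenever any claim it depends on (directly or transitively) is refuted are all assumptions made here for illustration. The per-claim verdicts are taken as given inputs; in practice they would come from some claim-level checker.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Claim:
    """Hypothetical claim record; field names are illustrative, not from the paper."""
    claim_id: str
    supported: bool  # verdict from an assumed claim-level checker
    depends_on: List[str] = field(default_factory=list)


def gate_claims(claims: List[Claim]) -> Dict[str, bool]:
    """Map each claim id to True if it survives gating.

    Assumed rule: a claim survives only if it is itself supported and every
    claim it depends on also survives. Unknown or cyclic dependencies fail.
    """
    by_id = {c.claim_id: c for c in claims}
    cache: Dict[str, bool] = {}

    def survives(cid: str, visiting: FrozenSet[str]) -> bool:
        if cid in cache:
            return cache[cid]
        if cid in visiting or cid not in by_id:
            # Cycle or missing evidence: treat conservatively as not surviving.
            return False
        claim = by_id[cid]
        ok = claim.supported and all(
            survives(dep, visiting | {cid}) for dep in claim.depends_on
        )
        cache[cid] = ok
        return ok

    return {c.claim_id: survives(c.claim_id, frozenset()) for c in claims}


# Example: a recommendation that rests on one refuted claim is invalidated,
# even though the recommendation itself was judged plausible in isolation.
claims = [
    Claim("revenue_grew", supported=True),
    Claim("market_share_up", supported=False),
    Claim("recommend_expand", supported=True,
          depends_on=["revenue_grew", "market_share_up"]),
]
result = gate_claims(claims)
```

In this sketch `result["recommend_expand"]` is `False` because `market_share_up` was refuted, while `revenue_grew` still counts on its own. A real evaluator might down-weight rather than fully invalidate dependent conclusions; the abstract as excerpted here does not specify that detail.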
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
