PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Afra Feyza Akyürek; Advait Gosai; Chen Bo Calvin Zhang; Vipul Gupta; Jaehwan Jeong; Anisha Gunjal; Tahseen Rabbani; Maria Mazzone; David Randolph; Mohammad Mahmoudi Meymand; Gurshaan Chattha; Paula Rodriguez; Diego Mares; Pavit Singh; Michael Liu; Subodh Chawla; Pete Cline; Lucy Ogaz; Ernesto Hernandez; Zihao Wang; Pavi Bhatter; Marcos Ayestaran; Bing Liu; Yunzhong He

arXiv:2511.11562·cs.CL·November 17, 2025

Open Access

TL;DR

PRBench is a large, open-source benchmark with expert-curated tasks in Finance and Law, designed to evaluate AI models on real-world, high-stakes professional reasoning, revealing significant gaps in current model capabilities.

Contribution

The paper introduces PRBench, the largest public rubric-based benchmark for legal and finance domains, with expert-authored tasks and rigorous validation, enabling detailed assessment of AI performance in professional contexts.

Findings

01

The top model scores are only 0.39 (Finance) and 0.37 (Legal) on the benchmark's Hard subsets, indicating substantial room for improvement.

02

Models often make inaccurate judgments and lack transparent reasoning.

03

Performance varies significantly across different capabilities and failure modes.

Abstract

Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Law