ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

TL;DR
ProfBench introduces a comprehensive evaluation benchmark for large language models in professional domains, highlighting their current limitations and disparities across models, while providing scalable evaluation tools.
Contribution
The paper presents a new multi-domain benchmark with human-annotated response criteria and robust evaluation methods to assess LLMs on professional tasks.
Findings
Top models achieve only 65.9% performance on ProfBench.
Significant performance gaps exist between proprietary and open models.
Extended thinking is crucial for tackling complex professional tasks.
Abstract
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for…
Peer Reviews
Decision: ICLR 2026 Poster
- The tasks are realistic, complex, with grounding documents and created by domain experts, with reviewer feedback.
- The LLM judges are evaluated in a clear way that also takes into account the bias towards the same model family.
- Separate re-annotations show high inter-annotator agreement. Moreover, the LLM judge is highly reliable, with a tiny difference with human-annotated scores.

- GPT-5 already achieves a high score on Consulting (~80%), while Physics lags at ~49%. This suggests that one domain might saturate sooner than the other.
- The set-up is text-only, even if tool use might be helpful for some tasks, e.g. calculators, spreadsheets, code etc.
- Despite current difficulty, the text-only format may offer limited room for improvement as models become more capable.
- Domain coverage is narrow, covering only two science and two business domains.

1. The benchmark covers multiple scientific domains and evaluates knowledge storage and complex reasoning capabilities. Its expert-designed criteria facilitate precise, granular assessment of LLM performance on challenging tasks.
2. The paper assesses a wide range of LLMs to provide comprehensive performance benchmarks. The experimental design encompasses comparisons across model accessibility (open-source and closed-source), scale, and reasoning capabilities.
3. The high-quality annotators grou

1. In Section 4, the LLM-as-judge is used as a binary classifier, with performance evaluated by F1 score. The target LLM is used to identify whether the provided criterion fulfills all the requirements to check the quality of the response. However, for such complex tasks, the F1 score only captures misalignment between the LLM and human experts. It does not reveal the LLM's internal understanding of the task or identify specific weaknesses.
2. In the rubric creation process from Section 3, the c

- A major strength is the exploration of LLMs as automated judges. Building on work in rubric-evaluation, the authors propose a framework to have LLMs determine if a given response satisfies each expert criterion. The framework aims to reduce self-enhancement bias (i.e., where LLM judges would favor responses from the same model or provider), as well as API costs.
- ProfBench benchmarks LLMs as report generators in scenarios that mirror actual professional workflows, requiring multi-step reasoni

- The benchmark’s scoring relies on LLM judges, and while the authors do measure agreement with human annotators, it’s shown that the best judge isn’t near perfect (<80% Macro-F1 overall). There are risks that LLM-judges might miss nuanced criteria fulfillment or penalize creative answers. The paper doesn’t deeply discuss failure modes of the LLM-judge.
- ProfBench covers only four domains, leaving out several important domains of professional reasoning, notably the legal, health, and engineeri
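
The reviews above describe the LLM judge as a binary classifier over response-criterion pairs, scored against human expert labels with Macro-F1. As a minimal sketch of that metric, and not the authors' evaluation code, the following Python computes Macro-F1 from two lists of boolean verdicts. The function names and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def f1_for_class(human: Sequence[bool], judge: Sequence[bool], positive: bool) -> float:
    """F1 score of the judge's verdicts, treating `positive` as the target class."""
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Unweighted mean of the F1 scores for the 'fulfilled' and 'not fulfilled' classes."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    return (f1_for_class(human, judge, True) + f1_for_class(human, judge, False)) / 2


# Hypothetical toy data: one verdict per response-criterion pair
# (True means the criterion is judged as fulfilled).
human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
score = macro_f1(human_labels, judge_labels)
```

On this toy data the judge misses one fulfilled criterion, giving per-class F1 scores of 2/3 and 4/5 and a Macro-F1 of 11/15 (about 0.73). Averaging the two classes equally keeps a judge from scoring well simply by always predicting the more common verdict.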
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Advanced Graph Neural Networks
