ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang

TL;DR
ExpertLongBench is a comprehensive benchmark with expert-validated tasks requiring long-form outputs, and CLEAR is a new evaluation framework that enables fine-grained, grounded assessment of large language models' performance on these tasks.
Contribution
The paper introduces ExpertLongBench, a new expert-level benchmark with structured rubrics and a novel evaluation framework, CLEAR, for assessing long-form language model outputs.
Findings
Existing LLMs perform poorly on expert-level tasks, with top scores around 33.4 F1.
Models can generate relevant content, but that content is often incorrect.
Open-weight models can effectively extract and compare checklists for evaluation.
Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with corresponding…
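The checklist comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before the scoring details, so the F1 formulation is inferred from the Findings, which report scores in F1. The function names, the example rubric items, and the string-equality comparator are all hypothetical. In practice the comparison would presumably be done by a model rather than by exact matching.

```python
from typing import Callable, Dict, Optional

# A checklist maps each rubric item to the text extracted for it,
# or None when the document says nothing about that item.
Checklist = Dict[str, Optional[str]]


def exact_match(candidate: str, reference: str) -> bool:
    """Placeholder comparator (assumption): case-insensitive equality."""
    return candidate.strip().lower() == reference.strip().lower()


def checklist_f1(
    model: Checklist,
    reference: Checklist,
    matches: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Score a model checklist against a reference checklist."""
    # Rubric items that each side actually filled in.
    model_items = {k for k, v in model.items() if v}
    ref_items = {k for k, v in reference.items() if v}

    # An item is correct only if both sides filled it and the contents agree.
    correct = sum(
        1 for k in model_items & ref_items if matches(model[k], reference[k])
    )

    precision = correct / len(model_items) if model_items else 0.0
    recall = correct / len(ref_items) if ref_items else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical rubric items for a legal-summary task.
reference = {
    "parties": "Smith v. Jones",
    "claims": "Eighth Amendment",
    "remedy": "injunction",
}
model = {
    "parties": "smith v. jones",   # present and correct
    "claims": "First Amendment",   # present but wrong
    "remedy": None,                # missing from the model output
}

scores = checklist_f1(model, reference)
print(scores)
```

In this toy case the model fills two items and gets one right, so precision is 1/2, recall is 1/3, and F1 is 0.4. Scoring item by item is what lets an evaluation separate content that is merely present from content that is correct.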
Peer Reviews
Decision: ICLR 2026 Poster
1. This work fills the gap in expert-level long-form generation evaluation, which appears to be of considerable significance.
2. This work proposes CLEAR, a method that extracts checklists from reference outputs for comparison. This approach is more accurate than existing rubric-based methods, and the experiments also demonstrate that models may game such rubric-based approaches.
1. The long-form evaluation method is excessively costly. As shown in Table 1, many tasks involve inputs exceeding 100,000 tokens. A single task with 100 samples therefore requires 10 million tokens, and this accounts only for input, excluding the cost of completion. The authors report their cost in Section G.3, where the evaluation cost them over $1,000. The practicality of evaluation at this scale is questionable.
2. In terms of evaluation, while it assesses long-form generation capabilities, it o
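The reviewer's input-token estimate is a simple product of sample count and input length. The sketch below reproduces that arithmetic. The function and parameter names are hypothetical, and `num_models` is an added illustrative multiplier that does not appear in the review.

```python
def estimate_input_tokens(
    num_samples: int,
    tokens_per_sample: int,
    num_models: int = 1,  # hypothetical: how many models are evaluated
) -> int:
    """Input tokens only; completion tokens are not counted."""
    return num_samples * tokens_per_sample * num_models


# The reviewer's figures: 100 samples at roughly 100,000 input tokens each.
per_task = estimate_input_tokens(num_samples=100, tokens_per_sample=100_000)
print(f"{per_task:,} input tokens for one task and one model")
```

Because the total scales linearly with each factor, evaluating several models or several long-input tasks multiplies it accordingly.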
1. The authors present a benchmark with long inputs and outputs that requires expert-level knowledge, covering 11 domains. This is a valuable resource for the community.
2. The authors conduct comprehensive experiments on ExpertLongBench, including many frontier models, and show that the benchmark remains highly challenging even for top-performing models.
3. The authors devote substantial effort to data quality and model analysis, for example examining potential data contamination and designin
1. **Checklist and rubric reliability.** My main concern is the quality of the checklists and rubrics, which the overall evaluation heavily relies on. While I acknowledge the authors’ efforts, many tasks appear to be curated by a single PhD student in a related field, which may be insufficient and could introduce bias. An ideal setup would involve multiple experts for curation, additional experts for independent review, and iterative updates until the experts reach a reliable agreement. I only n
1. The paper addresses a critical and widely acknowledged gap in LLM evaluation. Most benchmarks test for factual recall via multiple-choice or short-form answers. This work makes a significant contribution by shifting the evaluation to what experts actually do.
2. The CLEAR framework is a major strength. The idea of not comparing two long texts directly, but instead using an expert-designed rubric to decompose the task into a structured checklist, is a very strong and novel approach to "ground
1. A weakness is the small number of samples. The benchmark contains 1,050 samples in total, about 100 per task. It is questionable whether 100 samples are sufficient to draw robust conclusions about a model's performance across an entire complex domain like law or medicine. This small scale could lead to noisy results.
2. The paper states that a portion of the benchmark is public. Given the small sample size (50-100 per task) and the highly unique, long-form nature of the prompts and data, the
