Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization
Yao Dou, Wei Xu

TL;DR
This paper introduces Gavel-Ref, a comprehensive evaluation framework for assessing large language models on complex long-context legal summarization tasks, revealing current limitations and proposing an autonomous agent approach to improve efficiency.
Contribution
The paper presents Gavel-Ref, a multi-value checklist evaluation framework for legal summarization, and Gavel-Agent, an autonomous tool-enhanced LLM system that reduces token usage with minimal performance loss.
Findings
Even the best models score only around 50 on Gavel-Ref.
Models excel at simple checklist items but struggle with complex ones.
Gavel-Agent reduces token usage by 36% with only a 7% performance drop.
Abstract
Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves a Gavel-Ref score of only around 50, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on…
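To make the idea of multi-value checklist evaluation concrete, here is a minimal sketch. It is not the paper's scoring procedure: the abstract does not specify how values are matched or aggregated, so everything here is an assumption. The sketch scores each checklist item by F1 between the set of reference values and the set of values found in a model summary, using exact matching after light normalization, and averages the items on a 0-100 scale. The function names, the item name `remedies_sought`, and all example values are hypothetical.

```python
from typing import Dict, Set


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(value.lower().split())


def item_f1(reference: Set[str], predicted: Set[str]) -> float:
    """F1 between reference and predicted value sets for one checklist item.

    Assumption: values match only if equal after normalization. A real
    evaluator would likely need a softer notion of equivalence.
    """
    ref = {normalize(v) for v in reference}
    pred = {normalize(v) for v in predicted}
    if not ref and not pred:
        return 1.0  # nothing expected, nothing produced
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def checklist_score(
    reference: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> float:
    """Mean item F1 over items that have reference values, scaled to 0-100."""
    items = [name for name, values in reference.items() if values]
    if not items:
        return 0.0
    total = sum(item_f1(reference[name], predicted.get(name, set())) for name in items)
    return 100.0 * total / len(items)


# Hypothetical example: one single-value item and one multi-value item.
reference = {
    "filing_date": {"2025-01-15"},
    "remedies_sought": {"injunctive relief", "damages", "attorneys' fees"},
}
predicted = {
    "filing_date": {"2025-01-15"},
    "remedies_sought": {"Damages"},
}
score = checklist_score(reference, predicted)
```

In this example the single-value item is matched exactly, while the multi-value item recovers one of three reference values (precision 1, recall 1/3, F1 0.5), so the overall score is about 75. This mirrors, in a toy setting, the reported pattern of simple items being easier than items with several values.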
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper conducts a systematic evaluation of 12 frontier LLMs across five context length scales (32K–512K tokens), using predominantly 2025 cases (90% from 2025). This design offers credible insights into how state-of-the-art models process truly long contexts. Instead of reporting only aggregate performance scores, the study provides fine-grained, item-level analyses, revealing that top-performing models achieve near-perfect accuracy on simple single-value items (e.g., filing date: 0.99). Th
The meta-evaluation in Section 2.3 is based on only 20 long summaries for checklist validation, with 15 receiving single annotations. Such a limited sample may fail to capture the full spectrum of edge cases and annotation disagreements across the 50 diverse cases evaluated in the main study. The relatively low inter-annotator agreement for certain tasks (e.g., Krippendorff’s $\alpha$ = 0.32 for style ratings) questions the reliability of these annotations as ground-truth references. Although
I love that the evaluations are very comprehensive, spanning across 12 frontier models. In addition, I appreciate your efforts in conducting a relatively large scale human annotation effort, which enhances the rigor of this study. There are also in-depth analyses of the failure modes and how top models succeed in this task, which gives us more insights.
My biggest issue with the paper is that the contribution feels very incremental. The main method is a follow-up on an existing paper (https://arxiv.org/pdf/2506.01241) which is not a popular and widespread evaluation method as of today. Also, the existing method from Ruan et al. (2025) relied on 26 items from the legal experts in the study, and it really feels like the premise of Ruan et al. (2025) needs to be more general and robust. In addition, this paper presents a low inter-annotator agreem
- Checklist-based evaluation is a reliable option for certain technical domains such as finance and legal. This paper highlights key limitations with an existing checklist-based benchmark (ExpertLongBench) and proposes fixes through multi-value answers and residual facts.
- The paper also explores the idea of extracting answers to the checklist directly from source documents. This is especially important for long input tasks because human written checklists are expensive to collect (if feasible)
- The paper needs additional discussion of the results from Figure 2. It's unclear (and a bit surprising) why the models perform better at longer inputs than shorter inputs. This seems counter-intuitive and some qualitative analysis here could be helpful.
- The proposed agent-based method significantly underperforms a strong long-context model (GPT-4.1). To show that the agent approach is reliable, it would be interesting to explore stronger models within the agent setup. I understand the argumen
The authors defined a reference-based evaluation framework for comprehensively assessing legal summarization with checklist-style, residual-fact, and writing-style evaluations. This allows for more nuanced evaluation beyond a single scalar score.
- The number of cases (50) used in the evaluation is very limited. It is also unclear how these cases were selected. Given so many different legal areas, types of legal documents, jurisdictions, etc., the representativeness of the 50 cases is questionable. For instance, a "good" summary for a contract law case could look very different from one for a criminal law case.
- Judgments from LLMs are used extensively in the GAVEL-REF framework, yet it's unclear which model(s) are used in which step. There is no
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Law · Computational and Text Analysis Methods
