CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions

Mourad Heddaya; Kyle MacMillan; Anup Malani; Hongyuan Mei; Chenhao Tan

arXiv:2501.00097·cs.CL·January 3, 2025

TL;DR

CaseSumm introduces the largest legal case summarization dataset with long-context opinions and evaluates LLMs, revealing limitations of automatic metrics and emphasizing the importance of human assessment.

Contribution

The paper presents a new large-scale legal summarization dataset and provides a comprehensive evaluation of LLM-generated summaries, highlighting challenges in automatic evaluation methods.

Findings

01 Mistral 7b outperforms larger models on most automatic metrics.

02 Human expert evaluators find GPT-4 summaries clearer and more accurate.

03 Automatic metrics do not correlate well with human judgments.
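One common way to quantify agreement between an automatic metric and human ratings is a rank correlation such as Spearman's rho. The sketch below is illustrative only: the listing does not say which correlation statistic the authors used, the function names are hypothetical, and the toy scores are invented for the example rather than taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data, NOT results from the paper:
# one automatic-metric score and one human rating per summary.
metric_scores = [0.41, 0.38, 0.45, 0.30, 0.36]
human_ratings = [3, 4, 2, 4, 5]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A rho near 1 would mean the metric ranks summaries the way human raters do; values near 0 or below indicate the kind of mismatch this finding describes.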

Abstract

This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain that addresses the need for longer and more complex datasets for summarization evaluation. We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as "syllabuses." Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815. We also present a comprehensive evaluation of LLM-generated summaries using both automatic metrics and expert human evaluation, revealing discrepancies between these assessment methods. Our evaluation shows Mistral 7b, a smaller open-source model, outperforms larger models on most automatic metrics and successfully generates syllabus-like summaries. In contrast, human expert annotators indicate that Mistral summaries contain hallucinations. The…

Code & Models

Datasets

ChicagoHAI/CaseSumm
dataset · 324 dl

Videos

CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions· underline

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · Residual Connection · Multi-Head Attention