The Cambridge Law Corpus: A Dataset for Legal AI Research
Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, and Felix Steffek

TL;DR
The paper introduces the Cambridge Law Corpus, a large dataset of UK court cases spanning centuries, with annotations and benchmarks for legal AI research, emphasizing ethical considerations and restricted access.
Contribution
It presents the first release of a large corpus of UK court cases, together with expert annotations of case outcomes for 638 cases and benchmarks for case outcome extraction.
Findings
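GPT-3, GPT-4, and RoBERTa models were trained and evaluated on case outcome extraction to provide benchmarks.
The dataset covers over 250,000 UK court cases, most from the 21st century, with some dating back to the 16th century.
The paper includes an extensive legal and ethical discussion, and the corpus is released only for research purposes under certain restrictions.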
Abstract
We introduce the Cambridge Law Corpus (CLC), a dataset for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.
Taxonomy
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Warmup With Linear Decay · Linear Layer · WordPiece · Label Smoothing · Absolute Position Encodings · Cosine Annealing · Transformer
