Lawma: The Power of Specialization for Legal Annotation
Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore

TL;DR
This paper introduces CaselawQA, a benchmark for legal annotation, and shows that fine-tuned open-source models outperform commercial large language models like GPT-4.5 in legal text annotation tasks.
Contribution
It presents a new benchmark for legal annotation and demonstrates the effectiveness of fine-tuned open-source models over commercial models.
Findings
Commercial models have variable accuracy in legal annotation.
Fine-tuned models outperform commercial models with limited labeled data.
A few hundred labeled examples suffice for a fine-tuned model to reach higher accuracy than commercial models.
Abstract
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal annotation remains limited. To bridge this gap, we introduce CaselawQA, a benchmark comprising 260 legal annotation tasks, nearly all new to the machine learning community. We demonstrate that commercial models, such as GPT-4.5 and Claude 3.7 Sonnet, achieve non-trivial yet highly variable accuracy, generally falling short of the performance required for legal work. We then demonstrate that small, lightly fine-tuned models…
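The "highly variable accuracy" claim is easiest to read per task rather than pooled. The following is a minimal sketch, not the paper's evaluation code: the task names, labels, and record layout are hypothetical, and it only illustrates computing accuracy per annotation task and then summarizing the spread across tasks.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_task_accuracy(records):
    """Return {task: accuracy} from (task, predicted_label, gold_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def summarize(accuracies):
    """Summarize how much accuracy varies from one task to the next."""
    values = list(accuracies.values())
    return {
        "macro_accuracy": mean(values),  # every task weighted equally
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),  # population standard deviation across tasks
    }


# Toy records with hypothetical task names and labels (not benchmark data).
records = [
    ("issue_area", "criminal", "criminal"),
    ("issue_area", "civil", "criminal"),
    ("respondent_state", "TX", "TX"),
    ("respondent_state", "CA", "CA"),
]

acc = per_task_accuracy(records)
summary = summarize(acc)
print(acc)
print(summary)
```

Averaging per task (a macro average) keeps a few large, easy tasks from hiding poor accuracy on small, hard ones; whether the benchmark itself reports macro or pooled accuracy is not stated in this summary.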
Peer Reviews
Decision: ICLR 2025 Poster
The strengths include: - Carefully and thoroughly done experimentation. Even when I disagreed with the choices they made for how to measure things in the main paper, the way I would have done things is usually available in the lengthy appendices. The paper reads as comprehensive, not rushed. - A valuable new benchmark dataset, with nearly all new tasks rather than just collating existing tasks.
- The tasks of the dataset are derived from a pre-existing database by programmatic means. This is a strength since they are annotations that have been built up by lawyers and political scientists and so have ecological validity, but a weakness in that they were in a sense pre-existing rather than this being a major contribution of new labeling. - The paper is not very original: There is no new machine learning, there are several pre-existing benchmark legal datasets to which this adds another o
- Legal text processing has economic value but is difficult for most state-of-the-art LLMs. - The authors assemble a large dataset of real-world legal documents (via querying third-party services), annotated with diverse question-answering tasks. - They show the effectiveness of fine-tuning on this dataset, over few-shot learning. They report performance of various tuning configurations.
- There is little exploration of the relative difficulty of the different types of tasks they introduce; it would be useful to know more about which tasks GPT-4 outperformed on, and where fine-tuning had minimal vs. substantial gains. - The paper is awkwardly organized, such as “limitations” abruptly inserted before the main contributions are outlined. - “The costs and error of existing methods is the single most important bottleneck in the empirical legal studies pipeline.” (39-40) is vague
- The proposed data seems very useful and covers a broad range. - The analysis evaluates a wide range of models and situations. - The fine-tuning experiments show that large gains can be had with domain-specific specialization. - The appendix has a lot of good information about where the data came from and its inter-annotator agreement.
- It is somewhat unclear how the authors created these tasks, in terms of how the questions were designed. E.g., who wrote the explanation for each of the legal variables provided by USCAD? How are the authors sure that these are accurate representations of the classification? - [Minor] Some of these tasks are pretty niche/easy ("What state is associated with the respondent"), but again this comes from using the variables in some schema.
Taxonomy
Topics: Comparative and International Law Studies · European and International Contract Law · Legal Education and Practice Innovations
Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections
