Can Small Models Reason About Legal Documents? A Comparative Study
Snehit Vaddi

TL;DR
This study evaluates how well small language models reason about legal documents. It finds that certain 3B-9B models match or surpass larger models such as GPT-4o-mini on specific legal tasks, and that they can be served cost-effectively through cloud inference.
Contribution
It demonstrates that small, cost-efficient models can effectively perform legal reasoning tasks, highlighting the importance of architecture and prompting strategies over sheer size.
Findings
A Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy.
Few-shot prompting is the most consistently effective strategy across tasks.
Retrieval method (BM25 vs dense) has minimal impact on performance.
Abstract
Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal…
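The experiment count in the abstract can be checked with a short sketch. It assumes the 405 experiments form a full grid of models, benchmarks, prompting strategies, and seeds, which is one reading consistent with the stated numbers (9 x 3 x 5 x 3 = 405). The model names and seed values below are hypothetical placeholders, not taken from the paper.

```python
from itertools import product

# Benchmarks and prompting strategies as named in the abstract.
benchmarks = ["ContractNLI", "CaseHOLD", "ECtHR"]
strategies = ["direct", "chain-of-thought", "few-shot", "BM25 RAG", "dense RAG"]

# Hypothetical placeholders: the abstract gives only the counts
# (nine models, three seeds), not the names or seed values.
models = [f"model_{i}" for i in range(1, 10)]
seeds = [0, 1, 2]

# One configuration = (model, benchmark, strategy).
configs = list(product(models, benchmarks, strategies))

# One run = a configuration evaluated with one random seed.
runs = [(m, b, s, seed) for (m, b, s) in configs for seed in seeds]

print(len(configs))  # 135 configurations
print(len(runs))     # 405 runs in total
```

Under this reading, each reported accuracy would be an average over the three seeded runs of one configuration.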
