LLM-based Embedders for Prior Case Retrieval
Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov

TL;DR
This paper explores the use of large language model-based text embedders for prior case retrieval in legal systems, demonstrating their effectiveness over traditional IR methods and supervised models across multiple benchmarks.
Contribution
The paper introduces LLM-based embedders for PCR, which overcome the input length and data scarcity issues, and shows that they outperform existing methods on benchmark datasets.
Findings
LLM-based embedders outperform BM25 and supervised models in PCR.
They effectively handle lengthy legal texts without truncation.
Unsupervised LLM embedders perform well across multiple datasets.
Abstract
In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. State-of-the-art deep learning IR methods have not been successful in PCR because of two key challenges: (i) Lengthy legal text limitation: powerful BERT-based transformer models limit the length of the input text, which inevitably requires shortening the input via truncation or division…
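The retrieval task described in the abstract (score every candidate case against a query, then rank) can be illustrated with a minimal sketch. This is not the authors' method: `toy_embed` is a hypothetical stand-in that builds a plain term-count vector, where an actual system would call an LLM-based text embedder, and the case ids and texts are invented for illustration only.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical placeholder for an embedder: a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates, embed=toy_embed):
    """Return (case_id, score) pairs sorted by descending similarity to the query."""
    q = embed(query)
    scored = [(cid, cosine(q, embed(text))) for cid, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy pool of candidate cases (assumption, not from the paper).
candidates = {
    "case_a": "breach of contract damages for late delivery",
    "case_b": "criminal appeal against sentence for theft",
    "case_c": "contract dispute over delivery of goods",
}
ranking = rank_candidates("damages for breach of contract", candidates)
for case_id, score in ranking:
    print(case_id, round(score, 3))
```

Because `rank_candidates` takes the embedding function as a parameter, the term-count placeholder could be swapped for any function that maps text to a vector, provided `cosine` is adapted to that vector type.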
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Data Quality and Management · Semantic Web and Ontologies
