Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language
Larissa Mori, Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca

TL;DR
This paper compares lexical and semantic retrieval models for legal texts, finding that lexical models excel with repetitive language, while fine-tuned semantic models outperform traditional baselines in specific scenarios.
Contribution
It provides a detailed analysis of when lexical versus semantic models are more effective for legal passage retrieval, especially considering the structured nature of legal language.
Findings
BM25 performs well in repetitive and longer query scenarios.
Fine-tuning dense models improves retrieval performance over BM25.
Repetitive legal language favors lexical models, while nuanced language benefits semantic models.
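To make the lexical side of this comparison concrete, here is a minimal sketch of BM25 scoring, assuming whitespace tokenization, the non-negative IDF variant used by Lucene, and the common defaults k1 = 1.5 and b = 0.75. None of these choices are taken from the paper, and the toy passages are hypothetical, not CJEU data. It only illustrates why a query that reuses a formulaic phrase verbatim scores highly against passages that contain the same wording.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, not the paper's setup).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy passages, written to mimic formulaic legal phrasing.
passages = [
    "the court has consistently held that the principle of proportionality applies",
    "the applicant submits that the contested decision infringes the regulation",
    "member states must ensure effective judicial protection",
]
docs = [p.split() for p in passages]
query = "the court has consistently held".split()

scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

The first passage shares the whole formulaic phrase with the query and ranks first. The third shares no terms and scores zero, which is the kind of case where a dense model would have to supply the match.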
Abstract
Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and…
Taxonomy
Topics: Artificial Intelligence in Law · Natural Language Processing Techniques · Topic Modeling
