LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann

TL;DR
LEMUR introduces a large multilingual legal corpus and fine-tunes embedding models to significantly improve legal document retrieval across multiple languages, especially benefiting low-resource languages.
Contribution
The paper presents a new multilingual legal corpus and demonstrates how fine-tuning multilingual models enhances legal retrieval accuracy and cross-lingual transferability.
Findings
Fine-tuning improves Top-k retrieval accuracy across languages.
Legal-domain fine-tuning benefits low-resource languages most.
Enhancements transfer to unseen languages, indicating content-level improvements.
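The findings above are stated in terms of Top-k retrieval accuracy. As a rough illustration of that metric (not the paper's evaluation code), the sketch below counts a query as a hit when its gold document appears among the k most similar documents by cosine similarity. The function names, the toy vectors, and the one-gold-document-per-query setup are assumptions made for this example.

```python
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_accuracy(
    query_vecs: Sequence[Sequence[float]],
    doc_vecs: Sequence[Sequence[float]],
    gold: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose gold document index is among the k nearest documents.

    Assumes exactly one relevant document per query (a simplification).
    """
    if not query_vecs:
        raise ValueError("need at least one query")
    hits = 0
    for q, gold_idx in zip(query_vecs, gold):
        # Rank all document indices by descending similarity to the query.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data: three queries, three documents, gold[i] is the relevant doc for query i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
gold = [0, 1, 2]

acc_at_1 = top_k_accuracy(queries, docs, gold, k=1)
acc_at_3 = top_k_accuracy(queries, docs, gold, k=3)
```

In this toy setup the third query points away from its gold document, so it only counts as a hit once k covers the whole collection.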
Abstract
Large language models (LLMs) are increasingly used to access legal information. Yet their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both…
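The abstract mentions contrastive objectives but this excerpt does not say which loss is used. As a hedged illustration only, the sketch below computes an InfoNCE-style loss with in-batch negatives, a common choice for training embedding models; it should not be read as the loss used for LEMUR. The function name, the temperature value, and the toy vectors are assumptions.

```python
import math
from typing import Sequence


def info_nce_loss(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss where positives[i] matches anchors[i] and every
    other positive in the batch serves as a negative for anchors[i]."""
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("anchors and positives must be non-empty and the same length")

    def normalize(v: Sequence[float]) -> list:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else list(v)

    a_norm = [normalize(v) for v in anchors]
    p_norm = [normalize(v) for v in positives]

    total = 0.0
    for i, a in enumerate(a_norm):
        # Scaled cosine similarities between anchor i and every positive in the batch.
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in p_norm]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a_norm)


# Well-separated pairs give a loss near zero; indistinguishable pairs give log(batch size).
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
collapsed = info_nce_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

The two toy calls show the extremes: the loss approaches zero when each anchor is close only to its own positive, and equals the log of the batch size when the embeddings cannot tell the pairs apart.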
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Law · Computational and Text Analysis Methods
