LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Narges Baba Ahmadi; Jan Strich; Martin Semmann; Chris Biemann

arXiv:2602.09570 · cs.CL · February 11, 2026

TL;DR

LEMUR is a large multilingual corpus of EU environmental legislation. Fine-tuning multilingual embedding models on it improves legal document retrieval across languages, with the largest gains for low-resource languages.

Contribution

The paper presents a new multilingual legal corpus, built from 24,953 official EUR-Lex PDF documents covering 25 languages, and shows that fine-tuning multilingual embedding models on it improves legal retrieval accuracy and cross-lingual transfer.

Findings

01. Fine-tuning improves Top-k retrieval accuracy across languages.

02. Legal-domain fine-tuning benefits low-resource languages most.

03. Improvements transfer to unseen languages, indicating content-level improvements.
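
For readers unfamiliar with the metric named in the first finding, the following is a minimal sketch of how Top-k retrieval accuracy is commonly computed from embeddings. It is not the paper's evaluation code: the function name `top_k_accuracy`, the use of dot-product similarity, and the toy vectors are all assumptions made for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    query_vecs: Sequence[Sequence[float]],
    doc_vecs: Sequence[Sequence[float]],
    gold_doc_ids: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose gold document is among the k highest-scoring documents.

    Hypothetical helper: similarity is a plain dot product, which equals
    cosine similarity only if the vectors are already unit-normalized.
    """
    hits = 0
    for query, gold in zip(query_vecs, gold_doc_ids):
        # Score every document against this query.
        scores = [sum(q * d for q, d in zip(query, doc)) for doc in doc_vecs]
        # Document indices ordered from most to least similar.
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (made up): three queries, three documents, one gold document per query.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.9]]
documents = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
gold = [0, 1, 0]

top1 = top_k_accuracy(queries, documents, gold, k=1)
top2 = top_k_accuracy(queries, documents, gold, k=2)
```

In this toy setup the third query ranks its gold document second, so it counts as a miss at k=1 and a hit at k=2.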

Abstract

Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both…
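
The abstract mentions contrastive objectives without specifying them in this excerpt. As a generic illustration only, the sketch below implements an InfoNCE-style loss with in-batch negatives, a common choice for fine-tuning embedding models. The function name `info_nce_loss`, the temperature value, and the toy vectors are assumptions; the paper's actual loss may differ.

```python
import math
from typing import List, Sequence


def info_nce_loss(
    queries: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss where positives[i] matches queries[i].

    All other positives in the batch act as negatives for query i.
    Hypothetical helper for illustration only.
    """

    def normalize(vec: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, q_i in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every candidate.
        logits = [sum(a * b for a, b in zip(q_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matching positive as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch (made up): correctly paired vs. deliberately swapped positives.
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own positive and grows large when the pairing is wrong, which is the signal that pulls matching texts together during fine-tuning.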

Code & Models

Datasets

G4KMU/LEMUR (dataset · 17k downloads)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Law · Computational and Text Analysis Methods