MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer
Ilias Chalkidis, Manos Fergadiotis, Ion Androutsopoulos

TL;DR
This paper introduces MULTI-EURLEX, a multilingual dataset of 65k EU laws for multi-label topic classification, and uses it to study zero-shot cross-lingual transfer. It shows that fine-tuning in a single source language causes catastrophic forgetting of multilingual knowledge, and it evaluates adaptation strategies that improve transfer to other languages.
Contribution
The paper presents MULTI-EURLEX, a new multilingual legal dataset, and evaluates strategies to enhance zero-shot cross-lingual transfer in legal document classification.
Findings
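Fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language causes catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer.
Adaptation strategies (partial fine-tuning, adapters, BITFIT, LNFIT) improve zero-shot transfer performance.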
Model choice and label set size influence transfer success.
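The adaptation strategies named in the abstract share one idea: update only a small subset of the pretrained parameters and freeze the rest. The sketch below illustrates that idea by selecting parameters by name. It rests on assumptions not stated on this page: BITFIT is taken to mean training only bias terms, LNFIT only layer-normalization parameters, and the classification head is assumed to be always trained. The parameter names and the `trainable_names` helper are hypothetical and do not reproduce the authors' code.

```python
# Hypothetical parameter names, loosely modelled on a Transformer encoder.
PARAM_NAMES = [
    "embeddings.word_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.attention.query.bias",
    "encoder.layer.0.output.dense.weight",
    "encoder.layer.0.output.dense.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "encoder.layer.0.output.LayerNorm.bias",
    "classifier.weight",
    "classifier.bias",
]

STRATEGIES = ("full", "bitfit", "lnfit")


def trainable_names(names, strategy):
    """Return the parameter names to update; all others stay frozen."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    def keep(name):
        if name.startswith("classifier."):
            return True  # assumption: the task head is always trained
        if strategy == "full":
            return True  # standard end-to-end fine-tuning
        if strategy == "bitfit":
            return name.endswith(".bias")  # assumption: bias terms only
        return "LayerNorm" in name  # "lnfit": layer-norm parameters only

    return [n for n in names if keep(n)]


for strategy in STRATEGIES:
    selected = trainable_names(PARAM_NAMES, strategy)
    print(strategy, len(selected), "of", len(PARAM_NAMES), "tensors trainable")
```

In a real framework the selected names would be used to switch gradient updates on or off per parameter; keeping most pretrained weights fixed is one plausible reason such strategies retain more multilingual knowledge than full fine-tuning.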
Abstract
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BITFIT, LNFIT, originally proposed to accelerate…
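The abstract's point about chronological versus random splits can be made concrete with a small sketch. Under temporal concept drift, a random split lets the training set contain documents newer than some test documents, which can make evaluation look easier than real deployment. The document records, the `date` field, and the 80/20 ratio below are illustrative assumptions, not the dataset's actual schema or split.

```python
import random
from datetime import date


def chronological_split(docs, train_frac=0.8):
    """Train on the oldest documents, test on the newest."""
    ordered = sorted(docs, key=lambda d: d["date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def random_split(docs, train_frac=0.8, seed=0):
    """Shuffle before cutting, ignoring publication dates."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Toy corpus: one hypothetical document per year.
docs = [{"id": i, "date": date(1990 + i, 1, 1)} for i in range(10)]

train, test = chronological_split(docs)
# Every training document predates every test document.
print(max(d["date"] for d in train) < min(d["date"] for d in test))

rand_train, rand_test = random_split(docs)
# With a random split this ordering is not guaranteed.
print(max(d["date"] for d in rand_train) < min(d["date"] for d in rand_test))
```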
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning in Healthcare
