The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, Daniel Varga

TL;DR
The JRC-Acquis is a freely available, paragraph-aligned parallel corpus of mostly legal EU documents. It covers the 20 official EU languages plus languages of EU candidate countries, and is intended for cross-language research, multi-label classification, and benchmarking of language processing tools.
Contribution
It provides a large, freely available parallel corpus with alignment and classification data across many languages, enabling advanced multilingual NLP research and tool evaluation.
Findings
Contains nearly 8,000 documents per language
Includes alignment data for over 190 language pairs
Supports multi-label classification and keyword assignment
Abstract
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the…
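The "190+" figure in the abstract is consistent with simple counting: 20 languages yield 20 × 19 / 2 = 190 unordered language pairs, and every additional candidate-country language adds further pairs. A minimal sketch of that arithmetic follows; the language codes are hypothetical placeholders, not the corpus's actual language list, and `language_pairs` is a helper name introduced here for illustration.

```python
from itertools import combinations
from math import comb


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


# Hypothetical placeholder codes standing in for the 20 official EU languages.
languages = [f"lang{i:02d}" for i in range(1, 21)]

pairs = language_pairs(languages)

# The enumerated pairs agree with the binomial coefficient C(20, 2).
assert len(pairs) == comb(len(languages), 2)
print(len(pairs))  # 190
```

Adding one more language to a set of n languages adds n new pairs, which is why the number of pair combinations grows quickly once candidate-country languages are included.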
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
