BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko

TL;DR
BOUQuET is a comprehensive multilingual dataset and benchmark designed to improve translation quality evaluation across diverse languages, domains, and register levels, supporting collaborative and crowd-sourced translation research.
Contribution
BOUQuET introduces a new multi-way, multi-domain, and multi-register dataset handcrafted in 8 non-English languages, facilitating more accurate and inclusive translation assessments.
Findings
Broader domain representation compared to existing datasets
Enables crowd-sourced extension for diverse languages
Simplifies translation tasks for non-experts
Abstract
BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages is representative of the most widely spoken ones and therefore has the potential to serve as a pivot language that will enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is especially suitable for crowd-sourced extension, for which we are launching a call aiming at collecting a multi-way parallel corpus…
Taxonomy
Topics: Natural Language Processing Techniques
