The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT

Jörg Tiedemann

arXiv:2010.06354·cs.CL·October 14, 2020

TL;DR

This paper introduces a comprehensive benchmark dataset for machine translation covering over 500 languages and thousands of language pairs, aiming to promote development of inclusive translation models for low-resource and multilingual scenarios.

Contribution

The paper provides a large, diverse collection of data sets with systematic language and script annotation, along with baseline models, to advance low-resource and multilingual machine translation research.

Findings

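01. First comprehensive collection of data sets in hundreds of languages with systematic language and script annotation and data splits.

02. Enables experiments on realistic low-resource scenarios rather than artificially reduced setups.

03. Provides a growing number of pre-trained baseline models alongside the data release.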

Abstract

This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages and tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages. Using the package it is possible to work on realistic low-resource scenarios avoiding artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages with systematic language and script annotation and data splits to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline…

Code & Models

Repositories

Helsinki-NLP/Tatoeba-Challenge
Official

Models

cenfis/turemb_512 (Hugging Face model, 5 downloads, 3 likes)
