CoCoA-MT: A Dataset and Benchmark for Contrastive Controlled MT with Application to Formality

Maria Nădejde; Anna Currey; Benjamin Hsu; Xing Niu; Marcello Federico; Georgiana Dinu

arXiv:2205.04022 · cs.CL · May 10, 2022

TL;DR

This paper introduces CoCoA-MT, a dataset and benchmark for training and evaluating machine translation models that control the formality of their output, so that translations use the register appropriate to the audience and context.

Contribution

The work presents a new annotated dataset and an evaluation metric for formality-controlled MT, and shows that fine-tuning yields high formality-control accuracy across multiple languages.

Findings

01. Achieved 82% in-domain and 73% out-of-domain accuracy in formality control.

02. Maintained overall translation quality while controlling formality.

03. Provided a benchmark for future research in attribute-controlled machine translation.

Abstract

The machine translation (MT) task is typically formulated as that of returning a single translation for an input segment. However, in many cases, multiple different translations are valid and the appropriate translation may depend on the intended target audience, characteristics of the speaker, or even the relationship between speakers. Specific problems arise when dealing with honorifics, particularly translating from English into languages with formality markers. For example, the sentence "Are you sure?" can be translated in German as "Sind Sie sich sicher?" (formal register) or "Bist du dir sicher?" (informal). Using wrong or inconsistent tone may be perceived as inappropriate or jarring for users of certain cultures and demographics. This work addresses the problem of learning to control target language attributes, in this case formality, from a small amount of labeled contrastive…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification