CoCoA-MT: A Dataset and Benchmark for Contrastive Controlled MT with Application to Formality
Maria Nădejde, Anna Currey, Benjamin Hsu, Xing Niu, Marcello Federico, Georgiana Dinu

TL;DR
This paper introduces CoCoA-MT, a dataset and benchmark for training machine translation models that can control the formality level of translations, addressing the need for contextually appropriate language use.
Contribution
The work presents a new annotated dataset and evaluation metric for formality-controlled MT, demonstrating effective fine-tuning methods with high accuracy across multiple languages.
Findings
Achieved 82% in-domain and 73% out-of-domain accuracy in formality control.
Maintained overall translation quality while controlling formality.
Provided a benchmark for future research in attribute-controlled machine translation.
Abstract
The machine translation (MT) task is typically formulated as that of returning a single translation for an input segment. However, in many cases, multiple different translations are valid and the appropriate translation may depend on the intended target audience, characteristics of the speaker, or even the relationship between speakers. Specific problems arise when dealing with honorifics, particularly translating from English into languages with formality markers. For example, the sentence "Are you sure?" can be translated in German as "Sind Sie sich sicher?" (formal register) or "Bist du dir sicher?" (informal). Using wrong or inconsistent tone may be perceived as inappropriate or jarring for users of certain cultures and demographics. This work addresses the problem of learning to control target language attributes, in this case formality, from a small amount of labeled contrastive…
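As a rough illustration of what "labeled contrastive" data could look like, the sketch below stores the German example from the abstract as one source segment with a formal and an informal reference. The abstract excerpt is truncated and does not say how the model is conditioned, so the control token prepended to the source (`<formal>` / `<informal>`) is an assumption borrowed from common practice in controlled MT, not a confirmed detail of the paper. The names `ContrastivePair`, `tag_source`, `training_examples`, and `control_accuracy` are hypothetical, and the accuracy function is a simplified stand-in rather than the benchmark's own metric.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ContrastivePair:
    """One source segment with a formal and an informal reference translation."""
    source: str
    formal: str
    informal: str


def tag_source(source: str, formality: str) -> str:
    """Prepend a hypothetical control token such as '<formal>' to the source."""
    if formality not in ("formal", "informal"):
        raise ValueError(f"unknown formality: {formality!r}")
    return f"<{formality}> {source}"


def training_examples(pair: ContrastivePair) -> List[Tuple[str, str]]:
    """Expand one contrastive pair into two tagged (source, target) examples."""
    return [
        (tag_source(pair.source, "formal"), pair.formal),
        (tag_source(pair.source, "informal"), pair.informal),
    ]


def control_accuracy(predicted: List[str], desired: List[str]) -> float:
    """Fraction of outputs whose formality label matches the requested label.

    Assumes the predicted labels come from some external source (annotators
    or a classifier); this is a simplification, not the paper's metric.
    """
    if len(predicted) != len(desired):
        raise ValueError("predicted and desired must have the same length")
    if not desired:
        return 0.0
    matches = sum(p == d for p, d in zip(predicted, desired))
    return matches / len(desired)


# The example from the abstract, expressed as one contrastive pair.
pair = ContrastivePair(
    source="Are you sure?",
    formal="Sind Sie sich sicher?",
    informal="Bist du dir sicher?",
)
examples = training_examples(pair)
```

Each contrastive pair yields two training examples that differ only in the requested formality and the matching reference, which is what lets a fine-tuned model associate the control signal with the register of the output.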
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
