FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes

Dawid Wiśniewski, Zofia Rostek, Artur Nowakowski

arXiv:2405.11942·cs.CL·May 21, 2024

TL;DR

The paper introduces FAME-MT, a dataset of 11.2 million translations between European languages, each labeled as formal or informal, intended to give machine translation models better control over the formality of their output.

Contribution

It provides the largest dataset of formality annotations for machine translation, along with a proof-of-concept model to steer translation formality levels.

Findings

01. FAME-MT is the largest formality-annotated translation dataset.

02. A quality analysis indicates that the dataset is a reliable source of language register information.

03. A proof-of-concept model demonstrates controlled formality in translations.

Abstract

People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT -- a dataset consisting of 11.2 million translations between 15 European source languages and 8 European target languages, classified into formal and informal classes according to target sentence formality. This dataset can be used to fine-tune machine translation models to ensure a given formality level for each European target language considered. We describe the dataset creation procedure, the analysis of the dataset's quality showing that FAME-MT is a reliable source of language register information, and we present a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of…
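
The abstract says the formality labels can be used to fine-tune translation models, but this page does not describe the input format. One common approach, assumed here purely for illustration, is to prepend a control token to each source sentence so the model learns to associate the token with the register of the target. The token strings, the `LabeledPair` type, and the `tag_source` helper below are hypothetical, and the example sentences are made up rather than taken from FAME-MT.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical control tokens; the actual tokens used with FAME-MT are not stated here.
FORMALITY_TAGS = {"formal": "<formal>", "informal": "<informal>"}


@dataclass
class LabeledPair:
    """One translation pair with a formality label derived from the target sentence."""
    source: str
    target: str
    formality: str  # "formal" or "informal"


def tag_source(pair: LabeledPair) -> Tuple[str, str]:
    """Return (tagged_source, target) with the formality token prepended to the source."""
    if pair.formality not in FORMALITY_TAGS:
        raise ValueError(f"unknown formality label: {pair.formality!r}")
    return f"{FORMALITY_TAGS[pair.formality]} {pair.source}", pair.target


# Illustrative English-to-German pairs: the same source, two target registers.
pairs: List[LabeledPair] = [
    LabeledPair("How are you?", "Wie geht es Ihnen?", "formal"),
    LabeledPair("How are you?", "Wie geht es dir?", "informal"),
]

tagged = [tag_source(p) for p in pairs]
for src, tgt in tagged:
    print(f"{src}\t{tgt}")
```

At inference time, under the same assumption, the caller would prepend the desired token to the input sentence to request a formal or informal translation.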

Code & Models

Repositories

laniqo-public/fame-mt (Official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus