FGraDA: A Dataset and Benchmark for Fine-Grained Domain Adaptation in Machine Translation

Wenhao Zhu; Shujian Huang; Tong Pu; Pingxuan Huang; Xu Zhang; Jian Yu; Wei Chen; Yanfeng Wang; Jiajun Chen

arXiv:2012.15717·cs.CL·November 9, 2021·1 cites

TL;DR

This paper introduces FGraDA, a dataset and benchmark for fine-grained domain adaptation in machine translation. It focuses on sub-domains with limited resources and no in-domain training data, and highlights the challenges that remain.

Contribution

The paper presents a new dataset and benchmark for fine-grained domain adaptation in MT, emphasizing resource scarcity and heterogeneity in real-world scenarios.

Findings

01. Significant performance gaps remain in fine-grained domain adaptation.

02. Heterogeneous resources pose challenges for current MT models.

03. The dataset enables targeted evaluation of domain-specific translation issues.

Abstract

Previous research on adapting a general neural machine translation (NMT) model to a specific domain usually neglects the diversity of translation within the same domain, which is a core problem for domain adaptation in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., global warming or coronavirus, where there are usually extremely few resources due to the limited schedule. To motivate wider investigation of such scenarios, we present a real-world fine-grained domain adaptation task in machine translation (FGraDA). The FGraDA dataset consists of Chinese-English translation tasks for four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones. Each sub-domain is equipped with a development set and test set for evaluation purposes.…

Code & Models

Repositories

owennju/fgrada
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications