DaLC: Domain Adaptation Learning Curve Prediction for Neural Machine Translation
Cheonbok Park, Hantae Kim, Ioan Calapodescu, Hyunchang Cho, and Vassilina Nikoulina

TL;DR
This paper introduces DaLC, a model that predicts the benefits of domain adaptation in neural machine translation using only monolingual data, aiding resource investment decisions.
Contribution
It presents a novel approach to forecast domain adaptation performance from monolingual data, utilizing encoder representations and features at the instance level.
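The idea of predicting a learning curve from instance-level features can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' architecture: a ridge regressor stands in for the learned predictor, random vectors stand in for pooled NMT encoder representations, `score` is a hypothetical sentence-level quality metric, and the function names are invented. Each source sentence gets a feature vector (its representation plus the log of the adaptation sample size), and the corpus-level prediction is the mean of the per-sentence predictions.

```python
import numpy as np


def build_features(encoder_vecs, n_parallel):
    """Concatenate per-sentence vectors with log(1 + adaptation sample size).

    n_parallel may be a scalar (same size for every sentence) or a 1-D array
    with one size per sentence.
    """
    n = encoder_vecs.shape[0]
    sizes = np.broadcast_to(np.asarray(n_parallel, dtype=float), (n,))
    size_col = np.log1p(sizes).reshape(-1, 1)
    bias_col = np.ones((n, 1))
    return np.hstack([encoder_vecs, size_col, bias_col])


def fit_instance_regressor(encoder_vecs, n_parallel, scores, ridge=1e-3):
    """Fit ridge regression from instance-level features to sentence scores."""
    X = build_features(encoder_vecs, n_parallel)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ scores)


def predict_learning_curve(weights, encoder_vecs, sample_sizes):
    """Predict a corpus-level score for each candidate adaptation size.

    Only source-side (monolingual) vectors of the new domain are needed.
    """
    return [
        float(np.mean(build_features(encoder_vecs, n) @ weights))
        for n in sample_sizes
    ]


def demo():
    """Train on synthetic data, then predict a curve for unseen sentences."""
    rng = np.random.default_rng(0)
    train_vecs = rng.normal(size=(200, 8))  # stand-in for encoder outputs
    train_sizes = rng.choice([0, 1000, 10000, 100000], size=200)
    true_w = rng.normal(size=8)
    # Synthetic scores that grow with the log of the sample size.
    train_scores = train_vecs @ true_w + 0.5 * np.log1p(train_sizes)

    weights = fit_instance_regressor(train_vecs, train_sizes, train_scores)
    new_domain_vecs = rng.normal(size=(50, 8))  # monolingual in-domain sample
    return predict_learning_curve(weights, new_domain_vecs, [0, 1000, 10000])
```

Calling `demo()` returns one predicted corpus-level score per candidate sample size. In this toy setup the curve rises with the sample size because the synthetic scores were built that way. A real predictor would be trained on scores measured after actually adapting an NMT model on known domains.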
Findings
Instance-level features outperform corpus-level features in domain distinction.
DaLC can predict DA performance without parallel data.
Analysis reveals limitations and future directions for DA prediction.
Abstract
Domain Adaptation (DA) of a Neural Machine Translation (NMT) model often relies on a pre-trained general NMT model, which is adapted to the new domain on a sample of in-domain parallel data. Without parallel data, there is no way to estimate the potential benefit of DA, nor the number of parallel samples it would require. Such an estimate is, however, a desirable functionality that could help MT practitioners make an informed decision before investing resources in dataset creation. We propose a Domain adaptation Learning Curve prediction (DaLC) model that predicts prospective DA performance based on in-domain monolingual samples in the source language. Our model relies on the NMT encoder representations combined with various instance- and corpus-level features. We demonstrate that the instance-level framework is better able to distinguish between different domains compared to corpus-level frameworks proposed in…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Domain Adaptation and Few-Shot Learning
