Unsupervised Clustering of Commercial Domains for Adaptive Machine Translation
Mauro Cettolo, Mara Chinea Rios, Roldano Cattoni

TL;DR
This paper studies domain clustering for adaptive machine translation. Of the five distances compared, only the most expensive one lets the MT engine keep good performance when the 40 commercial domains are grouped into few, highly populated clusters.
Contribution
It compares five distances within a standard bottom-up hierarchical clustering algorithm, on an MT benchmark built on 40 commercial domains, and identifies the one that best preserves translation quality.
Findings
The most expensive distance is the only one that lets the MT engine maintain good performance with few, highly populated clusters.
Hierarchical clustering effectively groups commercial domains for adaptive MT.
The distances are compared through dendrograms and through intrinsic and extrinsic evaluations, which favor the most expensive distance.
Abstract
In this paper, we report on domain clustering in the ambit of an adaptive MT architecture. A standard bottom-up hierarchical clustering algorithm has been instantiated with five different distances, which have been compared, on an MT benchmark built on 40 commercial domains, in terms of dendrograms, intrinsic and extrinsic evaluations. The main outcome is that the most expensive distance is also the only one able to allow the MT engine to guarantee good performance even with few, but highly populated clusters of domains.
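The abstract describes a standard bottom-up hierarchical clustering algorithm instantiated with different distances, but this summary does not say which five distances the authors used. The following is a minimal sketch of that general setup, not the authors' implementation: cosine distance and Jensen-Shannon divergence over unigram counts are stand-ins chosen for illustration, and the average-linkage rule, the toy domain names (`legal_a`, `tech_a`, and so on), and all function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def js_divergence(a, b):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    total_a, total_b = sum(a.values()), sum(b.values())
    div = 0.0
    for w in a.keys() | b.keys():
        p, q = a[w] / total_a, b[w] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div


def agglomerative(domains, distance, n_clusters):
    """Bottom-up clustering with average linkage (an assumed linkage rule).

    domains: dict mapping a domain name to a Counter of word counts.
    distance: any function taking two Counters and returning a float.
    Returns a list of sets of domain names.
    """
    names = list(domains)
    # Pairwise domain distances are computed once and reused for every merge.
    pair_dist = {
        frozenset((x, y)): distance(domains[x], domains[y])
        for x, y in combinations(names, 2)
    }
    clusters = [{n} for n in names]  # start with one cluster per domain

    def linkage(c1, c2):
        total = sum(pair_dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters under the chosen distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy stand-in for the benchmark: four tiny hypothetical "domains".
toy_domains = {
    "legal_a": Counter("the contract party shall agree contract clause".split()),
    "legal_b": Counter("the party shall sign the clause agreement".split()),
    "tech_a": Counter("install the driver and restart the device".split()),
    "tech_b": Counter("restart the device to update the driver".split()),
}

for dist in (cosine_distance, js_divergence):
    print(dist.__name__, agglomerative(toy_domains, dist, n_clusters=2))
```

Because the distance is passed in as a function, swapping it changes only the merge order, which is the kind of comparison the abstract describes. Lowering `n_clusters` yields fewer, more populated clusters.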
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis
