Unsupervised Clustering of Commercial Domains for Adaptive Machine Translation

Mauro Cettolo; Mara Chinea Rios; Roldano Cattoni

arXiv:1612.04683 · cs.CL · December 15, 2016

TL;DR

This paper studies domain clustering for adaptive machine translation and finds that, of the five distance measures compared, only the most expensive one lets the MT engine keep good performance with few, highly populated domain clusters.

Contribution

It compares five distance measures within a standard bottom-up hierarchical clustering algorithm, on an MT benchmark built on 40 commercial domains, and identifies the one that preserves translation quality when domains are merged into few clusters.

Findings

01

The most expensive distance is the only one that lets the MT engine maintain good performance with few, highly populated clusters.

02

Hierarchical clustering effectively groups commercial domains for adaptive MT.

03

Intrinsic and extrinsic evaluations confirm the superiority of the selected distance.

Abstract

In this paper, we report on domain clustering in the ambit of an adaptive MT architecture. A standard bottom-up hierarchical clustering algorithm has been instantiated with five different distances, which have been compared, on an MT benchmark built on 40 commercial domains, in terms of dendrograms, intrinsic and extrinsic evaluations. The main outcome is that the most expensive distance is also the only one able to allow the MT engine to guarantee good performance even with few, but highly populated clusters of domains.
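
The bottom-up hierarchical clustering mentioned in the abstract can be illustrated with a minimal sketch. It assumes a precomputed, symmetric matrix of pairwise domain distances and uses average linkage; both the linkage choice and the toy numbers are assumptions for illustration, not the paper's five distances or its data. The function name `agglomerative_clusters` is hypothetical.

```python
from itertools import combinations


def agglomerative_clusters(dist, n_clusters):
    """Bottom-up clustering over a precomputed symmetric distance matrix.

    dist: list of lists, dist[i][j] is the distance between domains i and j.
    n_clusters: number of clusters at which merging stops (assumed >= 1).
    Average linkage is an assumption made for this sketch.
    """
    # Start with every domain in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the closest pair of clusters and merge it.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up distances between four hypothetical domains.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
result = agglomerative_clusters(toy, 2)
print(result)
```

Stopping the merge loop at different values of `n_clusters` corresponds to cutting the dendrogram at different heights, which is how a trade-off between few large clusters and many small ones would be explored in a setup like this.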

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis