Translation of Multifaceted Data without Re-Training of Machine Translation Systems

Hyeonseok Moon; Seungyoon Lee; Seongtae Hong; Seungjun Lee; Chanjun Park; Heuiseok Lim

arXiv:2404.16257·cs.CL·December 3, 2024

TL;DR

This paper introduces a novel machine translation pipeline that preserves intra-data relations by concatenating data components, improving translation quality and downstream task performance without re-training existing systems.

Contribution

It proposes a new MT approach that uses a Catalyst Statement and an Indicator Token to preserve intra-data relations, improving both translation quality and the effectiveness of the translated data as training data.

Findings

1. Improved translation quality over the conventional practice of translating each component separately.

2. Enhanced downstream task performance on web page ranking (WPR) and question generation (QG).

3. No re-training of existing MT systems required.

Abstract

Translating major-language resources to build minor-language resources has become a widely used approach. When a data point is composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation when applying MT to training data. In our MT pipeline, all the components of a data point are concatenated into a single translation sequence, which is decomposed back into the individual data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and an Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have…
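The concatenate, translate, and decompose steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the catalyst sentence, the "@" indicator token, the placement of the catalyst at the front of the sequence, and the function names build_sequence, decompose, and translate_data_point are all hypothetical. The MT system is passed in as an ordinary callable, reflecting the claim that no re-training is needed.

```python
from typing import Callable, List, Optional

# Hypothetical values chosen for illustration; this page does not give
# the actual catalyst statement or indicator token used in the paper.
CATALYST = "The following sentences are related to each other."
INDICATOR = "@"


def build_sequence(components: List[str],
                   catalyst: str = CATALYST,
                   indicator: str = INDICATOR) -> str:
    """Concatenate all components of one data point into a single sequence.

    Assumed layout: catalyst statement first, then each component
    preceded by the indicator token.
    """
    parts = [catalyst] + [f"{indicator} {c}" for c in components]
    return " ".join(parts)


def decompose(translated: str,
              n_components: int,
              indicator: str = INDICATOR) -> Optional[List[str]]:
    """Split a translated sequence back into its components.

    Returns None when the indicator tokens did not survive translation
    intact, so the caller can discard or retry the data point.
    """
    pieces = [p.strip() for p in translated.split(indicator)]
    recovered = pieces[1:]  # pieces[0] is the translated catalyst statement
    if len(recovered) != n_components or any(not c for c in recovered):
        return None
    return recovered


def translate_data_point(components: List[str],
                         translate: Callable[[str], str]) -> Optional[List[str]]:
    """Translate one multi-component data point with an off-the-shelf system."""
    sequence = build_sequence(components)
    return decompose(translate(sequence), len(components))
```

Here translate stands in for any existing MT system that maps a source string to a target string. One limitation of this sketch is that a component which itself contains the indicator character is rejected by decompose, so a real pipeline would need an indicator that does not occur in the data.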

Taxonomy

Topics: Natural Language Processing Techniques