QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

Sugyeong Eo; Chanjun Park; Hyeonseok Moon; Jaehyung Seo; Gyeongmin Kim; Jungseob Lee; Heuiseok Lim

arXiv:2209.15285·cs.CL·November 30, 2022

TL;DR

This paper introduces QUAK, a large-scale synthetic Korean-English QE dataset generated fully automatically to overcome the limitations of manual data creation, enabling scalable quality estimation research for neural machine translation.

Contribution

The paper presents a fully automatic method to generate large-scale Korean-English QE datasets, significantly reducing manual effort and enabling scalable quality estimation research.

Findings

01. QE performance improves as the scale of the synthetic data increases.

02. Scaling the QUAK datasets up to 1.58 million samples improves QE results.

03. Automatic data generation reduces costs and facilitates language expansion.

Abstract

With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, manual QE data creation has several limitations: the non-trivial costs inevitably incurred by the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. It consists of three sub-datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification