A New Tool for Efficiently Generating Quality Estimation Datasets

Sugyeong Eo; Chanjun Park; Jaehyung Seo; Hyeonseok Moon; Heuiseok Lim

arXiv:2111.00767 · cs.CL · November 2, 2021 · 1 citation

Open Access

TL;DR

This paper introduces an automatic tool for generating quality estimation datasets using monolingual or parallel corpora, reducing manual effort and enabling data augmentation for improved QE performance.

Contribution

The paper presents a novel fully automatic pseudo-QE dataset generation tool that leverages monolingual and parallel corpora, facilitating inexpensive and scalable QE dataset creation.

Findings

01 Enhanced QE performance through data augmentation

02 Applicability across multiple language pairs

03 Public release of the dataset generation tool

Abstract

Building training data for quality estimation (QE) is expensive and requires significant human labor. In this study, we take a data-centric approach to QE and propose a fully automatic pseudo-QE dataset generation tool that produces QE datasets from only a monolingual or parallel corpus as input. As a result, QE performance can be improved through data augmentation, and QE can be applied to additional language pairs. We intend to publicly release this user-friendly tool, which we believe offers the community a new, inexpensive way to develop QE datasets.
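The abstract does not describe how the tool assigns labels. As a rough illustration only, one common way to derive pseudo-QE labels from a parallel corpus is to compare a machine translation of the source sentence against the reference translation and treat the mismatches as errors. The sketch below assumes the MT output is already available as a string; the function name `pseudo_qe_labels`, the whitespace tokenization, and the TER-like score are hypothetical choices for this example, not the authors' implementation. It uses Python's standard `difflib` alignment, which approximates but does not guarantee a minimal edit distance.

```python
from difflib import SequenceMatcher


def pseudo_qe_labels(mt: str, reference: str):
    """Label one MT hypothesis against its reference translation.

    Returns (tags, score):
      tags  - one "OK" or "BAD" tag per MT token
      score - TER-like ratio of edited tokens to reference length, capped at 1.0
    Illustrative assumption: whitespace tokenization, no shift operations.
    """
    mt_tokens = mt.split()
    ref_tokens = reference.split()

    tags = ["BAD"] * len(mt_tokens)
    edits = 0

    # Align the two token sequences; autojunk is disabled so that
    # frequent tokens are not silently ignored on long inputs.
    matcher = SequenceMatcher(a=mt_tokens, b=ref_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            for i in range(i1, i2):
                tags[i] = "OK"
        else:
            # Count a replace/insert/delete block by its longer side.
            edits += max(i2 - i1, j2 - j1)

    score = min(1.0, edits / max(len(ref_tokens), 1))
    return tags, score


if __name__ == "__main__":
    tags, score = pseudo_qe_labels("a black cat", "a white cat")
    print(tags, round(score, 3))
```

In this sketch a token missing from the MT output raises the sentence-level score but produces no word-level tag, since tags are attached only to MT tokens; a fuller labeling scheme would also need gap tags for such omissions.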

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies