Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

Chanjun Park; Seonmin Koo; Seolhwa Lee; Jaehyung Seo; Sugyeong Eo; Hyeonseok Moon; Heuiseok Lim

arXiv:2306.14377 · cs.CL · June 27, 2023

TL;DR

This study investigates how data quality control methods affect grammatical error correction (GEC) models trained on synthetic versus real-world data. It finds that these methods help models trained on real data but hurt models trained on synthetic data.

Contribution

It provides the first thorough comparison of data quality control effects on models trained exclusively with synthetic data versus real-world data in GEC tasks.

Findings

01 Data quality control improves real-data trained models

02 Negative impact of data quality control on synthetic-data trained models

03 Highlights limitations of synthetic data in data-centric AI approaches

Abstract

The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to affect model performance positively. While data-centric AI based on synthetic data has recently attracted attention for its potential to improve performance, data-centric AI has long been validated exclusively on real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and models using synthetic data have not yet been thoroughly verified. Given these challenges, we ask the question: Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between…
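The abstract names two data quality control operations, noise injection and balanced data, without describing how they are carried out. The sketch below is only one possible reading and is not the authors' procedure: the adjacent-word swap used as "noise", the per-label downsampling used for "balancing", and the names `inject_noise`, `balance_by_label`, and the error-type labels are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def inject_noise(pairs, ratio, rng):
    """Swap one random pair of adjacent words in roughly `ratio` of the sources.

    `pairs` is a list of (source, target) sentence pairs. Targets are left
    untouched. The swap is an illustrative noise type, not the paper's.
    """
    noisy = []
    for source, target in pairs:
        words = source.split()
        if len(words) > 1 and rng.random() < ratio:
            i = rng.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
        noisy.append((" ".join(words), target))
    return noisy


def balance_by_label(examples, rng):
    """Downsample so every label keeps as many examples as the rarest label.

    `examples` is a list of (text, label) tuples, where the label is a
    hypothetical error-type tag.
    """
    buckets = defaultdict(list)
    for text, label in examples:
        buckets[label].append((text, label))
    smallest = min(len(bucket) for bucket in buckets.values())
    balanced = []
    for label in sorted(buckets):
        balanced.extend(rng.sample(buckets[label], smallest))
    return balanced
```

Passing an explicit `random.Random` instance keeps both operations reproducible, which matters when the same procedure is applied to a real-data and a synthetic-data training set for comparison.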

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Data Quality and Management