Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction
Chanjun Park, Seonmin Koo, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Hyeonseok Moon, Heuiseok Lim

TL;DR
This study investigates the effects of data quality control methods on grammatical error correction models trained with synthetic versus real-world data, revealing positive impacts with real data but negative impacts with synthetic data.
Contribution
The paper provides the first thorough comparison of how data quality control affects GEC models trained exclusively on synthetic data versus models trained on real-world data.
Findings
Data quality control improves models trained on real-world data
Data quality control degrades models trained solely on synthetic data
Highlights limitations of synthetic data in data-centric AI approaches
Abstract
The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to impact model performance positively. Although data-centric AI based on synthetic data has recently drawn attention for its potential to improve performance, data-centric AI has long been validated exclusively with real-world data and publicly available benchmark datasets. As a result, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been carried out thoroughly. Given these challenges, we ask the question: Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data? To address this question, we conducted comparative analyses between…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Data Quality and Management
