ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Jeiyoon Park; Chanjun Park; Heuiseok Lim

arXiv:2406.03202 · cs.CL · June 12, 2024 · 1 citation

Open Access

TL;DR

This paper presents an automated LLM-based framework for generating synthetic data for grammatical error correction (GEC), together with the ChatLang-8 dataset. The authors report a more uniform pattern composition than existing GEC datasets and improved model performance when training on ChatLang-8.

Contribution

The paper introduces an automated framework for generating diverse GEC data with LLMs, composed of a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator, and a new GEC dataset, ChatLang-8.

Findings

01  ChatLang-8 consists of 1 million pairs featuring human-like grammatical errors.

02  Models trained on ChatLang-8 perform better than those trained on existing GEC datasets.

03  ChatLang-8 exhibits a more uniform pattern composition than existing GEC datasets.
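
The uniformity finding can be made concrete with a small sketch. This page does not state how the paper measures pattern composition, so the metric below (Shannon entropy normalized to [0, 1]) is an assumption chosen for illustration, and the function name `normalized_entropy` and the label strings are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(labels: Iterable[str]) -> float:
    """Return Shannon entropy of the label distribution, scaled to [0, 1].

    1.0 means every observed pattern type is equally frequent;
    values near 0 mean a few types dominate the corpus.
    Note: this normalizes by the number of *observed* types only.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        # Zero or one distinct type: no diversity to measure.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical error-type labels for two toy corpora.
balanced = ["verb_tense", "article", "preposition", "agreement"] * 25
skewed = ["article"] * 97 + ["verb_tense", "preposition", "agreement"]

print(normalized_entropy(balanced))
print(normalized_entropy(skewed))
```

Under this assumed metric, a corpus whose error types are evenly spread scores close to 1, while one dominated by a single type scores much lower.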

Abstract

We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing…
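
The four components named in the abstract can be pictured as a simple generation loop. The sketch below is not the authors' implementation: the abstract gives only the component names and the counts (eight subject types, 23 grammar types), so the category lists, the prompt wording, the `|||` output format, the filtering rule, and every function name here are hypothetical, and the LLM is replaced by a stub.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical placeholders: the actual categories are not listed on this page.
SUBJECTS = ["subject_type_1", "subject_type_2"]
GRAMMAR_TYPES = ["grammar_type_1", "grammar_type_2", "grammar_type_3"]


@dataclass
class Pair:
    source: str  # sentence containing a grammatical error
    target: str  # corrected sentence


def select_subject(rng: random.Random) -> str:
    """Subject Selector stand-in: uniform sampling (an assumption)."""
    return rng.choice(SUBJECTS)


def select_grammar(rng: random.Random) -> str:
    """Grammar Selector stand-in: uniform sampling (an assumption)."""
    return rng.choice(GRAMMAR_TYPES)


def build_prompt(subject: str, grammar: str) -> str:
    """Prompt Manager stand-in: the wording is invented for illustration."""
    return (
        f"Write one sentence whose subject is a {subject} and that contains "
        f"a {grammar} error, then its correction, separated by ' ||| '."
    )


def evaluate(raw: str) -> Optional[Pair]:
    """Evaluator stand-in: keep only well-formed pairs whose sides differ."""
    parts = [p.strip() for p in raw.split("|||")]
    if len(parts) != 2 or not all(parts) or parts[0] == parts[1]:
        return None
    return Pair(source=parts[0], target=parts[1])


def generate(llm: Callable[[str], str], n: int, seed: int = 0,
             max_attempts: int = 1000) -> List[Pair]:
    """Collect up to n accepted pairs, giving up after max_attempts calls."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(max_attempts):
        if len(pairs) >= n:
            break
        prompt = build_prompt(select_subject(rng), select_grammar(rng))
        pair = evaluate(llm(prompt))
        if pair is not None:
            pairs.append(pair)
    return pairs


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "She go to school. ||| She goes to school."


pairs = generate(fake_llm, n=3)
print(len(pairs), pairs[0])
```

Sampling the subject and grammar type before prompting is one plausible way to steer generation away from the overly simple patterns the abstract describes; a real Evaluator would likely apply stricter quality checks than the well-formedness test used here.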

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling