Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction
Adnan Labib, Qiao Wang, Yixuan Huang, Zheng Yuan

TL;DR
This paper evaluates the performance of recent Large Language Models in grammatical error correction, analyzing their strengths, limitations, and the adequacy of current evaluation metrics, and provides insights for educational applications.
Contribution
It introduces a comprehensive multi-dimensional evaluation of LLMs for GEC, demonstrating that fine-tuned GPT-4o achieves state-of-the-art performance and highlighting the limitations of reference-based metrics.
Findings
Fine-tuned GPT-4o achieves state-of-the-art performance in GEC across edit precision, fluency preservation, and meaning retention.
Individual LLMs show highly similar error correction patterns.
Reference-based metrics underestimate GEC system performance.
Abstract
Automated assistants for Grammatical Error Correction are now embedded in educational platforms serving millions of learners, yet three critical gaps remain in this domain: (1) latest-generation Large Language Models (LLMs) lack comprehensive evaluation on grammar correction tasks; (2) whether combining these LLMs improves correction quality is unexplored; and (3) the extent to which reference-based metrics underestimate GEC system performance has not been adequately quantified. In this study, first, we evaluate latest-generation LLMs on edit precision, fluency preservation, and meaning retention, showing fine-tuned GPT-4o achieves state-of-the-art performance across all three dimensions. Second, through grammatical error type analysis we demonstrate that individual LLMs exhibit highly similar error correction patterns. Third, we show that reference-based metrics…
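To make the point about reference-based metrics concrete, here is a minimal sketch of how an edit-level score can penalize a correction that is valid but differs from the single gold reference. It assumes edits are represented as (start token, end token, replacement) tuples and scored with F0.5, a common convention in GEC evaluation; the abstract does not say which metric or edit representation the authors use, and the example sentence, the edits, and the `f_beta` helper are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical edit representation: (start token index, end token index, replacement).
Edit = Tuple[int, int, str]


def f_beta(hyp: Set[Edit], ref: Set[Edit], beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F_beta) of hypothesis edits against one reference."""
    tp = len(hyp & ref)   # edits that match the reference exactly
    fp = len(hyp - ref)   # system edits absent from the reference
    fn = len(ref - hyp)   # reference edits the system did not produce
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = beta ** 2 * precision + recall
    score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, score


# Source tokens: He(0) have(1) many(2) informations(3) .(4)
# Single gold reference: "He has much information ."
reference_edits: Set[Edit] = {(1, 2, "has"), (2, 4, "much information")}

# System output: "He has a lot of information ." (also acceptable English)
system_edits: Set[Edit] = {(1, 2, "has"), (2, 4, "a lot of information")}

p, r, f = f_beta(system_edits, reference_edits)
print(f"precision={p:.2f} recall={r:.2f} F0.5={f:.2f}")
```

In this toy case the system's second edit is counted as both a false positive and a missed reference edit, so precision, recall, and F0.5 all drop to 0.5 even though the output sentence is grammatical. Adding more references or using human judgments are typical ways to reduce this kind of underestimation.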
