Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset

Rahul Nihalani; Kushal Shah

arXiv:2411.15523 · cs.CL · November 26, 2024 · 2 cites

TL;DR

This paper demonstrates that fine-tuning BERT models on a rigorously cleaned Lang-8 dataset significantly improves grammatical error detection performance, achieving an F1 score of 0.91, highlighting the importance of data quality over model size.

Contribution

The study shows that data cleaning combined with transformer models like BERT can outperform larger models and previous approaches in grammatical error detection.

Findings

01. BERT-base-uncased achieved an F1 score of 0.91.

02. Data cleaning was crucial for performance gains.

03. Larger models like BERT-large did not improve results.

Abstract

This paper presents an improved LLM-based model for Grammatical Error Detection (GED), a challenging problem that matters for many applications. The traditional approach to GED relied on hand-designed features, but more recently Neural Networks (NN) have automated the discovery of these features and improved GED performance. Traditional rule-based systems have F1 scores of 0.50-0.60, and earlier machine learning models, including decision trees and simple neural networks, have F1 scores of 0.65-0.75. Previous deep learning models, for example Bi-LSTM, have reported F1 scores in the range of 0.80 to 0.90. In our study, we fine-tuned various transformer models on a version of the Lang-8 dataset that we rigorously cleaned. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and…
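
The abstract quotes both F1 and accuracy, which can diverge when one class dominates. The following is a minimal Python sketch of how the two metrics are computed from binary predictions. The function name `binary_metrics`, the label convention (1 = sentence contains an error), and the toy labels are illustrative assumptions, not taken from the paper or its repository.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for binary labels.

    Assumption: `positive` marks the class of interest (here, hypothetically,
    1 = the sentence contains a grammatical error).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy, made-up labels (not data from the paper): 3 erroneous sentences out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the classifier gets 8 of 10 sentences right, yet precision and recall on the error class are each only 2/3, so F1 is lower than accuracy. This is one reason GED results are usually compared on F1 rather than accuracy alone.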

Code & Models

Repositories

rahuln2002/Grammatical-Error-Detection-GED
PyTorch · Official

Datasets

rahuln2002/GED-lang8-cleaned
dataset · 24 dl

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning