RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
Mircea Timpuriu, Mihaela-Claudia Cercel, Dumitru-Clementin Cercel

TL;DR
This paper introduces RoLegalGEC, a novel Romanian legal domain dataset with 350,000 annotated error examples, and evaluates neural models for grammatical error detection and correction in legal texts.
Contribution
It provides the first Romanian legal domain parallel dataset for grammatical error detection and correction, along with an evaluation of multiple neural network models.
Findings
Neural models effectively detect and correct legal grammatical errors.
The dataset enriches resources for Romanian NLP research.
Transformers show promising results in legal text correction.
Abstract
The importance of clear and correct text in legal documents cannot be overstated. Consequently, a grammatical error correction tool meant to assist a legal professional must understand the possible errors in the context of a legal environment and correct them accordingly, which implies that it must be trained in the same environment, on realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, let alone for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of…
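
To make the idea of synthetically generated parallel data concrete, here is a minimal sketch of rule-based error injection. It is not the authors' pipeline: the single rule used (dropping Romanian diacritics), the function names, and the `rate` parameter are all assumptions chosen for illustration. A realistic generator would need many grammar-aware rules, which is the difficulty the abstract points to.

```python
import random

# Hypothetical corruption rule: replace Romanian diacritics with ASCII look-alikes.
DIACRITIC_MAP = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")


def strip_diacritics(word: str) -> str:
    """Return the word with Romanian diacritics removed."""
    return word.translate(DIACRITIC_MAP)


def corrupt_sentence(sentence: str, rng: random.Random, rate: float = 0.3):
    """Build one synthetic parallel example from a correct sentence.

    Returns (corrupted, correct, tags), where tags[i] is 1 if token i was
    corrupted (a detection label) and 0 otherwise.
    """
    tokens = sentence.split()
    corrupted, tags = [], []
    for tok in tokens:
        noisy = strip_diacritics(tok)
        # Only tokens that contain a diacritic can be corrupted by this rule.
        if noisy != tok and rng.random() < rate:
            corrupted.append(noisy)
            tags.append(1)
        else:
            corrupted.append(tok)
            tags.append(0)
    return " ".join(corrupted), " ".join(tokens), tags


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    print(corrupt_sentence("Instanța admite cererea.", rng, rate=1.0))
```

The corrupted/correct pair can serve as a correction example, while the per-token tags can serve as detection labels for the same sentence.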
