TL;DR
This paper introduces a training-free approach that uses large language models to correct all three types of Chinese character errors (substituted, missing, and redundant characters) without any fine-tuning.
Contribution
The paper extends a training-free, prompt-free Chinese spelling correction (CSC) method to cover all three character error types, constructs a manually verified C2EC benchmark from the CCTC and Lemon datasets, and reports strong results with a 14B-parameter LLM.
Findings
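With a 14B-parameter LLM, the method performs comparably to much larger models on both conventional CSC and C2EC tasks.
Using Levenshtein distance lets the method handle corrections that change sentence length, such as missing and redundant characters.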
No fine-tuning is required for effective error correction.
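To make the role of Levenshtein distance concrete, the sketch below aligns an erroneous sentence with its correction and labels each edit. It is a generic textbook edit-distance routine, not the paper's implementation; the function name `levenshtein_ops`, the `ERROR_TYPE` mapping, and the example sentences are assumptions made for illustration only.

```python
# Illustrative only: a standard Levenshtein alignment, not the paper's code.
# Maps each edit operation onto the three character error types.
ERROR_TYPE = {
    "substitute": "substitution error",
    "delete": "redundant character",   # extra character in the source
    "insert": "missing character",     # character absent from the source
}


def levenshtein_ops(src: str, tgt: str):
    """Return (distance, ops) for turning src into tgt.

    ops is a list of (kind, src_char, tgt_char) tuples, where kind is
    one of "keep", "substitute", "delete", "insert".
    """
    m, n = len(src), len(tgt)
    # dp[i][j] = edit distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # keep or substitute
                dp[i - 1][j] + 1,         # delete from src
                dp[i][j - 1] + 1,         # insert into src
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                kind = "keep" if cost == 0 else "substitute"
                ops.append((kind, src[i - 1], tgt[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", tgt[j - 1]))
            j -= 1
    ops.reverse()
    return dp[m][n], ops


if __name__ == "__main__":
    examples = [
        ("我喜欢吃平果", "我喜欢吃苹果"),    # substituted character
        ("我喜欢吃苹", "我喜欢吃苹果"),      # missing character
        ("我喜欢欢吃苹果", "我喜欢吃苹果"),  # redundant character
    ]
    for src, tgt in examples:
        dist, ops = levenshtein_ops(src, tgt)
        errors = [(ERROR_TYPE[k], s, t) for k, s, t in ops if k != "keep"]
        print(src, "->", tgt, "| distance:", dist, "| errors:", errors)
```

Unlike a position-by-position comparison, which only works when the two sentences have equal length, the alignment treats insertions and deletions as ordinary edits, so all three error types fall out of a single procedure.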
Abstract
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an…
