C-LLM: Learn to Check Chinese Spelling Errors Character by Character
Kunting Li, Yong Hu, Liang He, Fandong Meng, Jie Zhou

TL;DR
This paper introduces C-LLM, a Chinese spell checking approach that learns to check errors character by character. It outperforms existing methods, with the largest gains in domain-specific contexts.
Contribution
C-LLM addresses tokenization issues by learning at the character level, which reduces CSC to a task dominated by copying characters and supplemented by substituting the erroneous ones, and it achieves state-of-the-art results on benchmarks.
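To make the tokenization issue concrete, here is a minimal sketch comparing character-level splitting with a toy mixed character-word tokenizer. The greedy longest-match tokenizer, the two-word vocabulary, and the example sentence are all assumptions made for illustration; this is not the tokenizer of any real LLM, nor the paper's implementation.

```python
def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: one token per character."""
    return list(text)


def toy_mixed_tokens(text: str, vocab: set[str], max_len: int = 2) -> list[str]:
    """Toy mixed character-word tokenizer (hypothetical, greedy longest match).

    Multi-character pieces found in `vocab` become one token;
    everything else falls back to single characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            piece = text[i:i + size]
            if len(piece) == 1 or piece in vocab:
                tokens.append(piece)
                i += len(piece)
                break
    return tokens


# Illustrative vocabulary and sentence pair (assumed, not taken from the paper).
vocab = {"喜欢", "苹果"}
wrong = "我喜欢吃平果"   # contains the misspelling 平 for 苹
right = "我喜欢吃苹果"

# Character level: both sequences have the same length and align one to one.
print(len(char_tokens(wrong)), len(char_tokens(right)))

# Mixed level: the misspelling breaks the word 苹果 into two tokens,
# so source and target no longer have the same number of tokens.
print(toy_mixed_tokens(wrong, vocab))
print(toy_mixed_tokens(right, vocab))
```

With character-level tokens, the correction is a position-by-position copy with one substitution. With the toy mixed tokenizer, the source has five tokens and the target has four, so no one-to-one alignment exists.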
Findings
10% average improvement over existing methods
2.1% improvement in general scenarios
12% improvement in vertical domain scenarios
Abstract
Chinese Spell Checking (CSC) aims to detect and correct spelling errors in sentences. Although Large Language Models (LLMs) exhibit robust capabilities and are widely applied in various tasks, their performance on CSC is often unsatisfactory. We find that LLMs fail to meet the Chinese character-level constraints of the CSC task, namely equal length and phonetic similarity, leading to a performance bottleneck. Further analysis reveals that this issue stems from the granularity of tokenization, as current mixed character-word tokenization struggles to satisfy these character-level constraints. To address this issue, we propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character. Character-level tokenization enables the model to learn character-level alignment, effectively mitigating issues related to character-level…
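The two character-level constraints named in the abstract, equal length and phonetic similarity, can be sketched as a simple check on a model's output. The function name, the tiny hand-written pinyin table, and the example sentences below are assumptions for illustration only; a real system would need a full pronunciation lexicon, and nothing here is taken from the paper's code.

```python
# Toy pronunciation table (assumed, tones omitted); not a real lexicon.
TOY_PINYIN = {"平": "ping", "苹": "ping", "果": "guo"}


def check_constraints(source: str, prediction: str) -> dict:
    """Check a correction against the two character-level CSC constraints.

    Returns whether the lengths match and, if they do, a list of
    (position, source_char, predicted_char, same_sound) tuples for
    every position where the prediction changed a character.
    """
    if len(source) != len(prediction):
        # Equal-length constraint violated: no character alignment exists.
        return {"equal_length": False, "edits": []}

    edits = []
    for i, (s, p) in enumerate(zip(source, prediction)):
        if s != p:
            s_sound = TOY_PINYIN.get(s)
            # same_sound is False when either character is missing from the toy table.
            same_sound = s_sound is not None and s_sound == TOY_PINYIN.get(p)
            edits.append((i, s, p, same_sound))
    return {"equal_length": True, "edits": edits}


# A valid correction: same length, one substitution with the same sound.
print(check_constraints("我喜欢吃平果", "我喜欢吃苹果"))

# An invalid correction: the output added a character, so lengths differ.
print(check_constraints("我喜欢吃平果", "我喜欢吃苹果。"))
```

A check like this only diagnoses constraint violations after the fact; the abstract's point is that character-level tokenization makes such violations less likely in the first place.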
Taxonomy
Topics: Natural Language Processing Techniques
