C-LLM: Learn to Check Chinese Spelling Errors Character by Character

Kunting Li; Yong Hu; Liang He; Fandong Meng; Jie Zhou

arXiv:2406.16536 · cs.CL · October 29, 2024

TL;DR

This paper introduces C-LLM, an LLM-based Chinese spell checking method that learns to check errors character by character. It improves accuracy over existing methods, especially in domain-specific (vertical) scenarios.

Contribution

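C-LLM addresses the tokenization-granularity problem by learning at the character level, so that the model can respect the character-level constraints of CSC (equal length and phonetic similarity). This simplifies CSC into tasks the model handles more effectively and achieves state-of-the-art results on benchmarks.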

Findings

01  10% average improvement over existing methods

02  2.1% improvement in general scenarios

03  12% improvement in vertical domain scenarios

Abstract

Chinese Spell Checking (CSC) aims to detect and correct spelling errors in sentences. Although Large Language Models (LLMs) exhibit robust capabilities and are widely applied in various tasks, their performance on CSC is often unsatisfactory. We find that LLMs fail to meet the Chinese character-level constraints of the CSC task, namely equal length and phonetic similarity, leading to a performance bottleneck. Further analysis reveals that this issue stems from the granularity of tokenization, as current mixed character-word tokenization struggles to satisfy these character-level constraints. To address this issue, we propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character. Character-level tokenization enables the model to learn character-level alignment, effectively mitigating issues related to character-level…
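The tokenization argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy vocabulary `TOY_VOCAB`, the greedy longest-match tokenizer `mixed_tokenize`, and the helper `char_edits` are all hypothetical names introduced here, and the example sentence is made up. The sketch only shows why a mixed character-word tokenization can produce source and target token sequences of different lengths, while character-level tokenization keeps them aligned position by position. Phonetic similarity is not checked, since that would need a pronunciation dictionary.

```python
from typing import List, Set, Tuple

# Hypothetical toy vocabulary of multi-character tokens (illustration only;
# not taken from any real LLM tokenizer).
TOY_VOCAB: Set[str] = {"今天", "天气", "很好"}


def mixed_tokenize(text: str, vocab: Set[str] = TOY_VOCAB, max_len: int = 2) -> List[str]:
    """Greedy longest-match tokenization: use a vocabulary word when one
    matches at the current position, otherwise fall back to one character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization: one token per character."""
    return list(text)


def char_edits(source: str, prediction: str) -> List[Tuple[int, str, str]]:
    """Return (position, source_char, predicted_char) for every substitution.
    Raises if the equal-length constraint of CSC is violated."""
    if len(source) != len(prediction):
        raise ValueError("equal-length constraint violated")
    return [(i, s, p) for i, (s, p) in enumerate(zip(source, prediction)) if s != p]


# Made-up example: 汽 is a misspelling of 气 (both pronounced "qi").
source = "今天天汽很好"
target = "今天天气很好"

# Mixed tokenization: the typo breaks the word 天气, so the two token
# sequences have different lengths and no position-wise alignment.
print(mixed_tokenize(source))  # ['今天', '天', '汽', '很好']
print(mixed_tokenize(target))  # ['今天', '天气', '很好']

# Character-level tokenization: both sides have six tokens, and the
# correction is a single substitution at index 3.
print(len(char_tokenize(source)), len(char_tokenize(target)))  # 6 6
print(char_edits(source, target))  # [(3, '汽', '气')]
```

With character-level tokens, every output position either copies the input character or substitutes it, which is the kind of alignment the abstract says mixed tokenization struggles to provide.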

Code & Models

Repositories

ktlktl/c-llm (PyTorch, official)

Videos

C-LLM: Learn to Check Chinese Spelling Errors Character by Character (underline)

Taxonomy

Topics: Natural Language Processing Techniques