TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

Yutong Liu; Feng Xiao; Ziyue Zhang; Yongbin Yu; Cheng Huang; Fan Gao; Xiangxiang Wang; Ma-bao Ban; Manping Fan; Thupten Tsering; Cheng Huang; Gadeng Luosang; Renzeng Duojie; Nyima Tashi

arXiv:2505.08037 · cs.CL · May 15, 2025

TL;DR

TiSpell is a novel semi-masked Tibetan spelling correction model that effectively handles multi-level errors using data augmentation, outperforming existing methods on both simulated and real-world datasets.

Contribution

We introduce a semi-masked model and a data augmentation strategy for Tibetan spelling correction addressing multi-level errors, filling gaps in open datasets and improving correction accuracy.

Findings

01 TiSpell outperforms baseline models on both simulated and real-world data.

02 The proposed data augmentation improves model robustness.

03 TiSpell matches state-of-the-art performance.

Abstract

Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms…
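The augmentation idea in the abstract (corrupting clean, unlabeled sentences at the character and syllable levels to obtain training pairs) can be illustrated with a minimal sketch. The four corruption functions below are illustrative assumptions, not the paper's nine corruption types, which this summary does not enumerate. The helper names (`split_syllables`, `corrupt`, `make_pairs`) are hypothetical, and syllables are split naively on the tsheg delimiter (U+0F0B), which ignores punctuation such as the shad.

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(sentence):
    """Naive split on tsheg; a simplifying assumption for this sketch."""
    return [s for s in sentence.split(TSHEG) if s]


def join_syllables(syllables):
    return TSHEG.join(syllables)


# --- Character-level corruptions (hypothetical examples) ---

def delete_char(syllables, rng):
    """Drop one code point from a randomly chosen syllable."""
    out = list(syllables)
    i = rng.randrange(len(out))
    if len(out[i]) < 2:
        return out  # too short to corrupt safely; leave unchanged
    j = rng.randrange(len(out[i]))
    out[i] = out[i][:j] + out[i][j + 1:]
    return out


def swap_chars(syllables, rng):
    """Transpose two adjacent code points inside one syllable."""
    out = list(syllables)
    i = rng.randrange(len(out))
    if len(out[i]) < 2:
        return out
    j = rng.randrange(len(out[i]) - 1)
    chars = list(out[i])
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    out[i] = "".join(chars)
    return out


# --- Syllable-level corruptions (hypothetical examples) ---

def delete_syllable(syllables, rng):
    """Remove one whole syllable."""
    if len(syllables) < 2:
        return list(syllables)
    i = rng.randrange(len(syllables))
    return syllables[:i] + syllables[i + 1:]


def swap_syllables(syllables, rng):
    """Transpose two adjacent syllables."""
    out = list(syllables)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


CORRUPTIONS = [delete_char, swap_chars, delete_syllable, swap_syllables]


def corrupt(sentence, rng):
    """Apply one randomly chosen corruption to a clean sentence."""
    syllables = split_syllables(sentence)
    if not syllables:
        return sentence
    op = rng.choice(CORRUPTIONS)
    return join_syllables(op(syllables, rng))


def make_pairs(clean_sentences, seed=0):
    """Build (corrupted, clean) pairs from unlabeled clean text."""
    rng = random.Random(seed)
    return [(corrupt(s, rng), s) for s in clean_sentences]
```

Each pair keeps the clean sentence as the correction target, so any clean corpus can serve as supervision. Seeding the generator makes the synthetic set reproducible.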

Code & Models

Repositories

Yutong-gannis/TiSpell (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
