CASICT Tibetan Word Segmentation System for MLWS2017
Jiawei Hu, Qun Liu

TL;DR
This paper presents a Tibetan word segmentation system that combines a baseline model, subword units via BPE, and neural network classification, achieving improved accuracy in the MLWS 2017 challenge.
Contribution
Introduces a novel Tibetan segmentation approach integrating subword units and neural networks, enhancing baseline performance in a low-resource setting.
Findings
Significant performance improvement over baseline
Effective correction of segmentation errors
Utilized a large corpus of 760,000 sentences
Abstract
We participated in the MLWS 2017 Tibetan word segmentation task. Our system is trained in an unrestricted way, using a baseline system and 760,000 Tibetan segmented sentences of our own. The baseline system first converts the character sequence into a word sequence. A subword unit model (the BPE algorithm) then splits rare words into subwords with their corresponding features. Next, a neural network classifier tags each subword with a "B", "M", "E", or "S" label, and in the decoding step a simple rule recovers the final word sequence. The candidate system for submission is selected by its F-score on a dev set extracted in advance from the 760,000 sentences. Experiments show that this method can fix segmentation errors made by the baseline system and yields a significant performance gain.
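The abstract does not spell out the decoding rule or the exact F-score computation, so the following is only a minimal sketch under stated assumptions. It assumes the usual reading of the tags (B = begin, M = middle, E = end, S = single-unit word), a rule that starts a new word at every "B" or "S", and an F-score computed over word spans measured in character offsets. The function names `decode_bmes`, `word_spans`, and `segmentation_f_score` are hypothetical, and Latin letters stand in for Tibetan subword units.

```python
from typing import List, Set, Tuple


def decode_bmes(units: List[str], labels: List[str]) -> List[str]:
    """Join labelled subword units into words.

    Assumed rule (not taken from the paper): "B" and "S" start a new
    word, "M" and "E" extend the current one, "E" and "S" close it.
    """
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")
    words: List[str] = []
    current = ""
    for unit, label in zip(units, labels):
        if label in ("B", "S"):
            # A new word starts here; flush any unfinished word first.
            if current:
                words.append(current)
            current = unit
        else:
            # "M" or "E": continue the word in progress.
            current += unit
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Tolerate a sequence that ends without a closing "E".
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) character offsets of each word."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(gold: List[str], predicted: List[str]) -> float:
    """Harmonic mean of word-level precision and recall over spans."""
    gold_spans = word_spans(gold)
    predicted_spans = word_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted_words = decode_bmes(["a", "b", "c", "d"], ["B", "E", "B", "E"])
    print(predicted_words)
    print(segmentation_f_score(["ab", "c", "d"], predicted_words))
```

Comparing spans rather than word strings means a predicted word only counts as correct when both of its boundaries match the gold segmentation, which is the usual convention for segmentation scoring.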
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
