CASICT Tibetan Word Segmentation System for MLWS2017

Jiawei Hu; Qun Liu

arXiv:1710.06112·cs.CL·October 18, 2017

TL;DR

This paper presents a Tibetan word segmentation system that combines a baseline model, subword units via BPE, and neural network classification, achieving improved accuracy in the MLWS 2017 challenge.

Contribution

Introduces a novel Tibetan segmentation approach integrating subword units and neural networks, enhancing baseline performance in a low-resource setting.

Findings

01 Significant performance improvement over baseline

02 Effective correction of segmentation errors

03 Utilized a large corpus of 760,000 sentences

Abstract

We participated in the Tibetan word segmentation task of MLWS 2017. Our system is trained in an unrestricted way, using a baseline system and 760,000 (76w) segmented Tibetan sentences of our own. In the system, the character sequence is first processed by the baseline system into a word sequence; a subword unit model (the BPE algorithm) then splits rare words into subwords with their corresponding features. Next, a neural network classifier tags each subword with a "B, M, E, S" label, and in the decoding step a simple rule recovers the final word sequence. The candidate system for submission is selected by evaluating the F-score on a dev set pre-extracted from the 760,000 sentences. Experiments show that this method can fix segmentation errors of the baseline system and results in a significant performance gain.
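
The abstract does not spell out the decoding rule, so the following is only a minimal sketch of how "B, M, E, S" labels are commonly turned back into words, assuming the usual convention (B = begin, M = middle, E = end, S = single-unit word). The function name `decode_bmes` and the fallback handling of ill-formed label sequences are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def decode_bmes(units: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Merge labeled units (characters or subwords) into words.

    Assumed convention: B opens a word, M continues it, E closes it,
    S marks a unit that is a word on its own. Ill-formed sequences
    (e.g. B followed by B) are handled by flushing the pending word.
    """
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")

    words: List[str] = []
    pending = ""  # units of the word currently being built

    for unit, label in zip(units, labels):
        if label == "S":
            if pending:  # unfinished word before a single: flush it
                words.append(pending)
                pending = ""
            words.append(unit)
        elif label == "B":
            if pending:  # previous word never saw its E: flush it
                words.append(pending)
            pending = unit
        elif label == "M":
            pending += unit
        elif label == "E":
            words.append(pending + unit)
            pending = ""
        else:
            raise ValueError(f"unknown label: {label!r}")

    if pending:  # sequence ended without a closing E
        words.append(pending)
    return words


if __name__ == "__main__":
    # Placeholder Latin units stand in for Tibetan subwords.
    print(decode_bmes(["a", "b", "c", "d"], ["B", "E", "S", "S"]))
```

Flushing a pending word on an unexpected label is one possible design choice; a stricter decoder could instead reject ill-formed label sequences.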

Code & Models

Repositories

rsennrich/subword-nmt

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis