Cross-lingual Word Segmentation and Morpheme Segmentation as Sequence Labelling
Yan Shao

TL;DR
This paper introduces a universal character-level sequence labelling approach using bidirectional RNNs with CRFs for cross-lingual word and morpheme segmentation, achieving high accuracy across multiple languages without language-specific tuning.
Contribution
Proposes a universal, language-agnostic character-level sequence labelling system for word and morpheme segmentation, built on a bidirectional RNN with a CRF output layer and improved via ensemble decoding, and evaluated on all official MLP 2017 shared task data sets.
Findings
Achieves high accuracy on all evaluated languages
Compares favourably with the other participating systems in the official shared task evaluation
Demonstrates effectiveness without language-specific adjustments
Abstract
This paper presents our segmentation system developed for the MLP 2017 shared tasks on cross-lingual word segmentation and morpheme segmentation. We model both word and morpheme segmentation as character-level sequence labelling tasks. The prevalent bidirectional recurrent neural network with conditional random fields as the output interface is adopted as the baseline system, which is further improved via ensemble decoding. Our universal system is applied to and extensively evaluated on all the official data sets without any language-specific adjustment. The official evaluation results indicate that the proposed model achieves outstanding accuracy for both word and morpheme segmentation across languages of various types, compared with the other participating systems.
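To make the idea of segmentation as character-level sequence labelling concrete, the following minimal sketch converts a segmented sequence into per-character boundary tags and back. The B/I/E/S tag set (begin, inside, end, single) and the function names `words_to_tags` and `tags_to_words` are illustrative assumptions, not details confirmed by the abstract; the paper's actual tag set and its neural tagger are not reproduced here.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map every character of every segment to a boundary tag.

    Assumed tag set (hypothetical, for illustration only):
    B = first character, I = inside, E = last character, S = single-character segment.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild segments from a character string and its predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A segment closes whenever the tag marks an end or a single character.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters left open by an ill-formed tag sequence.
        words.append(current)
    return words


# Morpheme segmentation example: "unbreakable" split as un + break + able.
morphemes = ["un", "break", "able"]
tags = words_to_tags(morphemes)
restored = tags_to_words("".join(morphemes), tags)
```

Under this framing, the same labelling scheme serves word segmentation and morpheme segmentation alike: a tagger only has to predict one tag per character, and the segments are recovered deterministically from the tags.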
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
