Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Yerai Doval; Carlos Gómez-Rodríguez

arXiv:1812.00815 · cs.CL · December 4, 2018

TL;DR

This paper compares byte/character-level language models, implemented either as n-gram models or as recurrent neural networks, for word segmentation. The goal is a preprocessing step for microtext normalization that copes with the data sparsity of such texts.

Contribution

The paper introduces a word segmentation system that combines beam search with a byte/character-level language model, and reports that it outperforms existing tools on microtexts.

Findings

01. The neural network model outperforms the n-gram model in segmentation accuracy.

02. The system surpasses Microsoft Word Breaker and Python WordSegment in key metrics.

03. The system handles the data sparsity of microtexts effectively.

Abstract

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes the text input with no word boundaries one token at a time, which can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary must be placed in the current position. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available…
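
The procedure described in the abstract (read the unsegmented input one token at a time, and let a character-level language model decide whether a boundary goes at the current position) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names CharNGramLM, segment and beam_width are hypothetical, the language model is a toy character n-gram with add-one smoothing, and the scoring simply sums log-probabilities without any of the refinements a real system would need.

```python
import math
from collections import Counter, defaultdict


class CharNGramLM:
    """Toy character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, order=3):
        self.order = order
        self.vocab = set(text) | {" "}
        self.counts = defaultdict(Counter)
        # Pad with spaces so the first characters also have a full context.
        padded = " " * (order - 1) + text
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            self.counts[context][padded[i]] += 1

    def logprob(self, history, char):
        """Return log P(char | last order-1 characters of history)."""
        padded = " " * (self.order - 1) + history
        context = padded[len(padded) - (self.order - 1):]
        seen = self.counts.get(context, Counter())
        total = sum(seen.values())
        # One extra slot in the denominator is reserved for unseen characters.
        return math.log((seen[char] + 1) / (total + len(self.vocab) + 1))


def segment(text, lm, beam_width=8):
    """Insert spaces into `text` using beam search over boundary decisions."""
    beams = [("", 0.0)]  # (hypothesis so far, cumulative log-probability)
    for char in text:
        candidates = []
        for hyp, score in beams:
            # Option 1: no boundary before this character.
            candidates.append((hyp + char, score + lm.logprob(hyp, char)))
            # Option 2: place a boundary, then the character.
            # Skipped at the very start, where a boundary is meaningless.
            if hyp:
                spaced = hyp + " "
                new_score = (score + lm.logprob(hyp, " ")
                             + lm.logprob(spaced, char))
                candidates.append((spaced + char, new_score))
        # Keep only the best `beam_width` hypotheses for the next step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


if __name__ == "__main__":
    lm = CharNGramLM("the cat sat on the mat and the cat ate", order=3)
    print(segment("thecatsat", lm))
```

Because every hypothesis in the beam has consumed the same number of input characters, removing the spaces from the result always gives back the original input. Hypotheses with more boundaries contain more tokens, so a plain sum of log-probabilities tends to penalize them. How that bias should be handled is a design decision this sketch leaves open.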
