Building Chinese Lexicons from Scratch by Unsupervised Short Document Self-Segmentation

Daniel Gayo-Avello

arXiv:cs/0411074 · cs.CL · May 23, 2007

TL;DR

This paper introduces a simple, unsupervised algorithm for Chinese text segmentation that works effectively on short documents, enabling lexicon building without prior statistical data or lexicons, and performs well on unseen words.

Contribution

The paper presents a novel unsupervised self-segmentation algorithm tailored for short Chinese texts, capable of building lexicons from scratch without relying on existing statistical models.

Findings

01 Results comparable to native speakers' segmentation

02 Effective in identifying new words and proper nouns

03 Robust for lexicon construction from minimal input

Abstract

Chinese text segmentation is a well-known and difficult problem. On the one hand, there is no simple notion of "word" in the Chinese language, which makes it very hard to implement rule-based systems to segment written texts; lexicons and statistical information are therefore usually employed for the task. On the other hand, any piece of Chinese text usually includes segments present neither in the lexicons nor in the training data. Even worse, such unseen sequences can be segmented into a number of totally unrelated words, making later processing phases difficult. For instance, using a lexicon-based system, the sequence 巴罗佐 (Baluozuo, Barroso, current president-designate of the European Commission) can be segmented into 巴 (ba, to hope, to wish) and 罗佐 (luozuo, an undefined word), completely changing the meaning of the sentence. A new and extremely simple algorithm specially suited to work over…
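
The failure mode described in the abstract (Baluozuo split into ba plus an undefined remainder) can be illustrated with a small sketch. This is not the paper's algorithm, which the truncated abstract does not describe; it is a generic greedy forward maximum-matching segmenter, a common lexicon-based baseline. The function name, the toy lexicon, and the Chinese characters chosen for the transliteration (巴罗佐) are assumptions made for illustration only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right lexicon segmentation (illustrative baseline only).

    Returns a list of (segment, is_known) pairs. Runs of characters that
    match no lexicon entry are merged into a single unknown segment.
    """
    segments = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                match = candidate
                break
        if match is not None:
            segments.append((match, True))
            i += len(match)
        elif segments and not segments[-1][1]:
            # Extend the current run of unknown characters.
            segments[-1] = (segments[-1][0] + text[i], False)
            i += 1
        else:
            # Start a new unknown segment.
            segments.append((text[i], False))
            i += 1
    return segments


# Hypothetical toy lexicon: it knows "ba" (巴) but not the full name.
toy_lexicon = {"巴"}
print(forward_max_match("巴罗佐", toy_lexicon))

# Once the name itself is in the lexicon, it is kept whole.
print(forward_max_match("巴罗佐", toy_lexicon | {"巴罗佐"}))
```

With the toy lexicon, the name is cut into a known word (ba) followed by an unknown fragment (luozuo). Adding the full name to the lexicon keeps it intact, which is why approaches that depend on a fixed lexicon struggle with unseen proper nouns.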

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling