Building Chinese Lexicons from Scratch by Unsupervised Short Document Self-Segmentation
Daniel Gayo-Avello

TL;DR
This paper introduces a simple, unsupervised algorithm for Chinese text segmentation that works effectively on short documents, enabling lexicon building without prior statistical data or lexicons, and performs well on unseen words.
Contribution
The paper presents a novel unsupervised self-segmentation algorithm tailored for short Chinese texts, capable of building lexicons from scratch without relying on existing statistical models.
Findings
Results comparable to native speakers' segmentation
Effective in identifying new words and proper nouns
Robust for lexicon construction from minimal input
Abstract
Chinese text segmentation is a well-known and difficult problem. On the one hand, there is no simple notion of "word" in the Chinese language, which makes it really hard to implement rule-based systems to segment written texts; lexicons and statistical information are therefore usually employed for the task. On the other hand, any piece of Chinese text usually includes segments present neither in the lexicons nor in the training data. Even worse, such unseen sequences can be segmented into a number of totally unrelated words, making later processing phases difficult. For instance, using a lexicon-based system, the sequence 巴罗佐 (Baluozuo, Barroso, current president-designate of the European Commission) can be segmented into 巴 (ba, to hope, to wish) and 罗佐 (luozuo, an undefined word), completely changing the meaning of the sentence. A new and extremely simple algorithm specially suited to work over…
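The failure mode described in the abstract can be illustrated with a minimal sketch of a lexicon-based segmenter. This is not the paper's algorithm: it assumes greedy forward maximum matching, a hypothetical toy lexicon, and the characters 巴罗佐 as the rendering of "Baluozuo". The function name `forward_max_match` and the `max_len` parameter are likewise illustrative choices.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed baseline, not the paper's method).

    At each position the longest lexicon entry is taken. Characters that
    start no lexicon entry are collected into a single "unknown" chunk.
    """
    tokens = []
    unknown = ""  # buffer of consecutive out-of-lexicon characters
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                match = text[i:i + size]
                break
        if match is None:
            unknown += text[i]
            i += 1
        else:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            i += len(match)
    if unknown:
        tokens.append(unknown)
    return tokens


# Hypothetical toy lexicon: it knows 巴 as a word but not the name 巴罗佐.
toy_lexicon = {"巴", "欧盟", "主席"}
print(forward_max_match("欧盟主席巴罗佐", toy_lexicon))

# Once the name is in the lexicon, it is kept whole.
print(forward_max_match("欧盟主席巴罗佐", toy_lexicon | {"巴罗佐"}))
```

With the toy lexicon, the name is split into 巴 plus an unknown remainder 罗佐, mirroring the example in the abstract. Adding the name to the lexicon fixes this one case, which is why building lexicons that cover unseen words matters.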
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling
