Onto Word Segmentation of the Complete Tang Poems
Chao-Lin Liu

TL;DR
This paper develops a Chinese word segmentation method for the Complete Tang Poems (CTP) using PMI and biLSTM models. The biLSTM model segments a poem completely correctly over 20% of the time, compared with a 40% agreement rate between human annotators.
Contribution
It introduces a domain-specific word segmentation approach for classical Chinese poetry, combining PMI and biLSTM models, and evaluates performance against human annotations.
Findings
PMI-based segmenter recovers 85.7% of words
biLSTM model segments completely correctly over 20% of the time
Human annotators agree on their annotations 40% of the time
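The two kinds of figures above (share of words recovered, and share of poems segmented completely correctly) can be illustrated with a minimal sketch. The exact scoring protocol is not given in this summary, so the definitions below are assumptions: a word counts as recovered when the predicted segmentation contains the same character span, and a poem counts as completely correct when the predicted word list equals the reference. The function names `spans` and `evaluate` are hypothetical.

```python
def spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def evaluate(gold, pred):
    """Score predicted segmentations against reference segmentations.

    gold, pred: parallel lists; each item is one poem (or line) given as a
    list of words. Returns (word_recovery_rate, complete_match_rate).
    Assumed definitions, not necessarily those used in the paper.
    """
    recovered = total = exact = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = spans(g), spans(p)
        recovered += len(gold_spans & pred_spans)  # reference words found in the prediction
        total += len(gold_spans)
        exact += (g == p)                          # whole item segmented identically
    return recovered / total, exact / len(gold)


# Toy usage with two hand-segmented lines (illustrative data only).
gold = [["白日", "依", "山", "盡"], ["黃河", "入", "海流"]]
pred = [["白日", "依", "山", "盡"], ["黃河", "入", "海", "流"]]
word_rate, complete_rate = evaluate(gold, pred)
```

Under these definitions a segmenter can recover most words and still get few poems entirely right, because a single wrong boundary anywhere in a poem makes the whole poem count as incorrect.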
Abstract
We aim to segment words in the Complete Tang Poems (CTP). Although some research on CTP is possible without full-scale word segmentation, we must move on to word-level analysis of CTP to pursue advanced research topics. In November 2018, when we submitted the manuscript for DH 2019 (ADHO), we had collected only 2433 poems that were segmented by trained experts, and we used the segmented poems to evaluate a segmenter that considered domain knowledge of Chinese poetry. We trained pointwise mutual information (PMI) between Chinese characters based on the CTP poems (excluding the 2433 poems, which were used exclusively for testing) and the domain knowledge. The segmenter relied on the PMI information to recover 85.7% of the words in the test poems. However, we could segment a poem completely correctly only 17.8% of the time. When we presented our work at DH 2019,…
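As a rough illustration of PMI between adjacent characters, the sketch below estimates PMI(a, b) = log2(p(a, b) / (p(a) p(b))) from unsegmented text and splits a line wherever the PMI of two neighboring characters falls below a threshold. This is a simplified stand-in, not the segmenter described in the abstract: it uses no domain knowledge of poetry, and the function names, the threshold rule, and the default threshold value are all assumptions made for the example.

```python
import math
from collections import Counter


def train_pmi(lines):
    """Estimate PMI for each adjacent character pair seen in unsegmented lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)                 # single-character counts
        bigrams.update(zip(line, line[1:]))   # adjacent-pair counts
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def segment(line, pmi, threshold=0.0):
    """Greedy segmentation: keep two neighbors in one word when their PMI
    reaches the threshold, otherwise start a new word. Pairs never seen in
    training are scored as negative infinity. (Hypothetical rule.)"""
    if not line:
        return []
    words = [line[0]]
    for a, b in zip(line, line[1:]):
        if pmi.get((a, b), float("-inf")) >= threshold:
            words[-1] += b
        else:
            words.append(b)
    return words


# Toy usage: train on a few lines, then segment one of them.
corpus = ["白日依山盡", "黃河入海流", "白日放歌須縱酒"]
table = train_pmi(corpus)
words = segment("白日依山盡", table, threshold=3.0)
```

A single global threshold is the simplest possible decision rule; in practice the cut-off would need tuning on held-out data, and unseen character pairs would need a more careful treatment than an automatic split.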
Taxonomy
Topics: Advanced Computational Techniques and Applications
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM
