Onto Word Segmentation of the Complete Tang Poems

Chao-Lin Liu

arXiv:1908.10621 · cs.CL · August 29, 2019

TL;DR

This paper develops Chinese word segmentation methods for the Complete Tang Poems using PMI and biLSTM models. The biLSTM models segment a poem completely correctly over 20% of the time, compared with 40% agreement among human annotators.

Contribution

It introduces a domain-specific word segmentation approach for classical Chinese poetry, combining PMI and biLSTM models, and evaluates performance against human annotations.

Findings

01

PMI-based segmenter recovers 85.7% of words in the test poems

02

biLSTM model segments a poem completely correctly over 20% of the time

03

Human annotators agree on their annotations 40% of the time
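
The two figures above measure different things: the share of reference words recovered, and the share of poems segmented with no error at all. The sketch below shows one way to compute both. It assumes that "recovering a word" means reproducing its exact character span in the expert segmentation (the page does not give the paper's precise definition), and the names `to_spans` and `evaluate` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(gold_poems, pred_poems):
    """Return (word recovery rate, fraction of poems segmented exactly right).

    Each poem is a list of words. Assumption: a gold word counts as
    recovered only if the prediction contains the same character span.
    """
    recovered = total = exact = 0
    for gold, pred in zip(gold_poems, pred_poems):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        recovered += len(gold_spans & pred_spans)
        total += len(gold_spans)
        exact += gold_spans == pred_spans
    return recovered / total, exact / len(gold_poems)
```

Because a single wrong boundary anywhere in a poem makes the whole poem count as incorrect, a high word recovery rate can coexist with a low complete-poem rate.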

Abstract

We aim at segmenting words in the Complete Tang Poems (CTP). Although it is possible to do some research about CTP without doing full-scale word segmentation, we must move forward to word-level analysis of CTP for conducting advanced research topics. In November 2018, when we submitted the manuscript for DH 2019 (ADHO), we had collected only 2433 poems that were segmented by trained experts, and used the segmented poems to evaluate the segmenter that considered domain knowledge of Chinese poetry. We estimated pointwise mutual information (PMI) between Chinese characters from the CTP poems (excluding the 2433 poems, which were used exclusively for testing) and the domain knowledge. The segmenter relied on the PMI information to recover 85.7% of words in the test poems. However, we could segment a poem completely correctly only 17.8% of the time. When we presented our work at DH 2019,…
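
As a rough illustration of the PMI idea in the abstract, the sketch below estimates PMI for adjacent character pairs from unsegmented lines and joins neighbouring characters whose PMI reaches a threshold. It is a minimal sketch under stated assumptions, not the paper's segmenter: the thresholding rule, the `threshold` value, and the function names `train_pmi` and `segment` are all hypothetical, and the domain knowledge of poetry that the paper uses is left out.

```python
import math
from collections import Counter


def train_pmi(lines):
    """Estimate PMI (base 2) for adjacent character pairs in unsegmented lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def segment(line, pmi, threshold=0.0):
    """Join adjacent characters whose PMI is at least `threshold`.

    Assumption: pairs never seen in training are always split.
    """
    if not line:
        return []
    words = [line[0]]
    for a, b in zip(line, line[1:]):
        if pmi.get((a, b), float("-inf")) >= threshold:
            words[-1] += b   # strong association: extend the current word
        else:
            words.append(b)  # weak association: start a new word
    return words


if __name__ == "__main__":
    toy_corpus = ["明月照", "明月光", "山照水", "山光水"]  # toy data, not from CTP
    scores = train_pmi(toy_corpus)
    print(segment("明月照", scores, threshold=3.0))
```

In this toy corpus the pair 明月 occurs twice as often as any other pair, so it scores highest and stays joined while the weaker pairs are split.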

Taxonomy

Topics: Advanced Computational Techniques and Applications

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM