When Classical Chinese Meets Machine Learning: Explaining the Relative Performances of Word and Sentence Segmentation Tasks
Chao-Lin Liu, Chang-Ting Chu, Wei-Ting Chang, and Ti-Yong Zheng

TL;DR
This paper explores the effectiveness of deep learning for classical Chinese text segmentation, analyzing how different training corpora influence performance and providing explanations for observed variations.
Contribution
It demonstrates the viability of deep learning for classical Chinese segmentation and offers insights into how training data selection affects results.
Findings
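Deep learning achieves satisfactory segmentation results on classical Chinese text.
The relative relevance among the training corpora influences segmentation performance.
Classifiers trained on different combinations of the three corpora achieve different segmentation results.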
Abstract
Our experiments on segmenting text written in classical Chinese draw on three major text sources about the Tang Dynasty of China: a collection of Tang Tomb Biographies, the New Tang Book, and the Old Tang Book. We show that the deep learning approach can achieve satisfactory segmentation results. More interestingly, we found that some of the relative superiority observed among different experimental designs may be explainable. The relative relevance among the training corpora offers hints and explanations for the differences in segmentation results that we observed when we trained the classifiers on different combinations of corpora.
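The abstract mentions training classifiers for segmentation but does not spell out the formulation. One common way to frame the task, offered here only as a minimal sketch and not as the authors' method, is per-character boundary labeling: each character receives label 1 if a segment ends after it and 0 otherwise, and predictions are scored by boundary F1. The function names (`to_labels`, `from_labels`, `boundary_f1`) and the sample string are hypothetical and are not taken from the paper or its corpora.

```python
from typing import List, Tuple


def to_labels(segments: List[str]) -> Tuple[str, List[int]]:
    """Flatten segments into one string plus a 0/1 label per character.

    Label 1 marks the last character of a segment; label 0 marks any other.
    """
    chars = "".join(segments)
    labels: List[int] = []
    for seg in segments:
        if not seg:
            continue  # ignore empty segments
        labels.extend([0] * (len(seg) - 1) + [1])
    return chars, labels


def from_labels(chars: str, labels: List[int]) -> List[str]:
    """Rebuild segments by cutting after every character labeled 1."""
    segments: List[str] = []
    current: List[str] = []
    for ch, lab in zip(chars, labels):
        current.append(ch)
        if lab == 1:
            segments.append("".join(current))
            current = []
    if current:  # trailing characters with no closing boundary
        segments.append("".join(current))
    return segments


def boundary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 score over predicted segment boundaries (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative input only (assumption): two clauses of unpunctuated text.
chars, gold = to_labels(["學而時習之", "不亦說乎"])
# A hypothetical classifier output with one spurious boundary.
pred = [0, 0, 1, 0, 1, 0, 0, 0, 1]
score = boundary_f1(gold, pred)
```

Under this framing, word segmentation and sentence segmentation differ only in what a "segment" is, so the same labeling and scoring code could serve both tasks; the classifier that produces `pred` is left out because the summary gives no details about it.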
