Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations
Chanatip Saetia, Ekapol Chuangsuwanich, Tawunrat Chalothorn, Peerapon Vateekul

TL;DR
This paper introduces a semi-supervised deep learning model for Thai sentence segmentation that combines local n-gram embeddings, self-attention for distant context, and unlabeled data to improve accuracy, outperforming baseline models.
Contribution
The paper presents a novel integration of local and distant word representations with semi-supervised learning for Thai sentence segmentation.
Findings
Reduced relative error by 7.4% and 10.5% on Thai datasets
Outperformed prior models on English punctuation restoration
N-gram embeddings were key for Thai; semi-supervised learning benefited English
Abstract
A sentence is typically treated as the minimal syntactic unit used for extracting valuable information from a longer piece of text. However, in written Thai, there are no explicit sentence markers. We proposed a deep learning model for the task of sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, due to the scarcity of labeled data, for which annotation is difficult and time-consuming, we also investigate and adapt Cross-View Training (CVT) as a semi-supervised learning technique, allowing us to utilize unlabeled data to improve the model representations. In the Thai sentence segmentation experiments, our model…
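As a rough illustration of the "local representation" idea in the abstract, the sketch below collects the word n-grams around each position; in a model of this kind, each n-gram would then be looked up in an embedding table. It also shows one plausible way to frame segmentation as per-word tagging. The function names, the padding token, the window size, and the choice to label the last word of each sentence are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding token, not from the paper


def local_ngrams(words: Sequence[str], index: int,
                 n: int = 2, window: int = 1) -> List[Tuple[str, ...]]:
    """Return the n-grams that start within `window` words of `index`.

    Positions outside the sequence are filled with PAD so every word
    gets the same number of n-grams.
    """
    margin = window + n - 1
    padded = [PAD] * margin + list(words) + [PAD] * margin
    grams = []
    for start in range(index - window, index + window + 1):
        s = start + margin  # shift into the padded sequence
        grams.append(tuple(padded[s:s + n]))
    return grams


def boundary_labels(sentences: Sequence[Sequence[str]]) -> List[int]:
    """Flatten sentences into per-word labels: 1 marks a sentence-final word.

    Labeling the final word (rather than the first) is an assumption
    made for this sketch.
    """
    labels: List[int] = []
    for sent in sentences:
        if not sent:
            continue  # skip empty sentences
        labels.extend([0] * (len(sent) - 1) + [1])
    return labels


if __name__ == "__main__":
    words = ["a", "b", "c", "d"]
    print(local_ngrams(words, 1))              # bigrams around "b"
    print(boundary_labels([["a", "b"], ["c", "d"]]))
```

A tagger trained on such labels predicts, for every word, whether a sentence boundary follows it; the n-gram features give it a view of the word groups immediately around each candidate boundary.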
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Convolution · CNN Bidirectional LSTM · Softmax · Dropout · Cross-View Training
