Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training
Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, and Min Zhang

TL;DR
This paper introduces a semi-supervised boundary-aware pre-trained language model (PLM) for Chinese sequence labeling, built by adding supervised boundary information to BABERT, and proposes a new metric for evaluating boundary awareness in PLMs.
Contribution
It develops a semi-supervised boundary-aware PLM by incorporating supervised boundary information into BABERT and proposes a novel metric to evaluate boundary encoding in PLMs.
Findings
The improved BABERT outperforms the vanilla version on Chinese sequence labeling tasks.
The new metric effectively measures PLMs' boundary awareness without task-specific fine-tuning.
Boundary-aware PLMs enhance performance across various Chinese NLP tasks.
Abstract
Chinese sequence labeling tasks are heavily reliant on accurate word boundary demarcation. Although current pre-trained language models (PLMs) have achieved substantial gains on these tasks, they rarely explicitly incorporate boundary information into the modeling process. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT's learning, developing a semi-supervised boundary-aware PLM. To assess PLMs' ability to encode boundaries, we introduce a novel "Boundary Information Metric" that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT variant…
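To make the dependence on word boundaries concrete, the sketch below converts a segmented Chinese sentence into per-character boundary tags. It assumes the common BMES scheme (Begin, Middle, End, Single), which the abstract does not name; the function name `words_to_bmes` and the example sentence are illustrative only and are not taken from the paper.

```python
def words_to_bmes(words):
    """Map a list of segmented words to per-character BMES tags.

    Assumption: BMES is used here as a typical boundary-labeling scheme;
    the paper summary above does not specify which scheme it uses.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings so no stray tags are emitted
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Illustrative segmentation (hypothetical example sentence).
segmented = ["南京市", "长江", "大桥"]
chars = [ch for word in segmented for ch in word]
tags = words_to_bmes(segmented)

for ch, tag in zip(chars, tags):
    print(ch, tag)
```

Every character receives exactly one tag, so a model that encodes boundaries poorly will mislabel characters at word edges, which is the kind of error a boundary-aware PLM aims to reduce.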
Taxonomy
Topics: Natural Language Processing Techniques
