Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training

Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, and Min Zhang

arXiv:2404.05560 · cs.CL · April 9, 2024 · 1 citation

TL;DR

This paper introduces a semi-supervised boundary-aware pre-trained language model (PLM) for Chinese sequence labeling that adds supervised boundary information to BABERT's pre-training, and proposes a new metric for evaluating boundary awareness in PLMs.

Contribution

The paper develops a semi-supervised boundary-aware PLM by incorporating supervised boundary information into BABERT, and proposes a novel metric for evaluating how well PLMs encode boundaries.

Findings

01 The improved BABERT outperforms the vanilla version on Chinese sequence labeling tasks.

02 The new metric effectively measures PLMs' boundary awareness without task-specific fine-tuning.

03 Boundary-aware PLMs enhance performance across various Chinese NLP tasks.

Abstract

Chinese sequence labeling tasks are heavily reliant on accurate word boundary demarcation. Although current pre-trained language models (PLMs) have achieved substantial gains on these tasks, they rarely explicitly incorporate boundary information into the modeling process. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT's learning, developing a semi-supervised boundary-aware PLM. To assess PLMs' ability to encode boundaries, we introduce a novel "Boundary Information Metric" that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT variant…

Code & Models

Repositories

modelscope/adaseq
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques