Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
Peijie Jiang, Dingkun Long, Yanzhao Zhang, Pengjun Xie, Meishan Zhang, Min Zhang

TL;DR
This paper introduces BABERT, an unsupervised boundary-aware pretraining method for Chinese language models, improving sequence labeling tasks without relying on external lexicons.
Contribution
It proposes an unsupervised approach to encode boundary information directly into BERT, enhancing Chinese sequence labeling performance and complementing lexicon-based methods.
Findings
Consistent improvements on ten Chinese sequence labeling benchmarks.
BABERT effectively encodes boundary information without external lexicons.
Combining BABERT with lexicon data yields further gains.
Abstract
Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to the use of a high-quality external lexicon, where lexicon items can offer explicit boundary information. However, to ensure the quality of the lexicon, great human effort is always necessary, which has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets. In addition, our method can…
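As a rough illustration of what "unsupervised statistical boundary information" can look like, the sketch below computes two statistics that are commonly used for this purpose from raw, unsegmented text: pointwise mutual information (PMI) of adjacent characters, and the entropy of the characters seen to the left and right of a character pair. The text above does not say which statistics BABERT uses or how they are fed into the model, so the choice of statistics, the function name `boundary_statistics`, and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def boundary_statistics(corpus):
    """Compute PMI and left/right context entropy for adjacent character pairs.

    Hypothetical helper for illustration; not the paper's implementation.
    `corpus` is an iterable of unsegmented strings.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # characters seen just before a pair
    right_ctx = defaultdict(Counter)  # characters seen just after a pair

    for sent in corpus:
        unigrams.update(sent)
        for i in range(len(sent) - 1):
            pair = sent[i:i + 2]
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][sent[i - 1]] += 1
            if i + 2 < len(sent):
                right_ctx[pair][sent[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        # Shannon entropy (natural log) of a context distribution.
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    pmi = {}
    for pair, count in bigrams.items():
        p_pair = count / n_bi
        p_first = unigrams[pair[0]] / n_uni
        p_second = unigrams[pair[1]] / n_uni
        pmi[pair] = math.log(p_pair / (p_first * p_second))

    left_entropy = {pair: entropy(left_ctx[pair]) for pair in bigrams}
    right_entropy = {pair: entropy(right_ctx[pair]) for pair in bigrams}
    return pmi, left_entropy, right_entropy


if __name__ == "__main__":
    toy_corpus = ["我爱北京", "他去北京", "北京很大", "北京是首都"]
    pmi, left_h, right_h = boundary_statistics(toy_corpus)
    for pair in ("北京", "爱北"):
        print(pair, pmi[pair], left_h[pair], right_h[pair])
```

In this toy corpus the pair 北京 appears with varied neighbours on both sides, so its context entropies are higher than those of 爱北, which straddles a word boundary and always appears in the same context. High context entropy on both sides is one signal that a character span behaves like a free-standing unit; a real system would estimate these statistics on a large corpus.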
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Residual Connection · Dropout · Weight Decay · Adam · Softmax · WordPiece
