TL;DR
MarkBERT is a Chinese BERT model that incorporates word boundary markers to effectively handle out-of-vocabulary words and enrich semantic understanding, improving performance on language tasks.
Contribution
It introduces a novel boundary marker approach that maintains character-level vocabulary while leveraging word information, enabling better OOV handling and semantic integration.
Findings
Improves downstream task performance with boundary markers
Handles OOV words effectively without expanding vocabulary
Easily incorporates richer semantic information like POS tags
Abstract
In this work, we present MarkBERT, a Chinese BERT model that uses word information. Existing word-based BERT models treat words as basic units; however, because of BERT's vocabulary limit, they cover only high-frequency words and fall back to the character level when they encounter out-of-vocabulary (OOV) words. Unlike existing work, MarkBERT keeps a vocabulary of Chinese characters and inserts boundary markers between contiguous words. This design lets the model handle all words in the same way, whether or not they are OOV. Our model also has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which complement traditional character-level and sentence-level pretraining tasks; second, it can easily incorporate richer semantics, such as the POS tags of words, by replacing generic markers with POS tag-specific…
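The marker idea described in the abstract can be illustrated with a minimal sketch. It assumes the sentence has already been segmented into words by some external tool. The marker string `[S]`, the function name `insert_markers`, and the POS tag labels are hypothetical placeholders, not taken from the paper's code. The sketch also places a marker after every word, including the last one, so that each word has a slot for its POS tag; the abstract only states that markers sit between contiguous words.

```python
from typing import List, Optional

# Hypothetical generic boundary marker; the real token name may differ.
GENERIC_MARKER = "[S]"


def insert_markers(words: List[str],
                   pos_tags: Optional[List[str]] = None) -> List[str]:
    """Turn pre-segmented words into a character sequence with boundary markers.

    Each word is split into single characters, so the vocabulary stays
    character-level. A marker token follows each word (assumption: the
    last word gets one too). If pos_tags is given, the generic marker is
    replaced by a tag-specific one such as "[n]" or "[v]".
    """
    if pos_tags is not None and len(pos_tags) != len(words):
        raise ValueError("pos_tags must have one entry per word")

    tokens: List[str] = []
    for i, word in enumerate(words):
        tokens.extend(word)  # iterating a str yields its characters
        if pos_tags is None:
            tokens.append(GENERIC_MARKER)
        else:
            tokens.append(f"[{pos_tags[i]}]")
    return tokens


# Example segmentation (illustrative; any segmenter could produce it).
words = ["我", "喜欢", "自然语言"]
generic_tokens = insert_markers(words)
pos_tokens = insert_markers(words, pos_tags=["r", "v", "n"])
print(generic_tokens)
print(pos_tokens)
```

Because every word is reduced to characters plus a marker, a rare or unseen word goes through exactly the same path as a frequent one, which is the OOV property the abstract claims. Only the marker tokens need to be added to the character vocabulary.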
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay · WordPiece
