BURT: BERT-inspired Universal Representation from Learning Meaningful Segment
Yian Li, Hai Zhao

TL;DR
This paper introduces BURT, a universal language representation model that encodes multiple linguistic levels into a single vector space, improving performance across various NLP tasks and benchmarks.
Contribution
The paper proposes a novel pre-training approach that incorporates multi-level linguistic segments into a unified embedding space, enhancing cross-level language understanding.
Findings
Outperforms baselines on GLUE and CLUE benchmarks
Effective in text matching and question-answering tasks
Universal representations improve retrieval-based NLP applications
Abstract
Although pre-trained contextualized language models such as BERT achieve strong performance on various downstream tasks, current language representation still focuses only on linguistic objectives at a specific granularity, which may not be applicable when multiple levels of linguistic units are involved at the same time. Thus this work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space. We present a universal representation model, BURT (BERT-inspired Universal Representation from learning meaningful segmenT), to encode different levels of linguistic units into the same vector space. Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate different granular objectives into the pre-training stage. We conduct experiments on datasets for…
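To make the PMI-based segment idea in the abstract concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it scores only adjacent word pairs with the textbook definition PMI(x, y) = log(p(x, y) / (p(x) p(y))), whereas the paper's exact n-gram scoring, thresholds, and masking schedule are not given in this summary. The function names `bigram_pmi` and `mask_top_segment` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI} for every adjacent word pair.

    `sentences` is a list of token lists. Probabilities are simple
    relative frequencies (an assumption; no smoothing is applied).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores


def mask_top_segment(tokens, scores, mask_token="[MASK]"):
    """Mask the adjacent pair with the highest PMI as a single segment."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens)
    # Unseen pairs get -inf so they are never preferred over scored ones.
    best = max(range(len(pairs)),
               key=lambda i: scores.get(pairs[i], float("-inf")))
    masked = list(tokens)
    masked[best] = masked[best + 1] = mask_token
    return masked


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "far"],
    ["the", "city", "is", "big"],
]
pmi = bigram_pmi(corpus)
print(mask_top_segment(["new", "york", "is", "big"], pmi))
```

On this toy corpus the pair ("new", "york") scores higher than ("york", "is"), so the whole segment is masked together rather than as independent tokens. Note that raw PMI favors rare pairs, so a practical pipeline would likely add a minimum-frequency cutoff; that choice is also an assumption here.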
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Linear Layer · Attention Is All You Need · Dropout · Adam · Multi-Head Attention · WordPiece · Residual Connection · Layer Normalization · Linear Warmup With Linear Decay · Dense Connections
