Universal Text Representation from BERT: An Empirical Study

Xiaofei Ma; Zhiguo Wang; Patrick Ng; Ramesh Nallapati; Bing Xiang

arXiv:1910.07973 · cs.CL · October 25, 2019 · 40 cites

TL;DR

This paper systematically investigates layer-wise BERT activations as general-purpose text representations, examining what linguistic information they capture and how well they transfer across tasks. Sentence-level embeddings are compared against two state-of-the-art models, and passage-level embeddings against a BM25 baseline.

Contribution

It provides a layer-by-layer analysis of BERT embeddings, showing that fine-tuning on natural language inference data and combining embeddings from different layers improve their quality and transferability across NLP tasks.

Findings

01

Pre-trained BERT embeddings perform poorly on semantic similarity tasks and on sentence surface information probing tasks.

02

Fine-tuning BERT on natural language inference data significantly enhances embedding quality.

03

BERT embeddings outperform BM25 on factoid QA datasets but not on non-factoid datasets.
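For readers unfamiliar with the baseline named in the third finding, the following is a minimal sketch of Okapi BM25 passage scoring. It is not taken from the paper: the function names are hypothetical, the inputs are assumed to be pre-tokenized, and the parameters k1=1.2 and b=0.75 are common defaults rather than the settings the authors used.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Score each tokenized passage against a tokenized query with Okapi BM25.

    k1 and b are common default values (an assumption, not the paper's setup).
    """
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Document frequency: number of passages that contain each term.
    df = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(p) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def rank(query, passages):
    """Return passage indices ordered from highest to lowest BM25 score."""
    scores = bm25_scores(query, passages)
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```

BM25 relies only on lexical overlap between query and passage, which is why it serves as a natural point of comparison for dense embeddings in a ranking setting.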

Abstract

We present a systematic investigation of layer-wise BERT activations for general-purpose text representations to understand what linguistic information they capture and how transferable they are across different tasks. Sentence-level embeddings are evaluated against two state-of-the-art models on downstream and probing tasks from SentEval, while passage-level embeddings are evaluated on four question-answering (QA) datasets under a learning-to-rank problem setting. Embeddings from the pre-trained BERT model perform poorly in semantic similarity and sentence surface information probing tasks. Fine-tuning BERT on natural language inference data greatly improves the quality of the embeddings. Combining embeddings from different BERT layers can further boost performance. BERT embeddings significantly outperform the BM25 baseline on factoid QA datasets at the passage level, but fail to perform…
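The idea of turning layer-wise activations into one text embedding and combining several layers can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: mean pooling over tokens and concatenation across layers are assumed choices, the helper names are hypothetical, and the activations are plain nested lists rather than outputs of a real BERT model.

```python
import math


def mean_pool(token_vectors):
    """Average one layer's token vectors into a single fixed-size vector.

    Mean pooling is an assumed pooling choice for this sketch.
    """
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / count for d in range(dim)]


def combine_layers(layer_activations, layers):
    """Concatenate the mean-pooled embeddings of the selected layers.

    layer_activations[i] holds the token vectors produced by layer i.
    Concatenation is an assumed way of combining layers.
    """
    combined = []
    for idx in layers:
        combined.extend(mean_pool(layer_activations[idx]))
    return combined


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In a ranking setting, a question embedding and each candidate passage embedding built this way could be compared with `cosine`, and passages sorted by the resulting score.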

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax