On the Sentence Embeddings from Pre-trained Language Models

Bohan Li; Hao Zhou; Junxian He; Mingxuan Wang; Yiming Yang; and Lei Li

arXiv:2011.05864 · cs.CL · November 12, 2020 · 24 cites

TL;DR

This paper analyzes the limitations of BERT sentence embeddings for semantic similarity, revealing their anisotropic nature, and proposes a normalizing flow-based transformation to improve their semantic quality, achieving state-of-the-art results.

Contribution

The paper uncovers the anisotropic structure of BERT sentence embeddings and introduces BERT-flow, a method that transforms the embeddings into a more isotropic, Gaussian-like space to improve semantic similarity performance.

Findings

01 BERT sentence embeddings occupy a non-smooth, anisotropic semantic space.

02 Transforming the embeddings toward a smooth, isotropic Gaussian distribution improves semantic similarity performance.

03 BERT-flow outperforms previous state-of-the-art methods.
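One rough way to make the anisotropy finding concrete is to measure the average cosine similarity between pairs of embeddings: in an isotropic space it sits near zero, while in an anisotropic one it is far from zero. The sketch below is illustrative only. It uses synthetic vectors rather than real BERT outputs, the function names `mean_pairwise_cosine` and `whiten` are hypothetical, and plain linear whitening stands in for the paper's normalizing flow, which it does not reproduce.

```python
import numpy as np

def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Drop the diagonal (self-similarity is always 1) before averaging.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

def whiten(x):
    """Center the rows and rescale so the sample covariance is the identity.

    Assumption: a linear stand-in for an isotropy-inducing transform,
    not the flow-based method the paper proposes.
    """
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs / np.sqrt(eigvals)

rng = np.random.default_rng(0)
# Synthetic "embeddings": a shared offset plus unequal per-dimension
# scales, which together produce a strongly anisotropic cloud.
scales = np.linspace(0.1, 2.0, 16)
embeddings = rng.normal(size=(500, 16)) * scales + 3.0

before = mean_pairwise_cosine(embeddings)
after = mean_pairwise_cosine(whiten(embeddings))
print(f"mean pairwise cosine before: {before:.3f}, after: {after:.3f}")
```

On this toy data the shared offset pushes every vector in a similar direction, so the first number is large; after whitening the directions spread out and the average falls close to zero.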

Abstract

Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned…
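The transformation mentioned in the abstract relies on normalizing flows, which are invertible maps with tractable densities. As a rough sketch in notation chosen here (the symbols z, u, and f_φ are assumptions for illustration, not taken from this page), the standard change-of-variables identity expresses the density of an observed embedding u through a latent standard Gaussian variable z:

```latex
\begin{align}
z &\sim \mathcal{N}(0, I), \qquad u = f_\phi(z), \\
\log p_U(u) &= \log p_Z\!\left(f_\phi^{-1}(u)\right)
  + \log \left| \det \frac{\partial f_\phi^{-1}(u)}{\partial u} \right| .
\end{align}
```

In a maximum-likelihood setup, increasing this quantity over observed embeddings adjusts φ so that the inverse map carries the embeddings toward the isotropic Gaussian latent space.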

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Dropout · Attention Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Layer Normalization