On the Sentence Embeddings from Pre-trained Language Models
Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li

TL;DR
This paper analyzes why BERT sentence embeddings perform poorly on semantic similarity, shows that they occupy an anisotropic space, and proposes a normalizing-flow transformation that improves their semantic quality and achieves state-of-the-art results.
Contribution
It uncovers the anisotropic structure of BERT sentence embeddings and introduces BERT-flow, a method to transform embeddings into a more isotropic, Gaussian-like space for better semantic similarity performance.
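The idea of mapping embeddings into an isotropic, Gaussian-like space can be illustrated with a much simpler stand-in. The sketch below is not BERT-flow: instead of a learned normalizing flow it uses an affine whitening map, which is about the simplest invertible transformation that sends data to zero mean and identity covariance. The data is synthetic, and the names `fit_affine_flow` and `log_likelihood` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def fit_affine_flow(u: np.ndarray):
    """Fit z = (u - mean) @ W so that z has zero mean and identity covariance."""
    mean = u.mean(axis=0)
    cov = np.cov(u - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector column by 1/sqrt(eigenvalue): W = V diag(lambda^-1/2).
    W = eigvecs / np.sqrt(eigvals)
    return mean, W

def log_likelihood(u: np.ndarray, mean: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Change of variables: log p(u) = log N(z; 0, I) + log|det dz/du|."""
    z = (u - mean) @ W
    d = u.shape[1]
    log_prior = -0.5 * (z ** 2).sum(axis=1) - 0.5 * d * np.log(2 * np.pi)
    _, log_abs_det = np.linalg.slogdet(W)
    return log_prior + log_abs_det

rng = np.random.default_rng(0)
# Synthetic stand-in for sentence embeddings: unequal scales plus a shared shift.
u = rng.normal(size=(500, 8)) * np.linspace(0.2, 3.0, 8) + 2.0

mean, W = fit_affine_flow(u)
z = (u - mean) @ W

# Compare against treating the raw vectors as if they were already standard Gaussian.
ll_flow = log_likelihood(u, mean, W).mean()
ll_identity = log_likelihood(u, np.zeros(8), np.eye(8)).mean()
print(f"mean log-likelihood: flow={ll_flow:.2f}, identity={ll_identity:.2f}")
```

The transformed vectors `z` have identity covariance by construction, and the data is more likely under the fitted map than under the identity map. A learned flow plays the same role with a far more expressive invertible function.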
Findings
BERT sentence embeddings form a non-smooth, anisotropic semantic space.
Transforming the embeddings toward an isotropic Gaussian distribution improves semantic similarity performance.
BERT-flow outperforms previous state-of-the-art methods.
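One common way to see what anisotropy means in practice is to measure the average cosine similarity between unrelated vectors: in an isotropic space it sits near zero, while in an anisotropic space most vectors point in a similar direction and the average is high. The sketch below uses synthetic vectors with a shared offset rather than real BERT outputs, and `mean_pairwise_cosine` is a hypothetical helper, not a metric taken from the paper.

```python
import numpy as np

def mean_pairwise_cosine(x: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Exclude the diagonal, where each vector is compared with itself.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Synthetic stand-in for sentence embeddings: isotropic noise plus a shared
# offset, so every vector points in roughly the same direction.
offset = 5.0 * np.ones(32)
embeddings = rng.normal(size=(200, 32)) + offset

anisotropic_score = mean_pairwise_cosine(embeddings)
centered_score = mean_pairwise_cosine(embeddings - embeddings.mean(axis=0))
print(f"raw: {anisotropic_score:.3f}, centered: {centered_score:.3f}")
```

Removing the shared offset is enough to bring the average close to zero in this toy setup. Real embedding spaces are not necessarily fixed by centering alone, which is part of the motivation for a learned transformation.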
Abstract
Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Dropout · Attention Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Layer Normalization
