WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Junjie Huang; Duyu Tang; Wanjun Zhong; Shuai Lu; Linjun Shou; Ming; Gong; Daxin Jiang; Nan Duan

arXiv:2104.01767·cs.CL·April 12, 2021

WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming, Gong, Daxin Jiang, Nan Duan

PDF

Open Access 1 Repo

TL;DR

WhiteningBERT introduces a simple, unsupervised sentence embedding method that improves performance by combining layer outputs and applying a whitening normalization, validated across multiple models and datasets.

Contribution

The paper presents a straightforward whitening-based normalization technique that enhances unsupervised sentence embeddings from pretrained models.

Findings

01

Averaging all token embeddings outperforms using only the [CLS] token.

02

Combining top and bottom layer outputs yields better embeddings.

03

A simple whitening normalization consistently boosts performance.

Abstract

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study on four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have there main findings. First, averaging all tokens is better than only using [CLS] vector. Second, combining both top andbottom layers is better than only using top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Jun-jie-Huang/WhiteningBERT
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications