WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, Nan Duan

TL;DR
WhiteningBERT introduces a simple, unsupervised sentence embedding method that improves performance by combining layer outputs and applying a whitening normalization, validated across multiple models and datasets.
Contribution
The paper presents a straightforward whitening-based normalization technique that enhances unsupervised sentence embeddings from pretrained models.
Findings
Averaging all token embeddings outperforms using only the [CLS] token.
Combining top and bottom layer outputs yields better embeddings.
A simple whitening normalization consistently boosts performance.
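The first two findings can be illustrated with a minimal NumPy sketch. It assumes the per-layer hidden states have already been extracted from a pretrained model as arrays of shape (batch, seq_len, dim), along with a padding mask. The function names `mean_pool` and `combine_layers` and the default layer indices are hypothetical choices for illustration, not taken from the paper's code.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (batch, seq_len, dim) token embeddings from one layer.
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids /0
    return summed / counts

def combine_layers(layer_outputs, mask, layers=(1, -1)):
    """Average the mean-pooled sentence vectors of the selected layers.

    layer_outputs: sequence of (batch, seq_len, dim) arrays, one per layer.
    layers: indices of the layers to combine (illustrative default: one
            bottom layer and the top layer; this choice is an assumption).
    """
    pooled = [mean_pool(layer_outputs[i], mask) for i in layers]
    return np.mean(pooled, axis=0)                   # (batch, dim)
```

By contrast, a [CLS]-only embedding would simply take `hidden[:, 0, :]` from a single layer; the sketch above uses every non-padding token and more than one layer.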
Abstract
Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all tokens is better than only using the [CLS] vector. Second, combining both top and bottom layers is better than only using top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.
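As a rough illustration of what a whitening-based normalization can look like, the sketch below centers a set of sentence embeddings and applies a linear map so that their covariance becomes approximately the identity. This is a common SVD-based formulation of whitening and may differ in detail from the authors' code; the names `fit_whitening` and `whiten` and the `eps` parameter are assumptions introduced here.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-12):
    """Estimate the whitening parameters from (n_sentences, dim) embeddings."""
    mu = embeddings.mean(axis=0, keepdims=True)   # (1, dim) mean vector
    cov = np.cov((embeddings - mu).T)             # (dim, dim) covariance
    u, s, _ = np.linalg.svd(cov)                  # cov = u @ diag(s) @ u.T
    w = u @ np.diag(1.0 / np.sqrt(s + eps))       # rotate, then rescale axes
    return mu, w

def whiten(embeddings, mu, w):
    """Center the embeddings and map them to (near) identity covariance."""
    return (embeddings - mu) @ w
```

In use, `mu` and `w` would be fit once on a collection of sentence embeddings and then applied to every embedding before computing cosine similarity.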
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
