TL;DR
This paper introduces CARDS, a contrastive learning method that enhances sentence embeddings by using case-augmented positives and retrieved hard negatives, achieving state-of-the-art results in unsupervised settings.
Contribution
It proposes novel case augmentation and hard negative sampling techniques to improve the quality of contrastive learning for sentence embeddings.
Findings
CARDS outperforms previous SOTA methods on STS benchmarks
Case augmentation reduces bias in token embeddings
Hard negative sampling improves embedding discrimination
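To make the hard negative finding concrete, here is a minimal sketch of retrieving hard negatives by embedding similarity. It assumes sentence embeddings have already been computed by some pre-trained language model. The toy vectors, the function names `cosine` and `retrieve_hard_negatives`, the parameter `k`, and the choice of top-k cosine similarity are all illustrative assumptions; the summary only states that negatives are sampled from the whole dataset based on a pre-trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_hard_negatives(embeddings, query_idx, k=2):
    """Return indices of the k sentences most similar to the query, excluding the query.

    Assumption: "hard" means "close in embedding space"; the actual sampling
    scheme used by CARDS may differ.
    """
    query = embeddings[query_idx]
    scored = [
        (cosine(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_idx
    ]
    # Highest similarity first: these are the hardest candidates to tell apart.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for the output of a pre-trained encoder (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hard_negatives = retrieve_hard_negatives(toy_embeddings, query_idx=0, k=2)
```

Near neighbours of the anchor sentence are harder to separate than random in-batch negatives, which is the intuition behind the improved discrimination reported above.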
Abstract
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward frequency, word case, and subwords. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining the above two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current…
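The switch-case augmentation described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `switch_case`, the per-word flip probability `p` and its default value, and whitespace tokenization are all hypothetical choices.

```python
import random


def switch_case(sentence, p=0.05, rng=None):
    """Flip the case of the first letter of randomly selected words.

    p   -- assumed per-word probability of flipping (hypothetical default)
    rng -- optional random.Random instance, for reproducibility
    """
    if rng is None:
        rng = random.Random()
    augmented = []
    for word in sentence.split(" "):
        # Only words that start with a letter can have their case flipped.
        if word and word[0].isalpha() and rng.random() < p:
            word = word[0].swapcase() + word[1:]
        augmented.append(word)
    return " ".join(augmented)


# The augmented sentence serves as a positive for the original one.
original = "Contrastive learning improves sentence embeddings"
positive = switch_case(original, p=0.5, rng=random.Random(0))
```

Because only letter case changes, the augmented sentence keeps the meaning of the original while typically mapping to different tokens, which is why it can be used as a positive pair.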
Taxonomy
Methods: FLIP · Contrastive Learning · SimCSE
