TL;DR
Augmented SBERT introduces a data augmentation technique using cross-encoders to label additional data, significantly enhancing bi-encoder performance in pairwise sentence scoring tasks, especially in domain adaptation scenarios.
Contribution
The paper proposes a simple, efficient data augmentation method for bi-encoders: a cross-encoder labels a larger set of sentence pairs, and the bi-encoder is then trained on the augmented data.
Findings
- Up to 6-point improvement on in-domain tasks
- Up to 37-point improvement on domain adaptation tasks
- Effective sentence pair selection is crucial for success
Abstract
There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of…
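The data flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `token_overlap` is a toy stand-in for a trained cross-encoder, `select_pairs` is a hypothetical lexical-overlap heuristic rather than any selection strategy evaluated in the paper, and all function names are invented for this example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Labeled = Tuple[str, str, float]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens.

    ASSUMPTION: toy stand-in for a cross-encoder; a real setup would score
    the pair with a model fine-tuned on the gold data.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_pairs(sentences: List[str], gold: List[Labeled], top_k: int = 1) -> List[Pair]:
    """Hypothetical selection step: for each sentence, keep its top_k
    lexically closest partners, skipping pairs that are already labeled."""
    seen = {frozenset((a, b)) for a, b, _ in gold}
    selected: List[Pair] = []
    for s in sentences:
        candidates = [t for t in sentences if t != s]
        candidates.sort(key=lambda t: token_overlap(s, t), reverse=True)
        for t in candidates[:top_k]:
            key = frozenset((s, t))
            if key not in seen:
                seen.add(key)
                selected.append((s, t))
    return selected


def augment(
    gold: List[Labeled],
    sentences: List[str],
    cross_encoder_score: Callable[[str, str], float],
    top_k: int = 1,
) -> List[Labeled]:
    """Label the selected pairs with the scorer and append them to the gold data."""
    silver = [(a, b, cross_encoder_score(a, b)) for a, b in select_pairs(sentences, gold, top_k)]
    return gold + silver


gold = [("a man plays guitar", "a man is playing a guitar", 0.9)]
sentences = [
    "a man plays guitar",
    "a man is playing a guitar",
    "a woman plays guitar",
    "the stock market fell",
]
train_set = augment(gold, sentences, token_overlap, top_k=1)
for a, b, score in train_set:
    print(f"{score:.2f}  {a!r} | {b!r}")
```

In a realistic setting the resulting `train_set` (gold plus machine-labeled pairs) would be used to fine-tune the bi-encoder. Changing only `select_pairs` while keeping the scorer fixed is a simple way to see how much the choice of pairs shapes the augmented data.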
Code & Models
- 🤗 dangvantuan/vietnamese-embedding (model) · 208k dl · ♡ 50
- 🤗 lengocduc195/SentenceTransformer (model)
- 🤗 deliciouscat/kf-deberta-base-cross-sts (model) · 4 dl · ♡ 1
- 🤗 Lajavaness/bilingual-embedding-large (model) · 7.8k dl · ♡ 27
- 🤗 Lajavaness/bilingual-embedding-base (model) · 3.2k dl · ♡ 9
- 🤗 Lajavaness/bilingual-document-embedding (model) · 194 dl · ♡ 8
- 🤗 yco/bilingual-embedding-base (model) · 73 dl
- 🤗 Lajavaness/bilingual-embedding-small (model) · 6.2k dl · ♡ 8
- 🤗 yco/bilingual-embedding-base-onnx (model) · 13 dl
Taxonomy
Methods: Sentence-BERT · Augmented SBERT · Siamese Network · Adam · Dropout · Softmax · BERT
