TransAug: Translate as Augmentation for Sentence Embeddings
Jue Wang

TL;DR
TransAug leverages translated sentence pairs as data augmentation to improve sentence embeddings, using a two-stage process involving cross-lingual contrastive learning and encoder distillation, achieving state-of-the-art results.
Contribution
Introduces TransAug, a novel method utilizing translation-based augmentation and a two-stage training paradigm for enhanced sentence embeddings.
Findings
Achieves a new state of the art on standard semantic textual similarity (STS) tasks.
Outperforms existing models like SimCSE and Sentence-T5.
Demonstrates effectiveness on transfer tasks evaluated by SentEval.
Abstract
While contrastive learning greatly advances the representation of sentence embeddings, it is still limited by the size of the existing sentence datasets. In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduce a two-stage paradigm to advance the state of the art in sentence embeddings. Instead of adopting an encoder trained in another language setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regarded as implicit data augmentation. Then, we update only the English encoder via cross-lingual contrastive learning and freeze the distilled Chinese encoder. Our approach achieves a new state of the art on standard semantic textual similarity (STS), outperforming both…
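The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact objectives, so everything here is an assumption: stage 1 is written as a mean squared error between student (Chinese) and teacher (English) embeddings, stage 2 as an InfoNCE loss with in-batch negatives, and the temperature of 0.05 is chosen for illustration. The function names and toy embeddings are hypothetical, and plain Python lists stand in for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_embs, teacher_embs):
    """Stage 1 (assumed objective): mean squared error that pulls the
    student's embedding of each translated sentence toward the teacher's
    embedding of the original English sentence."""
    total, count = 0.0, 0
    for student, teacher in zip(student_embs, teacher_embs):
        for a, b in zip(student, teacher):
            total += (a - b) ** 2
            count += 1
    return total / count


def cross_lingual_info_nce(en_embs, zh_embs, temperature=0.05):
    """Stage 2 (assumed objective): InfoNCE with in-batch negatives.

    zh_embs[i] is the positive for en_embs[i]; every other row of zh_embs
    acts as a negative. The Chinese embeddings come from the frozen encoder,
    so in a real training loop only the English encoder would get gradients.
    """
    losses = []
    for i, en in enumerate(en_embs):
        logits = [cosine(en, zh) / temperature for zh in zh_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings (hypothetical values).
en = [[1.0, 0.0], [0.0, 1.0]]
zh_aligned = [[0.9, 0.1], [0.1, 0.9]]   # translations close to their sources
zh_swapped = [[0.1, 0.9], [0.9, 0.1]]   # translations paired with the wrong source

print(cross_lingual_info_nce(en, zh_aligned))
print(cross_lingual_info_nce(en, zh_swapped))
```

On the toy data, the aligned pairing yields a much lower contrastive loss than the swapped pairing, which is the signal that would push the English encoder toward the frozen Chinese encoder's space.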
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Contrastive Learning · SimCSE
