TransAug: Translate as Augmentation for Sentence Embeddings

Jue Wang

arXiv:2111.00157 · cs.CL · June 4, 2025

Open Access

TL;DR

TransAug uses translated sentence pairs as data augmentation to improve sentence embeddings. Its two-stage process first distills a Chinese encoder from an English SimCSE encoder, then trains the English encoder with cross-lingual contrastive learning, achieving state-of-the-art results on semantic textual similarity tasks.

Contribution

Introduces TransAug, a novel method utilizing translation-based augmentation and a two-stage training paradigm for enhanced sentence embeddings.

Findings

01. Achieves new state-of-the-art on semantic textual similarity tasks.

02. Outperforms existing models like SimCSE and Sentence-T5.

03. Demonstrates effectiveness on transfer tasks evaluated by SentEval.

Abstract

While contrastive learning greatly advances the representation of sentence embeddings, it is still limited by the size of the existing sentence datasets. In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduce a two-stage paradigm to advance the state of the art in sentence embeddings. Instead of adopting an encoder trained in another language setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained in English), so that their embeddings are close in semantic space, which can be regarded as implicit data augmentation. Then, we update only the English encoder via cross-lingual contrastive learning and freeze the distilled Chinese encoder. Our approach achieves a new state of the art on standard semantic textual similarity (STS), outperforming both…
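The two-stage paradigm in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the loss functions, so the mean-squared-error distillation objective, the InfoNCE-style contrastive loss with in-batch negatives, the temperature of 0.05, and all function names below are assumptions chosen for illustration. Random vectors stand in for encoder outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def distillation_loss(student_zh, teacher_en):
    """Stage 1 (assumed MSE objective): pull the Chinese student's embeddings
    toward the frozen English teacher's embeddings of the translated sentences."""
    return float(np.mean((student_zh - teacher_en) ** 2))


def cross_lingual_info_nce(en_emb, zh_emb, temperature=0.05):
    """Stage 2 (assumed InfoNCE objective): row i of en_emb should match row i
    of zh_emb; every other row in the batch acts as a negative. In training,
    zh_emb would come from the frozen Chinese encoder and only the English
    encoder would receive gradients."""
    en = l2_normalize(en_emb)
    zh = l2_normalize(zh_emb)
    logits = en @ zh.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy embeddings standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
zh = rng.normal(size=(4, 8))                       # frozen Chinese embeddings
en_aligned = zh + 0.01 * rng.normal(size=(4, 8))   # English side, correctly paired
en_shuffled = en_aligned[::-1]                     # English side, mispaired

aligned_loss = cross_lingual_info_nce(en_aligned, zh)
shuffled_loss = cross_lingual_info_nce(en_shuffled, zh)
print(aligned_loss < shuffled_loss)
```

In this sketch, correctly paired translations yield a lower contrastive loss than mispaired ones, which is the signal the second stage would use to update the English encoder.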

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Contrastive Learning · SimCSE