Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Yiren Jian, Chongyang Gao, Soroush Vosoughi

TL;DR
This paper demonstrates that training Transformer-based sentence encoders with multi-modal, multi-task contrastive losses on unpaired non-linguistic data improves semantic sentence representations across multiple benchmarks. Because the additional data is unpaired and non-linguistic, the approach is language-agnostic.
Contribution
The paper introduces a multi-modal, multi-task contrastive training framework that uses unpaired non-linguistic data to improve sentence embeddings.
Findings
Improved performance on 7 semantic textual similarity benchmarks.
Multi-modal training leads to better generalization of sentence encoders.
The approach is effective across different languages and modalities.
Abstract
Semantic representation learning for sentences is an important and well-studied problem in NLP. The current trend for this task involves training a Transformer-based sentence encoder through a contrastive objective with text, i.e., clustering sentences with semantically similar meanings and scattering others. In this work, we find the performance of Transformer models as sentence encoders can be improved by training with multi-modal multi-task losses, using unpaired examples from another modality (e.g., sentences and unrelated image/audio data). In particular, besides learning by the contrastive loss on text, our model clusters examples from a non-linguistic domain (e.g., visual/audio) with a similar contrastive loss at the same time. The reliance of our framework on unpaired non-linguistic data makes it language-agnostic, enabling it to be widely applicable beyond English NLP.…
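The multi-task objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an InfoNCE-style contrastive loss for both tasks, assumes the embeddings for both modalities come from the same shared encoder (not shown), and the names `info_nce`, `multi_task_loss`, `temperature`, and `weight` are hypothetical choices made for this example.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: row i of `anchors` should match row i of
    `positives`; every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def multi_task_loss(text_a, text_b, other_a, other_b, weight=1.0):
    """Text contrastive loss plus a weighted contrastive loss on an
    unpaired non-linguistic batch (e.g. image or audio embeddings).
    The two batches are unrelated; they are only assumed to share an encoder."""
    return info_nce(text_a, text_b) + weight * info_nce(other_a, other_b)

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))    # hypothetical sentence embeddings
image = rng.normal(size=(8, 16))   # hypothetical image embeddings
loss = multi_task_loss(text, text, image, image)
```

Note that the text batch and the non-linguistic batch never interact inside the loss, which reflects the abstract's point that the examples are unpaired.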
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Residual Connection · Dense Connections
