SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot

TL;DR
SONAR is a multilingual, multimodal sentence embedding space whose text encoder and decoder cover 200 languages. It supports similarity search, text-to-text translation, and speech-to-text translation across languages and modalities, with text translation results competitive with NLLB 1B.
Contribution
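SONAR presents a unified fixed-size embedding space shared by text and speech: a single text encoder covers 200 languages, and language-specific speech encoders map speech segments into the same space. It outperforms existing sentence embeddings on similarity search and enables translation for zero-shot language and modality combinations.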
Findings
Outperforms LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks
Embeds speech and text in a shared space, with speech encoders that outperform existing speech encoders on similarity search
Achieves competitive zero-shot translation results with fixed-size representations
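
To make the similarity search finding concrete, the following is a minimal sketch of how a search error rate over aligned sentence pairs can be computed. It assumes plain cosine similarity and nearest-neighbor retrieval; the exact scoring used by the xsim and xsim++ benchmarks may differ. The function names and toy vectors are hypothetical and are not taken from the SONAR code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search_error_rate(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumes src_embs[i] and tgt_embs[i] embed translations of the same sentence.
    """
    errors = 0
    for i, src in enumerate(src_embs):
        scores = [cosine(src, tgt) for tgt in tgt_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != i:
            errors += 1
    return errors / len(src_embs)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.8]]

aligned_error = search_error_rate(src, tgt)
# Swapping the first two targets breaks the alignment for two of three pairs.
swapped_error = search_error_rate(src, [tgt[1], tgt[0], tgt[2]])
```

A lower error rate means that translations of the same sentence sit closer together in the embedding space than unrelated sentences do.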
Abstract
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text…
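
The teacher-student setting mentioned in the abstract can be illustrated with a toy sketch: a frozen teacher supplies the text embedding of a transcription, and a student is updated so that its embedding of the matching speech input moves toward it. The mean squared error loss, the linear student, and every name and number below are assumptions made for illustration; the abstract does not specify the loss or the encoder architecture.

```python
def student_forward(weights, features):
    """Toy linear 'speech encoder': one weight row per embedding dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse(student_emb, teacher_emb):
    """Mean squared error between student and teacher embeddings."""
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(teacher_emb)


def sgd_step(weights, features, teacher_emb, lr=0.1):
    """One gradient step on the MSE loss; the teacher embedding stays fixed."""
    pred = student_forward(weights, features)
    dim = len(teacher_emb)
    updated = []
    for row, p, t in zip(weights, pred, teacher_emb):
        grad_scale = 2.0 * (p - t) / dim  # d(loss)/d(pred_i)
        updated.append([w - lr * grad_scale * x for w, x in zip(row, features)])
    return updated


# Hypothetical data: pooled speech features and the frozen teacher's
# embedding of the corresponding transcription.
speech_features = [1.0, 0.5]
teacher_embedding = [0.2, -0.4]

weights = [[0.0, 0.0], [0.0, 0.0]]
initial_loss = mse(student_forward(weights, speech_features), teacher_embedding)
for _ in range(200):
    weights = sgd_step(weights, speech_features, teacher_embedding)
final_loss = mse(student_forward(weights, speech_features), teacher_embedding)
```

Because only the student is updated, the embedding space defined by the teacher stays unchanged, which is what lets a decoder trained on text embeddings be applied to speech embeddings.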
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
