Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

TL;DR
This paper introduces an unsupervised method for aligning speech and text embedding spaces. The aligned spaces support cross-modal tasks such as spoken word classification and translation, which is especially useful for low-resource languages where parallel speech-text data is scarce.
Contribution
The paper proposes a framework that learns speech and text embedding spaces separately and then aligns them via adversarial training followed by a refinement procedure, extending unsupervised cross-lingual word embedding techniques to the speech-text setting.
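The summary does not say what the refinement procedure is. In the unsupervised cross-lingual embedding work this paper builds on, refinement is commonly an orthogonal Procrustes fit over a set of trusted matched pairs, so the sketch below assumes that form. The function name `procrustes_align` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X: (n, d) source embeddings (standing in for speech embeddings).
    Y: (n, d) target embeddings (standing in for text embeddings),
       where row i of X is assumed to match row i of Y.
    """
    # Closed-form solution: W = U @ Vt, where U, Vt come from SVD(X^T Y).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: build Y as an exact rotation of X and recover that rotation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
Y = X @ Q

W = procrustes_align(X, Y)
print(np.allclose(W, Q, atol=1e-6))
```

In an unsupervised pipeline the matched pairs would not be given. They would be induced from the mapping produced by the adversarial step, and the fit could be repeated a few times. That loop is omitted here.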
Findings
The unsupervised alignment achieves performance comparable to its supervised counterpart on spoken word classification and translation.
The alignment requires no parallel speech-text supervision, which suits low-resource languages where such data is scarce.
The framework provides a foundation for developing ASR and speech translation systems in underrepresented languages.
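One way to picture spoken word classification with aligned spaces is nearest-neighbor retrieval: map a speech embedding into the text space and return the closest text word by cosine similarity. This is a minimal sketch under that assumption, not a description of the paper's evaluation protocol. The function `classify_spoken_words`, the three-word vocabulary, and the mapping `W` are all hypothetical.

```python
import numpy as np

def classify_spoken_words(speech_vecs, W, text_vecs, vocab):
    """Label each speech embedding with its nearest text word.

    speech_vecs: (n, d) speech embeddings.
    W:           (d, d) learned mapping from speech space to text space.
    text_vecs:   (v, d) text embeddings, one row per vocabulary word.
    vocab:       list of v words aligned with the rows of text_vecs.
    """
    mapped = speech_vecs @ W
    # Normalize both sides so the dot product is cosine similarity.
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    text_unit = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sims = mapped @ text_unit.T
    return [vocab[i] for i in sims.argmax(axis=1)]

# Toy data: the text space is the identity basis, and W is a fixed
# orthogonal (permutation) matrix standing in for a learned mapping.
vocab = ["cat", "dog", "tree"]
text_vecs = np.eye(3)
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Simulate spoken "tree" and "cat" as un-mapped text vectors plus a small offset.
speech_vecs = text_vecs[[2, 0]] @ W.T + 0.01

predictions = classify_spoken_words(speech_vecs, W, text_vecs, vocab)
print(predictions)
```

Swapping `text_vecs` and `vocab` for embeddings of another language would turn the same lookup into a rough spoken word translation, again only as an illustration.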
Abstract
Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform spoken word classification and translation, and the results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
