Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder
Jingru Lin, Xianghu Yue, Junyi Ao, Haizhou Li

TL;DR
This paper introduces a self-supervised learning method using a transformer encoder to produce robust acoustic word embeddings that are invariant to speech variations, achieving state-of-the-art results in low-resource cross-lingual settings.
Contribution
It proposes the Correspondence Transformer Encoder, a novel self-supervised framework that learns invariant AWEs from unlabelled speech using a teacher-student approach.
Findings
Embeddings are robust to speaker and domain variations.
Achieves state-of-the-art performance on low-resource cross-lingual tasks.
Effective in learning from unlabelled speech data.
Abstract
Acoustic word embeddings (AWEs) aim to map a variable-length speech segment into a fixed-dimensional representation. High-quality AWEs should be invariant to variations, such as duration, pitch and speaker. In this paper, we introduce a novel self-supervised method to learn robust AWEs from a large-scale unlabelled speech corpus. Our model, named Correspondence Transformer Encoder (CTE), employs a teacher-student learning framework. We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space. Specifically, we feed the teacher and student encoders with different acoustic instances of the same word and pre-train the model with a word-level loss. Our experiments show that the embeddings extracted from the proposed CTE model are robust to speech variations, e.g. speakers and domains. Additionally, when evaluated on…
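The correspondence idea in the abstract can be illustrated with a toy sketch. It is not the paper's implementation. It assumes mean pooling as a stand-in for the transformer encoder and a mean-squared-error objective as a stand-in for the word-level loss; the names `mean_pool` and `correspondence_loss` and the frame values are hypothetical. It shows only that two instances of the same word with different durations map to vectors of the same size, and that the loss is small when those vectors are close.

```python
# Illustrative sketch only; NOT the CTE model from the paper.
# Assumptions: mean pooling replaces the transformer encoder, and
# mean squared error replaces the paper's word-level loss.

def mean_pool(frames):
    """Collapse a variable-length list of frame vectors into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def correspondence_loss(student_emb, teacher_emb):
    """Mean squared error between two embeddings (hypothetical word-level loss)."""
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)

# Two made-up realisations of the same word with different durations.
instance_a = [[0.0, 1.0], [2.0, 3.0]]              # 2 frames, fed to the student
instance_b = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]  # 3 frames, fed to the teacher

student_emb = mean_pool(instance_a)
teacher_emb = mean_pool(instance_b)
loss = correspondence_loss(student_emb, teacher_emb)
```

In this toy case both embeddings have two dimensions regardless of segment length. In a real system the pooling step would be a trained encoder and the loss would drive its parameters.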
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Dense Connections · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection
